Type: Bug
Resolution: Fixed
Priority: Major
Affects Version/s: master, 3.1
Component/s: Monitord, Worker Tools
I was going through a test LIGO run from the end of June. When I ran monitord in replay mode, I saw messages like these:
2011-09-20 15:15:17,204:pegasus-monitord:parse_in_file:966: WARNING: adc54502-f8d6-4cdf-aa0d-7d59d4b62348 /lfs1/work/ligo/3.1.0/H1L1-s6c_lowmass_ihope-956707143-86400.wjIT2Z/bbhloginj/inspiral_hipe_bbhloginj_cat3_veto.BBHLOGINJ_CAT_3_VETO_ID000023.000/inspiral_hipe_bbhloginj_cat3_veto.BBHLOGINJ_CAT_3_VETO-0.dag merge_ligo-lalapps_sire-1.0_PID3_ID1 cannot locate task 19 in dictionary... skipping this task
2011-09-20 15:15:17,204:pegasus-monitord:parse_in_file:989: WARNING: adc54502-f8d6-4cdf-aa0d-7d59d4b62348 /lfs1/work/ligo/3.1.0/H1L1-s6c_lowmass_ihope-956707143-86400.wjIT2Z/bbhloginj/inspiral_hipe_bbhloginj_cat3_veto.BBHLOGINJ_CAT_3_VETO_ID000023.000/inspiral_hipe_bbhloginj_cat3_veto.BBHLOGINJ_CAT_3_VETO-0.dag merge_ligo-lalapps_sire-1.0_PID3_ID1 - 2 - cannot locate task 19 in dictionary... skipping this task
2011-09-20 15:15:17,204:pegasus-monitord:parse_in_file:966: WARNING: adc54502-f8d6-4cdf-aa0d-7d59d4b62348 /lfs1/work/ligo/3.1.0/H1L1-s6c_lowmass_ihope-956707143-86400.wjIT2Z/bbhloginj/inspiral_hipe_bbhloginj_cat3_veto.BBHLOGINJ_CAT_3_VETO_ID000023.000/inspiral_hipe_bbhloginj_cat3_veto.BBHLOGINJ_CAT_3_VETO-0.dag merge_ligo-lalapps_sire-1.0_PID3_ID1 cannot locate task 20 in dictionary... skipping this task
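Tracing further, I found that the seqexec job's output is malformed: a status message ("condor_exec.exe: 1 task remaining") is interleaved into the middle of the record for task 19, splitting that record across two lines.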
corbusier:inspiral_hipe_nsbhloginj_cat4_veto.NSBHLOGINJ_CAT_4_VETO_ID000037.000 vahi$ cat merge_ligo-lalapps_sire-1.0_PID3_ID1.out
condor_exec.exe: unable to become process group leader: 1: Operation not permitted (ignoring)
[seqexec-task id=1, start="2011-06-27T20:12:07.718-04:00", duration=0.131, status=0, line=2, pid=10409, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
[seqexec-task id=2, start="2011-06-27T20:12:07.850-04:00", duration=0.143, status=0, line=4, pid=10410, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
[seqexec-task id=3, start="2011-06-27T20:12:07.993-04:00", duration=0.173, status=0, line=6, pid=10412, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
[seqexec-task id=4, start="2011-06-27T20:12:08.167-04:00", duration=0.151, status=0, line=8, pid=10414, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
[seqexec-task id=5, start="2011-06-27T20:12:08.319-04:00", duration=0.167, status=0, line=10, pid=10416, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
[seqexec-task id=6, start="2011-06-27T20:12:08.486-04:00", duration=0.141, status=0, line=12, pid=10418, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
[seqexec-task id=7, start="2011-06-27T20:12:08.628-04:00", duration=0.149, status=0, line=14, pid=10420, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
[seqexec-task id=8, start="2011-06-27T20:12:08.778-04:00", duration=0.121, status=0, line=16, pid=10422, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
[seqexec-task id=9, start="2011-06-27T20:12:08.900-04:00", duration=0.122, status=0, line=18, pid=10423, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
[seqexec-task id=10, start="2011-06-27T20:12:09.023-04:00", duration=0.105, status=0, line=20, pid=10424, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
[seqexec-task id=11, start="2011-06-27T20:12:09.128-04:00", duration=0.105, status=0, line=22, pid=10425, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
[seqexec-task id=12, start="2011-06-27T20:12:09.233-04:00", duration=0.107, status=0, line=24, pid=10426, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
[seqexec-task id=13, start="2011-06-27T20:12:09.341-04:00", duration=0.109, status=0, line=26, pid=10427, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
[seqexec-task id=14, start="2011-06-27T20:12:09.450-04:00", duration=0.107, status=0, line=28, pid=10428, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
[seqexec-task id=15, start="2011-06-27T20:12:09.558-04:00", duration=0.105, status=0, line=30, pid=10429, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
[seqexec-task id=16, start="2011-06-27T20:12:09.663-04:00", duration=0.106, status=0, line=32, pid=10430, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
[seqexec-task id=17, start="2011-06-27T20:12:09.769-04:00", duration=0.108, status=0, line=34, pid=10431, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
[seqexec-task id=18, start="2011-06-27T20:12:09.878-04:00", duration=0.102, status=0, line=36, pid=10432, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
[seqexec-task id=19, start="2011-06-27T20:12:09.980-04:00", duration=0.108, status=0, line=38, pid=10433, app="/home/dbrown/projectcondor_exec.exe: 1 task remaining
s/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
[seqexec-task id=20, start="2011-06-27T20:12:10.090-04:00", duration=0.105, status=0, line=40, pid=10434, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
[condor_exec.exe-summary stat="ok", lines=40, tasks=20, succeeded=20, failed=0, extra=0, duration=2.476, start="2011-06-27T20:12:07.718-04:00", pid=10407, app="condor_exec.exe"]
We need to fix this in seqexec and, if possible, also make the parser in monitord more resilient to malformed records.
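As a rough illustration of what "more resilient" could mean on the monitord side, here is a minimal sketch. It is not monitord's actual parser: the function name parse_seqexec_tasks and both regular expressions are hypothetical. It assumes the whole .out file is read into one string, and it scans for each record from "[seqexec-task" to the closing '"]' instead of parsing line by line, so a stray message that lands inside the quoted app value no longer hides the task id, status, or duration.

```python
import re

# Hypothetical patterns, not taken from monitord.
# A record runs from "[seqexec-task" to the first '"]', possibly across newlines.
RECORD_RE = re.compile(r'\[seqexec-task\s+(.*?")\]', re.DOTALL)
# key=value pairs; a value is either double-quoted or a bare token.
FIELD_RE = re.compile(r'(\w+)=(?:"([^"]*)"|([^,\s"]+))')


def parse_seqexec_tasks(text):
    """Return {task_id: fields} for every seqexec-task record found in text."""
    tasks = {}
    for record in RECORD_RE.finditer(text):
        fields = {}
        for m in FIELD_RE.finditer(record.group(1)):
            key, quoted, bare = m.groups()
            fields[key] = quoted if quoted is not None else bare
        try:
            task_id = int(fields["id"])
            fields["status"] = int(fields["status"])
            fields["duration"] = float(fields["duration"])
        except (KeyError, ValueError):
            # Record is too damaged to use; skip it rather than abort.
            continue
        tasks[task_id] = fields
    return tasks
```

With output shaped like the file above, this would still return entries for tasks 19 and 20; only the app string of task 19 would carry the interleaved text. It does not help when the stray message lands inside the "[seqexec-task" prefix or an unquoted field, which is why the fix in seqexec is still needed.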