Pegasus / PM-486

monitord trips over malformed seqexec output

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version/s: master, 3.1
    • Component/s: Monitord, Worker Tools

      I was going through a test LIGO run from the end of June. When I ran monitord in replay mode, I saw warnings like these:

      1-09-20 15:15:17,204:pegasus-monitord:parse_in_file:966: WARNING: adc54502-f8d6-4cdf-aa0d-7d59d4b62348 /lfs1/work/ligo/3.1.0/H1L1-s6c_lowmass_ihope-956707143-86400.wjIT2Z/bbhloginj/inspiral_hipe_bbhloginj_cat3_veto.BBHLOGINJ_CAT_3_VETO_ID000023.000/inspiral_hipe_bbhloginj_cat3_veto.BBHLOGINJ_CAT_3_VETO-0.dag merge_ligo-lalapps_sire-1.0_PID3_ID1 cannot locate task 19 in dictionary... skipping this task
      2011-09-20 15:15:17,204:pegasus-monitord:parse_in_file:989: WARNING: adc54502-f8d6-4cdf-aa0d-7d59d4b62348 /lfs1/work/ligo/3.1.0/H1L1-s6c_lowmass_ihope-956707143-86400.wjIT2Z/bbhloginj/inspiral_hipe_bbhloginj_cat3_veto.BBHLOGINJ_CAT_3_VETO_ID000023.000/inspiral_hipe_bbhloginj_cat3_veto.BBHLOGINJ_CAT_3_VETO-0.dag merge_ligo-lalapps_sire-1.0_PID3_ID1 - 2 - cannot locate task 19 in dictionary... skipping this task
      2011-09-20 15:15:17,204:pegasus-monitord:parse_in_file:966: WARNING: adc54502-f8d6-4cdf-aa0d-7d59d4b62348 /lfs1/work/ligo/3.1.0/H1L1-s6c_lowmass_ihope-956707143-86400.wjIT2Z/bbhloginj/inspiral_hipe_bbhloginj_cat3_veto.BBHLOGINJ_CAT_3_VETO_ID000023.000/inspiral_hipe_bbhloginj_cat3_veto.BBHLOGINJ_CAT_3_VETO-0.dag merge_ligo-lalapps_sire-1.0_PID3_ID1 cannot locate task 20 in dictionary... skipping this task

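      Tracing this further, I found that the seqexec job's output is malformed. In the sample below, the message "condor_exec.exe: 1 task remaining" is written into the middle of the record for task 19, which splits that record across two lines.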

      corbusier:inspiral_hipe_nsbhloginj_cat4_veto.NSBHLOGINJ_CAT_4_VETO_ID000037.000 vahi$ cat merge_ligo-lalapps_sire-1.0_PID3_ID1.out
      condor_exec.exe: unable to become process group leader: 1: Operation not permitted (ignoring)
      [seqexec-task id=1, start="2011-06-27T20:12:07.718-04:00", duration=0.131, status=0, line=2, pid=10409, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
      [seqexec-task id=2, start="2011-06-27T20:12:07.850-04:00", duration=0.143, status=0, line=4, pid=10410, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
      [seqexec-task id=3, start="2011-06-27T20:12:07.993-04:00", duration=0.173, status=0, line=6, pid=10412, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
      [seqexec-task id=4, start="2011-06-27T20:12:08.167-04:00", duration=0.151, status=0, line=8, pid=10414, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
      [seqexec-task id=5, start="2011-06-27T20:12:08.319-04:00", duration=0.167, status=0, line=10, pid=10416, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
      [seqexec-task id=6, start="2011-06-27T20:12:08.486-04:00", duration=0.141, status=0, line=12, pid=10418, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
      [seqexec-task id=7, start="2011-06-27T20:12:08.628-04:00", duration=0.149, status=0, line=14, pid=10420, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
      [seqexec-task id=8, start="2011-06-27T20:12:08.778-04:00", duration=0.121, status=0, line=16, pid=10422, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
      [seqexec-task id=9, start="2011-06-27T20:12:08.900-04:00", duration=0.122, status=0, line=18, pid=10423, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
      [seqexec-task id=10, start="2011-06-27T20:12:09.023-04:00", duration=0.105, status=0, line=20, pid=10424, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
      [seqexec-task id=11, start="2011-06-27T20:12:09.128-04:00", duration=0.105, status=0, line=22, pid=10425, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
      [seqexec-task id=12, start="2011-06-27T20:12:09.233-04:00", duration=0.107, status=0, line=24, pid=10426, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
      [seqexec-task id=13, start="2011-06-27T20:12:09.341-04:00", duration=0.109, status=0, line=26, pid=10427, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
      [seqexec-task id=14, start="2011-06-27T20:12:09.450-04:00", duration=0.107, status=0, line=28, pid=10428, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
      [seqexec-task id=15, start="2011-06-27T20:12:09.558-04:00", duration=0.105, status=0, line=30, pid=10429, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
      [seqexec-task id=16, start="2011-06-27T20:12:09.663-04:00", duration=0.106, status=0, line=32, pid=10430, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
      [seqexec-task id=17, start="2011-06-27T20:12:09.769-04:00", duration=0.108, status=0, line=34, pid=10431, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
      [seqexec-task id=18, start="2011-06-27T20:12:09.878-04:00", duration=0.102, status=0, line=36, pid=10432, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
      [seqexec-task id=19, start="2011-06-27T20:12:09.980-04:00", duration=0.108, status=0, line=38, pid=10433, app="/home/dbrown/projectcondor_exec.exe: 1 task remaining
      s/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
      [seqexec-task id=20, start="2011-06-27T20:12:10.090-04:00", duration=0.105, status=0, line=40, pid=10434, app="/home/dbrown/projects/cbc/dax/ihope-dax3.2/test3/956707143-956793543/nsbhloginj/../executables/lalapps_sire"]
      [condor_exec.exe-summary stat="ok", lines=40, tasks=20, succeeded=20, failed=0, extra=0, duration=2.476, start="2011-06-27T20:12:07.718-04:00", pid=10407, app="condor_exec.exe"]

      We need to fix this in seqexec and, if possible, also make the parser in monitord more resilient to this kind of output.
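
      As a rough illustration of what "more resilient" could mean on the monitord side, here is a minimal sketch. It is not monitord's actual parser: the names parse_task_records, FIELD_RE and RECORD_START are hypothetical, and the only assumption is the record layout visible in the output above. Each key=value pair is extracted independently, so a record whose app value was broken by an interleaved message still yields its id, start, duration and status instead of being dropped.

```python
import re

# Hypothetical helper names; only the record layout is taken from the ticket.
# A value is either a closed "quoted string" or a bare token (number, word).
FIELD_RE = re.compile(r'(\w+)=(?:"([^"]*)"|([^,\s"\]]+))')
RECORD_START = "[seqexec-task "


def parse_task_records(lines):
    """Yield (fields, complete) for each seqexec-task record in lines.

    fields   -- dict of every key=value pair that could be read intact
    complete -- False if the line does not end with ']', i.e. something
                else was written into the middle of the record
    """
    for raw in lines:
        line = raw.strip()
        pos = line.find(RECORD_START)
        if pos < 0:
            # Unrelated message, summary line, or the tail of a split record.
            continue
        body = line[pos + len(RECORD_START):]
        fields = {}
        for match in FIELD_RE.finditer(body):
            key, quoted, bare = match.groups()
            fields[key] = quoted if quoted is not None else bare
        if "id" not in fields:
            # Without a task id the record cannot be attributed to a task.
            continue
        yield fields, line.endswith("]")


if __name__ == "__main__":
    # Paths shortened for readability; the split mirrors the task 19 record.
    sample = [
        '[seqexec-task id=18, start="2011-06-27T20:12:09.878-04:00", '
        'duration=0.102, status=0, line=36, pid=10432, app="/x/lalapps_sire"]',
        '[seqexec-task id=19, start="2011-06-27T20:12:09.980-04:00", '
        'duration=0.108, status=0, line=38, pid=10433, '
        'app="/x/projectcondor_exec.exe: 1 task remaining',
        's/lalapps_sire"]',
    ]
    for fields, complete in parse_task_records(sample):
        print(fields["id"], fields["status"], "complete" if complete else "damaged")
```

      With this approach the caller can still record task 19 (flagged as damaged, without an app value) rather than reporting that it cannot be located. Whether a partially read record should be stored or only logged is a design decision for monitord; the sketch just shows that the leading fields remain recoverable.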

            Assignee: Fabio Silva (fabio)
            Reporter: Karan Vahi (vahi)