change how monitord parses job output and error files

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: 4.5.0, 4.4.1
    • Affects Version/s: master, 4.4.0
    • Component/s: Monitord
    • Labels: None

      monitord failed for one of the LIGO runs with this message:

      LL_DATA_ID000003.000/lalapps_inspiral_ID002789.err.001, continuing...
      2014-09-13 20:52:28,043:job.py:extract_job_info:395: WARNING: unable to read error file: /usr1/cbiwer/log/H1L1V1-s6d_ihope_ssipe-968543943-3078144.4iwM3F/full_data/inspiral_hipe_full_data.FULL_DATA_ID000003.000/lalapps_inspiral_ID004944.err.000, continuing...
      2014-09-13 20:52:30,329:job.py:extract_job_info:395: WARNING: unable to read error file: /usr1/cbiwer/log/H1L1V1-s6d_ihope_ssipe-968543943-3078144.4iwM3F/nsbhloginj/inspiral_hipe_nsbhloginj.NSBHLOGINJ_ID000010.000/lalapps_inspiral_ID000477.err.001, continuing...
      2014-09-13 20:52:40,415:job.py:extract_job_info:395: WARNING: unable to read error file: /usr1/cbiwer/log/H1L1V1-s6d_ihope_ssipe-968543943-3078144.4iwM3F/bbhlininj/inspiral_hipe_bbhlininj.BBHLININJ_ID000006.000/lalapps_inspiral_ID004579.err.000, continuing...
      2014-09-13 20:52:40,451:job.py:extract_job_info:395: WARNING: unable to read error file: /usr1/cbiwer/log/H1L1V1-s6d_ihope_ssipe-968543943-3078144.4iwM3F/bbhlininj/inspiral_hipe_bbhlininj.BBHLININJ_ID000006.000/lalapps_inspiral_ID000920.err.001, continuing...
      Traceback (most recent call last):
        File "/usr/bin/pegasus-monitord", line 1355, in <module>
          process_output = process_dagman_out(workflow_entry.wf, workflow_entry.ml_buffer[0:ml_pos])
        File "/usr/bin/pegasus-monitord", line 752, in process_dagman_out
          add(wf, my_jobid, "JOB_FAILURE", sched_id=my_sched_id, status=my_jobstatus)
        File "/usr/bin/pegasus-monitord", line 590, in add
          wf.update_job_state(jobid, sched_id, my_job_submit_seq, event, status, my_time)
        File "/usr/lib64/python2.6/site-packages/Pegasus/monitoring/workflow.py", line 1981, in update_job_state
          self.parse_job_output(my_job, job_state)
        File "/usr/lib64/python2.6/site-packages/Pegasus/monitoring/workflow.py", line 1695, in parse_job_output
          my_pegasuslite_ec = self.get_pegasuslite_exitcode( my_job );
        File "/usr/lib64/python2.6/site-packages/Pegasus/monitoring/workflow.py", line 1832, in get_pegasuslite_exitcode
          f = open(errfile)
      IOError: [Errno 2] No such file or directory: '/usr1/cbiwer/log/H1L1V1-s6d_ihope_ssipe-968543943-3078144.4iwM3F/bnsloginj/inspiral_hipe_bnsloginj.BNSLOGINJ_ID000008.000/lalapps_inspiral_ID000847.err'

      This should not happen: all jobs are launched via kickstart, so monitord should be looking for an .err.00X file rather than a plain .err file. Investigation of the code reveals a race condition. When a postscript is associated with a job, monitord tries to parse the .out and .err files as soon as a JOB_FAILURE or JOB_SUCCESS event arrives, instead of waiting for the POST_SCRIPT_SUCCESS or POST_SCRIPT_FAILURE message.
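The intended ordering can be illustrated with a minimal sketch. It is not the actual monitord code: the function names `should_parse_output` and `rotated_name` are hypothetical, and the sketch assumes that the .out.00X/.err.00X files are only guaranteed to exist once the postscript has finished.

```python
# Hypothetical sketch: these names are illustrative, not the Pegasus monitord API.
JOB_TERMINAL = {"JOB_SUCCESS", "JOB_FAILURE"}
POST_TERMINAL = {"POST_SCRIPT_SUCCESS", "POST_SCRIPT_FAILURE"}


def should_parse_output(event, has_postscript):
    """Return True when the job's .out/.err files can be parsed safely."""
    if has_postscript:
        # Assumption: the rotated .out.00X/.err.00X files may not be in
        # place until the postscript has finished, so wait for its event.
        return event in POST_TERMINAL
    # Without a postscript, the job's own terminal event is the last one.
    return event in JOB_TERMINAL


def rotated_name(basename, suffix, retry):
    """Build a rotated file name such as 'job.err.001' (3-digit retry count)."""
    return "%s.%s.%03d" % (basename, suffix, retry)
```

With this gating, a JOB_FAILURE event for a job that has a postscript would not trigger parsing; parsing would happen only when the matching POST_SCRIPT_FAILURE arrives.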

            Assignee:
            Rajiv Mayani
            Reporter:
            Duncan Brown
            Watchers:
            2
