  1. Pegasus
  2. PM-793

change how monitord parses job output and error files

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: master, 4.4.0
    • Fix Version/s: 4.5.0, 4.4.1
    • Component/s: Monitord
    • Labels:
      None

      Description

      monitord failed for one of the LIGO runs with this message:

      LL_DATA_ID000003.000/lalapps_inspiral_ID002789.err.001, continuing...
      2014-09-13 20:52:28,043:job.py:extract_job_info:395: WARNING: unable to read error file: /usr1/cbiwer/log/H1L1V1-s6d_ihope_ssipe-968543943-3078144.4iwM3F/full_data/inspiral_hipe_full_data.FULL_DATA_ID000003.000/lalapps_inspiral_ID004944.err.000, continuing...
      2014-09-13 20:52:30,329:job.py:extract_job_info:395: WARNING: unable to read error file: /usr1/cbiwer/log/H1L1V1-s6d_ihope_ssipe-968543943-3078144.4iwM3F/nsbhloginj/inspiral_hipe_nsbhloginj.NSBHLOGINJ_ID000010.000/lalapps_inspiral_ID000477.err.001, continuing...
      2014-09-13 20:52:40,415:job.py:extract_job_info:395: WARNING: unable to read error file: /usr1/cbiwer/log/H1L1V1-s6d_ihope_ssipe-968543943-3078144.4iwM3F/bbhlininj/inspiral_hipe_bbhlininj.BBHLININJ_ID000006.000/lalapps_inspiral_ID004579.err.000, continuing...
      2014-09-13 20:52:40,451:job.py:extract_job_info:395: WARNING: unable to read error file: /usr1/cbiwer/log/H1L1V1-s6d_ihope_ssipe-968543943-3078144.4iwM3F/bbhlininj/inspiral_hipe_bbhlininj.BBHLININJ_ID000006.000/lalapps_inspiral_ID000920.err.001, continuing...
      Traceback (most recent call last):
        File "/usr/bin/pegasus-monitord", line 1355, in <module>
          process_output = process_dagman_out(workflow_entry.wf, workflow_entry.ml_buffer[0:ml_pos])
        File "/usr/bin/pegasus-monitord", line 752, in process_dagman_out
          add(wf, my_jobid, "JOB_FAILURE", sched_id=my_sched_id, status=my_jobstatus)
        File "/usr/bin/pegasus-monitord", line 590, in add
          wf.update_job_state(jobid, sched_id, my_job_submit_seq, event, status, my_time)
        File "/usr/lib64/python2.6/site-packages/Pegasus/monitoring/workflow.py", line 1981, in update_job_state
          self.parse_job_output(my_job, job_state)
        File "/usr/lib64/python2.6/site-packages/Pegasus/monitoring/workflow.py", line 1695, in parse_job_output
          my_pegasuslite_ec = self.get_pegasuslite_exitcode( my_job );
        File "/usr/lib64/python2.6/site-packages/Pegasus/monitoring/workflow.py", line 1832, in get_pegasuslite_exitcode
          f = open(errfile)
      IOError: [Errno 2] No such file or directory: '/usr1/cbiwer/log/H1L1V1-s6d_ihope_ssipe-968543943-3078144.4iwM3F/bnsloginj/inspiral_hipe_bnsloginj.BNSLOGINJ_ID000008.000/lalapps_inspiral_ID000847.err'

      The above should not happen: all jobs are launched via kickstart, so monitord should be looking for an .err.00X file rather than a plain .err file. Investigation of the code reveals a race condition. monitord tries to parse the .out and .err files as soon as a JOB_FAILURE or JOB_SUCCESS event happens, instead of waiting for the POST_SCRIPT_SUCCESS or POST_SCRIPT_FAILURE message when a postscript is associated with the job.
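
The intended behaviour described above could be sketched roughly as follows. This is only an illustration of the event-selection rule, not the actual monitord change: the function name `should_parse_output` and the `has_postscript` flag are hypothetical, and only the event names are taken from this report.

```python
# Hypothetical sketch: decide on which DAGMan event the job's .out/.err
# files should be parsed. None of these names are monitord's real API;
# only the event strings come from the report above.
JOB_TERMINAL_EVENTS = {"JOB_SUCCESS", "JOB_FAILURE"}
POST_SCRIPT_TERMINAL_EVENTS = {"POST_SCRIPT_SUCCESS", "POST_SCRIPT_FAILURE"}


def should_parse_output(event, has_postscript):
    """Return True if the output and error files should be parsed now."""
    if has_postscript:
        # Assumption: the .out.00X / .err.00X files are only guaranteed
        # to be in place once the postscript has finished, so defer
        # parsing until the postscript reports its result.
        return event in POST_SCRIPT_TERMINAL_EVENTS
    # Without a postscript, the job's own terminal event is the last signal.
    return event in JOB_TERMINAL_EVENTS
```

With a rule like this, a JOB_FAILURE for a job that has a postscript would no longer trigger a read of the error file, which is the point at which the traceback above was raised.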

            People

            • Assignee: Rajiv Mayani (mayani)
            • Reporter: Duncan Brown (dbrown)