monitord failed for one of the LIGO runs with this message:
LL_DATA_ID000003.000/lalapps_inspiral_ID002789.err.001, continuing...
2014-09-13 20:52:28,043:job.py:extract_job_info:395: WARNING: unable to read error file: /usr1/cbiwer/log/H1L1V1-s6d_ihope_ssipe-968543943-3078144.4iwM3F/full_data/inspiral_hipe_full_data.FULL_DATA_ID000003.000/lalapps_inspiral_ID004944.err.000, continuing...
2014-09-13 20:52:30,329:job.py:extract_job_info:395: WARNING: unable to read error file: /usr1/cbiwer/log/H1L1V1-s6d_ihope_ssipe-968543943-3078144.4iwM3F/nsbhloginj/inspiral_hipe_nsbhloginj.NSBHLOGINJ_ID000010.000/lalapps_inspiral_ID000477.err.001, continuing...
2014-09-13 20:52:40,415:job.py:extract_job_info:395: WARNING: unable to read error file: /usr1/cbiwer/log/H1L1V1-s6d_ihope_ssipe-968543943-3078144.4iwM3F/bbhlininj/inspiral_hipe_bbhlininj.BBHLININJ_ID000006.000/lalapps_inspiral_ID004579.err.000, continuing...
2014-09-13 20:52:40,451:job.py:extract_job_info:395: WARNING: unable to read error file: /usr1/cbiwer/log/H1L1V1-s6d_ihope_ssipe-968543943-3078144.4iwM3F/bbhlininj/inspiral_hipe_bbhlininj.BBHLININJ_ID000006.000/lalapps_inspiral_ID000920.err.001, continuing...
Traceback (most recent call last):
  File "/usr/bin/pegasus-monitord", line 1355, in <module>
    process_output = process_dagman_out(workflow_entry.wf, workflow_entry.ml_buffer[0:ml_pos])
  File "/usr/bin/pegasus-monitord", line 752, in process_dagman_out
    add(wf, my_jobid, "JOB_FAILURE", sched_id=my_sched_id, status=my_jobstatus)
  File "/usr/bin/pegasus-monitord", line 590, in add
    wf.update_job_state(jobid, sched_id, my_job_submit_seq, event, status, my_time)
  File "/usr/lib64/python2.6/site-packages/Pegasus/monitoring/workflow.py", line 1981, in update_job_state
    self.parse_job_output(my_job, job_state)
  File "/usr/lib64/python2.6/site-packages/Pegasus/monitoring/workflow.py", line 1695, in parse_job_output
    my_pegasuslite_ec = self.get_pegasuslite_exitcode( my_job );
  File "/usr/lib64/python2.6/site-packages/Pegasus/monitoring/workflow.py", line 1832, in get_pegasuslite_exitcode
    f = open(errfile)
IOError: [Errno 2] No such file or directory: '/usr1/cbiwer/log/H1L1V1-s6d_ihope_ssipe-968543943-3078144.4iwM3F/bnsloginj/inspiral_hipe_bnsloginj.BNSLOGINJ_ID000008.000/lalapps_inspiral_ID000847.err'
The above should not happen: all jobs are launched via kickstart, so monitord should be looking for a .err.00X file rather than a plain .err file. Investigation of the code reveals a race condition. monitord tries to parse the .out and .err files as soon as a JOB_FAILURE or JOB_SUCCESS event arrives, instead of waiting for the POST_SCRIPT_SUCCESS or POST_SCRIPT_FAILURE event when a postscript is associated with the job.
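
One way to picture the fix is a small gate that decides, per event, whether the job's output files are ready to parse. This is only a sketch and not the actual pegasus-monitord code: the function name should_parse_output and the has_postscript flag are hypothetical. It also assumes that the postscript is what may still be touching the .out/.err files after the job itself ends.

```python
# Hypothetical sketch; names below are illustrative, not taken from pegasus-monitord.
JOB_TERMINAL_EVENTS = ("JOB_SUCCESS", "JOB_FAILURE")
POSTSCRIPT_TERMINAL_EVENTS = ("POST_SCRIPT_SUCCESS", "POST_SCRIPT_FAILURE")


def should_parse_output(event, has_postscript):
    """Return True if the job's .out/.err files are safe to parse at this event."""
    if has_postscript:
        # Assumption: the postscript may still be working on the files when the
        # job event arrives, so defer parsing until the postscript has finished.
        return event in POSTSCRIPT_TERMINAL_EVENTS
    # No postscript is associated with the job, so parse as soon as the job ends.
    return event in JOB_TERMINAL_EVENTS
```

With a gate like this, a JOB_FAILURE for a job that has a postscript would not trigger parsing. Parsing would happen once, at the later POST_SCRIPT_* event.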