If DAGMan exits uncleanly, monitord can exit prematurely when the workflow is resubmitted. This leaves the monitoring database in an inconsistent state and produces cryptic errors, where monitord fails with messages such as "unable to find key".
corbusier:run0007 vahi$ more monitord.log.002
2017-08-15 18:14:07,801:INFO:pegasus-monitord(68): pegasus-monitord starting - pid 36662
2017-08-15 18:14:07,811:INFO:pegasus-monitord(319): Final Command line options are: ['-N', 'blackdiamond-0.dag.dagman.out']
2017-08-15 18:14:07,864:INFO:Pegasus.monitoring.event_output(315): output type=DB namespace=stampede. name=sqlite:////Volumes/Work/lfs1/work/monitord/bugs/PM-1209/condor-blackdiamond-condorio/dags/vahi/pegasus/blackdiamond/run0007/blackdiamond-0.stampede.db
2017-08-15 18:14:07,907:INFO:Pegasus.monitoring.event_output(315): output type=DB namespace=dashboard. name=sqlite:////Users/vahi/.pegasus/workflow.db
2017-08-15 18:14:07,907:INFO:pegasus-monitord(1010): monitord started in fast start mode
2017-08-15 18:14:07,920:INFO:Pegasus.monitoring.workflow(948): Appending to exisitng jobstate.log replay_mode 0 previous_processed_line 0
2017-08-15 18:14:07,921:INFO:Pegasus.monitoring.notifications(394): loading notifications from /Volumes/Work/lfs1/work/monitord/bugs/PM-1209/condor-blackdiamond-condorio/dags/vahi/pegasus/blackdiamond/run0007/blackdiamond-0.notify
2017-08-15 18:14:07,923:INFO:pegasus-monitord(813): Enabling DAGMAN RECOVERY MODE
2017-08-15 18:14:07,927:INFO:pegasus-monitord(637): DONE with DAGMAN RECOVERY MODE
Traceback (most recent call last):
File "/Volumes/Work/lfs1/software/install/pegasus/pegasus-4.8.0dev/bin/pegasus-monitord", line 1257, in <module>
process_output = process_dagman_out(workflow_entry.wf, workflow_entry.ml_buffer[0:ml_pos])
File "/Volumes/Work/lfs1/software/install/pegasus/pegasus-4.8.0dev/bin/pegasus-monitord", line 659, in process_dagman_out
add(wf, my_expr.group(1), "DAGMAN_SUBMIT")
File "/Volumes/Work/lfs1/software/install/pegasus/pegasus-4.8.0dev/bin/pegasus-monitord", line 500, in add
my_job_submit_seq = wf.add_job(jobid, event)
File "/Volumes/Work/lfs1/software/install/pegasus/pegasus-4.8.0dev/lib/pegasus/python/Pegasus/monitoring/workflow.py", line 2063, in add_job
job_submit_dir = self.determine_job_submit_directory(jobid, self._job_info[jobid][0])
KeyError: 'preprocess_j1'
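This happens because the .dag file is not parsed: monitord assumes it is handling a rescue DAG case, so it skips lines in dagman.out and starts from the monitord_dagman_out_sequence entry in the monitord.info file. Judging by the traceback, the job table is therefore still empty when the first DAGMAN_SUBMIT event is processed, and the lookup of preprocess_j1 in add_job fails.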
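The failure mode can be sketched in a few lines of Python. This is a toy model, not the monitord code: the class SketchWorkflow, the method parse_dag_text, the function replay and the sample DAG text are all hypothetical. Only the names add_job and _job_info, and the job id preprocess_j1, come from the traceback above. It assumes the job table is filled from the JOB lines of the .dag file.

```python
class SketchWorkflow:
    """Toy stand-in for the monitord workflow object (hypothetical)."""

    def __init__(self):
        self._job_info = {}        # job id -> tuple starting with the submit file
        self._job_submit_seq = {}  # job id -> number of submissions seen so far

    def parse_dag_text(self, dag_text):
        """Fill the job table from 'JOB <name> <submit file>' lines."""
        for line in dag_text.splitlines():
            fields = line.split()
            if len(fields) >= 3 and fields[0].upper() == "JOB":
                self._job_info[fields[1]] = (fields[2],)

    def add_job(self, jobid, event):
        """Record a submit event; raises KeyError if the job is unknown."""
        submit_file = self._job_info[jobid][0]
        seq = self._job_submit_seq.get(jobid, 0) + 1
        self._job_submit_seq[jobid] = seq
        return seq, submit_file


# Made-up DAG content for illustration only.
DAG_TEXT = """\
JOB preprocess_j1 preprocess_j1.sub
JOB findrange_j2 findrange_j2.sub
"""


def replay(parse_dag_first):
    """Process one DAGMAN_SUBMIT event, with or without parsing the DAG."""
    wf = SketchWorkflow()
    if parse_dag_first:
        wf.parse_dag_text(DAG_TEXT)
    try:
        return wf.add_job("preprocess_j1", "DAGMAN_SUBMIT")
    except KeyError as err:
        return "KeyError: %s" % err
```

In this sketch, replay(True) succeeds, while replay(False) reproduces the same kind of KeyError seen in monitord.log.002, because the event is processed before the job table has been populated.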
blocks PM-1171 Monitord regularly produces empty stderr and stdout files (Resolved)