Pegasus / PM-1217

monitord exits prematurely when in dagman recovery mode


    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: master, 4.8.0, 4.7.5
    • Affects Version/s: master, 4.7.4
    • Component/s: Monitord
    • None

      If dagman exits uncleanly, monitord can exit prematurely when the workflow is resubmitted. This leaves the monitoring database in an inconsistent state and produces cryptic errors, with monitord failing on messages such as a key that cannot be found (see the KeyError in the log below).

      corbusier:run0007 vahi$ more monitord.log.002
      2017-08-15 18:14:07,801:INFO:pegasus-monitord(68): pegasus-monitord starting - pid 36662
      2017-08-15 18:14:07,811:INFO:pegasus-monitord(319): Final Command line options are: ['-N', 'blackdiamond-0.dag.dagman.out']
      2017-08-15 18:14:07,864:INFO:Pegasus.monitoring.event_output(315): output type=DB namespace=stampede. name=sqlite:////Volumes/Work/lfs1/work/monitord/bugs/PM-1209/condor-blackdiamond-condorio/dags/vahi/pegasus/blackdiamond/run0007/blackdiamond-0.stampede.db
      2017-08-15 18:14:07,907:INFO:Pegasus.monitoring.event_output(315): output type=DB namespace=dashboard. name=sqlite:////Users/vahi/.pegasus/workflow.db
      2017-08-15 18:14:07,907:INFO:pegasus-monitord(1010): monitord started in fast start mode
      2017-08-15 18:14:07,920:INFO:Pegasus.monitoring.workflow(948): Appending to exisitng jobstate.log replay_mode 0 previous_processed_line 0
      2017-08-15 18:14:07,921:INFO:Pegasus.monitoring.notifications(394): loading notifications from /Volumes/Work/lfs1/work/monitord/bugs/PM-1209/condor-blackdiamond-condorio/dags/vahi/pegasus/blackdiamond/run0007/blackdiamond-0.notify
      2017-08-15 18:14:07,923:INFO:pegasus-monitord(813): Enabling DAGMAN RECOVERY MODE
      2017-08-15 18:14:07,927:INFO:pegasus-monitord(637): DONE with DAGMAN RECOVERY MODE
      Traceback (most recent call last):
        File "/Volumes/Work/lfs1/software/install/pegasus/pegasus-4.8.0dev/bin/pegasus-monitord", line 1257, in <module>
          process_output = process_dagman_out(workflow_entry.wf, workflow_entry.ml_buffer[0:ml_pos])
        File "/Volumes/Work/lfs1/software/install/pegasus/pegasus-4.8.0dev/bin/pegasus-monitord", line 659, in process_dagman_out
          add(wf, my_expr.group(1), "DAGMAN_SUBMIT")
        File "/Volumes/Work/lfs1/software/install/pegasus/pegasus-4.8.0dev/bin/pegasus-monitord", line 500, in add
          my_job_submit_seq = wf.add_job(jobid, event)
        File "/Volumes/Work/lfs1/software/install/pegasus/pegasus-4.8.0dev/lib/pegasus/python/Pegasus/monitoring/workflow.py", line 2063, in add_job
          job_submit_dir = self.determine_job_submit_directory(jobid, self._job_info[jobid][0])
      KeyError: 'preprocess_j1'

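      This happens because the .dag file is never parsed. monitord assumes it is handling a rescue dag case, so it skips lines in dagman.out and starts from the monitord_dagman_out_sequence entry in the monitord.info file. The job table (_job_info in the traceback above) therefore has no entry for preprocess_j1 when its DAGMAN_SUBMIT event is processed, and the lookup raises the KeyError.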
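
      The failure mode can be illustrated with a small, simplified model. This is only a sketch and not the actual monitord code or the fix that was applied: the `Workflow` class, `parse_dag`, `process_events`, and the `skip_to` parameter are hypothetical names, and the only real syntax assumed is the DAGMan `JOB <name> <submit file>` line. It shows that looking up a job before the .dag file has been parsed raises a KeyError, and that parsing the .dag file before replaying events avoids it even when earlier events are skipped.

```python
# Hypothetical, simplified model of the failure; not the actual monitord code.

def parse_dag(dag_lines):
    """Build a job table from 'JOB <name> <submit file>' lines of a .dag file."""
    job_info = {}
    for line in dag_lines:
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "JOB":
            job_info[parts[1]] = parts[2]
    return job_info


class Workflow:
    def __init__(self):
        self.job_info = {}       # filled only when the .dag file is parsed
        self.dag_parsed = False

    def load_dag(self, dag_lines):
        self.job_info = parse_dag(dag_lines)
        self.dag_parsed = True

    def add_job(self, jobid):
        # Mirrors the lookup that fails in the traceback: an unknown
        # job id raises KeyError when the job table was never populated.
        return self.job_info[jobid]


def process_events(wf, events, dag_lines, skip_to=0):
    """Process submit events, optionally skipping ones already seen.

    The guard parses the .dag file first, even when earlier events are skipped.
    """
    if not wf.dag_parsed:
        wf.load_dag(dag_lines)
    return [wf.add_job(jobid) for jobid in events[skip_to:]]
```

      With the guard in place, skipping ahead in the event stream no longer depends on the job table having been built during an earlier run.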

            Assignee: Karan Vahi (vahi)
            Reporter: Karan Vahi (vahi)
            Watchers: 2

              Created:
              Updated:
              Resolved: