Pegasus / PM-1217

monitord exits prematurely when in DAGMan recovery mode

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: master, 4.7.4
    • Fix Version/s: master, 4.8.0, 4.7.5
    • Component/s: Monitord
    • Labels:
      None

      Description

      If DAGMan exits uncleanly, monitord can exit prematurely when the workflow is resubmitted. This leaves the monitoring database in an inconsistent state and produces cryptic errors, with monitord failing on messages such as being unable to find a key.

      corbusier:run0007 vahi$ more monitord.log.002
      2017-08-15 18:14:07,801:INFO:pegasus-monitord(68): pegasus-monitord starting - pid 36662
      2017-08-15 18:14:07,811:INFO:pegasus-monitord(319): Final Command line options are: ['-N', 'blackdiamond-0.dag.dagman.out']
      2017-08-15 18:14:07,864:INFO:Pegasus.monitoring.event_output(315): output type=DB namespace=stampede. name=sqlite:////Volumes/Work/lfs1/work/monitord/bugs/PM-1209/condor-blackdiamond-condorio/dags/vahi/pegasus/blackdiamond/run0007/blackdiamond-0.stampede.db
      2017-08-15 18:14:07,907:INFO:Pegasus.monitoring.event_output(315): output type=DB namespace=dashboard. name=sqlite:////Users/vahi/.pegasus/workflow.db
      2017-08-15 18:14:07,907:INFO:pegasus-monitord(1010): monitord started in fast start mode
      2017-08-15 18:14:07,920:INFO:Pegasus.monitoring.workflow(948): Appending to exisitng jobstate.log replay_mode 0 previous_processed_line 0
      2017-08-15 18:14:07,921:INFO:Pegasus.monitoring.notifications(394): loading notifications from /Volumes/Work/lfs1/work/monitord/bugs/PM-1209/condor-blackdiamond-condorio/dags/vahi/pegasus/blackdiamond/run0007/blackdiamond-0.notify
      2017-08-15 18:14:07,923:INFO:pegasus-monitord(813): Enabling DAGMAN RECOVERY MODE
      2017-08-15 18:14:07,927:INFO:pegasus-monitord(637): DONE with DAGMAN RECOVERY MODE
      Traceback (most recent call last):
        File "/Volumes/Work/lfs1/software/install/pegasus/pegasus-4.8.0dev/bin/pegasus-monitord", line 1257, in <module>
          process_output = process_dagman_out(workflow_entry.wf, workflow_entry.ml_buffer[0:ml_pos])
        File "/Volumes/Work/lfs1/software/install/pegasus/pegasus-4.8.0dev/bin/pegasus-monitord", line 659, in process_dagman_out
          add(wf, my_expr.group(1), "DAGMAN_SUBMIT")
        File "/Volumes/Work/lfs1/software/install/pegasus/pegasus-4.8.0dev/bin/pegasus-monitord", line 500, in add
          my_job_submit_seq = wf.add_job(jobid, event)
        File "/Volumes/Work/lfs1/software/install/pegasus/pegasus-4.8.0dev/lib/pegasus/python/Pegasus/monitoring/workflow.py", line 2063, in add_job
          job_submit_dir = self.determine_job_submit_directory(jobid, self._job_info[jobid][0])
      KeyError: 'preprocess_j1'

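      This happens because the .dag file is never parsed: monitord assumes it is handling a rescue DAG case and skips lines in dagman.out, starting from the monitord_dagman_out_sequence entry in the monitord.info file. As a result, the job table consulted in add_job (self._job_info in the traceback above) has no entry for preprocess_j1, which raises the KeyError.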
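The failure mode described above can be sketched in a few lines of Python. This is a simplified illustration, not the actual pegasus-monitord code or the fix that was committed: parse_dag, replay, the two regular expressions, and the sample dagman.out line are all hypothetical stand-ins, and the "Submitting ... Node" line format is an assumption about what dagman.out contains.

```python
import re

# Hypothetical, simplified stand-ins; not the real pegasus-monitord code.
JOB_LINE = re.compile(r"^JOB\s+(\S+)\s+(\S+)")
# Assumed shape of a submit line in dagman.out.
SUBMIT_LINE = re.compile(r"Submitting (?:Condor|HTCondor) Node (\S+) job")


def parse_dag(dag_text):
    """Map each job id declared in a .dag file to its submit file."""
    job_info = {}
    for line in dag_text.splitlines():
        match = JOB_LINE.match(line.strip())
        if match:
            job_info[match.group(1)] = match.group(2)
    return job_info


def replay(dagman_out_lines, job_info, start_line=0):
    """Return the job ids submitted at or after start_line.

    Raises KeyError for a job that was never loaded from the .dag file,
    mirroring the lookup that fails in add_job in the traceback.
    """
    submitted = []
    for line in dagman_out_lines[start_line:]:
        match = SUBMIT_LINE.search(line)
        if match:
            jobid = match.group(1)
            job_info[jobid]  # fails if the .dag file was not parsed
            submitted.append(jobid)
    return submitted


dag_text = "JOB preprocess_j1 preprocess_j1.sub\nJOB findrange_j2 findrange_j2.sub\n"
dagman_out = ["Submitting HTCondor Node preprocess_j1 job(s)..."]

# Skipping the .dag parse leaves the job table empty, so replay fails.
try:
    replay(dagman_out, {})
    missing = None
except KeyError as err:
    missing = err.args[0]

# Parsing the .dag file first lets the same replay succeed.
recovered = replay(dagman_out, parse_dag(dag_text))
```

In this sketch the order of operations is the whole point: the job table has to be populated from the .dag file before any dagman.out line is replayed, regardless of which line the replay starts from.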

              People

              • Assignee:
                vahi Karan Vahi
              • Reporter:
                vahi Karan Vahi
              • Watchers:
                2
