Pegasus / PM-387

monitord dies on a successful workflow run

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major
    • master, 3.0.3, 3.1
    • Affects Version/s: 3.0.3
    • Component/s: Monitord
    • None

      We did a LIGO run from the latest 3.0 branch (3.0.3cvs).

      Looking at the outer-level workflow, I see:
      [vahi@sugar H1L1-s6c_lowmass_ihope-956707143-86400.7xhgOx]$ tail *dagman.out
      05/13 05:14:37 === === === === === === ===
      05/13 05:14:37 86 0 0 0 0 0 0
      05/13 05:14:37 Note: 26 total PRE script deferrals because of -MaxPre limit (1)
      05/13 05:14:37 All jobs Completed!
      05/13 05:14:37 Note: 0 total job deferrals because of -MaxJobs limit (5000)
      05/13 05:14:37 Note: 0 total job deferrals because of -MaxIdle limit (2000)
      05/13 05:14:37 Note: 0 total job deferrals because of node category throttles
      05/13 05:14:37 Note: 26 total PRE script deferrals because of -MaxPre limit (1)
      05/13 05:14:37 Note: 0 total POST script deferrals because of -MaxPost limit (20)
      05/13 05:14:37 **** condor_scheduniv_exec.12055169.0 (condor_DAGMAN) pid 5025 EXITING WITH STATUS 0

      DAGMan completed the workflow successfully.
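
The same check can be scripted. This is a minimal sketch, assuming the final dagman.out line keeps the shape shown in the excerpt above. The function name `dagman_exit_status` is hypothetical and is not part of Pegasus or HTCondor.

```python
import re

# Pattern modelled on the last line of the dagman.out excerpt in this report.
_EXIT_RE = re.compile(r"\(condor_DAGMAN\) pid (\d+) EXITING WITH STATUS (\d+)")


def dagman_exit_status(lines):
    """Return the exit status from the last matching line, or None if absent."""
    status = None
    for line in lines:
        match = _EXIT_RE.search(line)
        if match:
            status = int(match.group(2))
    return status
```

A return value of `None` would mean that no exit line was found, for example because DAGMan is still running.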

      However, jobstate.log does not indicate completion:
      [vahi@sugar H1L1-s6c_lowmass_ihope-956707143-86400.7xhgOx]$ tail jobstate.log
      1305277874 subdag_plot_hipe_bnslininj_summary_plots_cat_4_veto.BNSLININJ_SUMMARY_PLOTS_CAT_4_VETO_ID000053 EXECUTE 12154987.0 - - 74
      1305277874 subdag_plot_hipe_bnslininj_summary_plots_cat_3_veto.BNSLININJ_SUMMARY_PLOTS_CAT_3_VETO_ID000052 EXECUTE 12154989.0 - - 75
      1305277944 subdag_plot_hipe_allinj_summary_plots_cat_4_veto.ALLINJ_SUMMARY_PLOTS_CAT_4_VETO_ID000081 JOB_TERMINATED 12154955.0 - - 53
      1305277944 subdag_plot_hipe_allinj_summary_plots_cat_4_veto.ALLINJ_SUMMARY_PLOTS_CAT_4_VETO_ID000081 JOB_SUCCESS 0 - - 53
      1305277944 subdag_plot_hipe_full_data_summary_plots_cat_4_veto.FULL_DATA_SUMMARY_PLOTS_CAT_4_VETO_ID000049 JOB_TERMINATED 12154994.0 - - 77
      1305277944 subdag_plot_hipe_full_data_summary_plots_cat_4_veto.FULL_DATA_SUMMARY_PLOTS_CAT_4_VETO_ID000049 JOB_SUCCESS 0 - - 77
      1305277949 subdag_plot_hipe_allinj_summary_plots_cat_3_veto.ALLINJ_SUMMARY_PLOTS_CAT_3_VETO_ID000080 JOB_TERMINATED 12154956.0 - - 54
      1305277949 subdag_plot_hipe_allinj_summary_plots_cat_3_veto.ALLINJ_SUMMARY_PLOTS_CAT_3_VETO_ID000080 JOB_SUCCESS 0 - - 54
      1305277950 subdag_plot_hipe_full_data_slide_summary_plots_cat_4_veto.FULL_DATA_SLIDE_SUMMARY_PLOTS_CAT_4_VETO_ID000045 JOB_TERMINATED 12154998.0 - - 78
      1305277950 subdag_plot_hipe_full_data_slide_summary_plots_cat_4_veto.FULL_DATA_SLIDE_SUMMARY_PLOTS_CAT_4_VETO_ID000045 JOB_SUCCESS 0 - - 78
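
To list the jobs that never reached a final state in jobstate.log, a small parser may help. This is a rough sketch, assuming the whitespace-separated layout seen in the excerpt (timestamp, job name, state, then further columns that are ignored here). The function names are hypothetical, and treating `JOB_FAILURE` as a terminal state is an assumption, since only `JOB_SUCCESS` appears in the excerpt.

```python
def last_states(lines):
    """Map each job name to the last state recorded for it."""
    states = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        _timestamp, job, state = fields[:3]
        states[job] = state
    return states


def unfinished_jobs(lines, terminal=("JOB_SUCCESS", "JOB_FAILURE")):
    """Return the sorted names of jobs whose last recorded state is not terminal."""
    latest = last_states(lines)
    return sorted(job for job, state in latest.items() if state not in terminal)
```

Run over the ten lines of the tail above, this would report the two BNSLININJ jobs (ID000052 and ID000053), whose last recorded state is EXECUTE.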

      The monitord log contains this traceback:

      [vahi@sugar H1L1-s6c_lowmass_ihope-956707143-86400.7xhgOx]$ more monitord.log
      Traceback (most recent call last):
        File "/home/vahi/SOFTWARE/install/pegasus/default/bin/pegasus-monitord", line 2998, in ?
          new_dagman_out = process(workflow_entry.wf, workflow_entry.ml_buffer[0:ml_pos])
        File "/home/vahi/SOFTWARE/install/pegasus/default/bin/pegasus-monitord", line 2262, in process
          add(wf, my_jobid, my_event, condor_id=my_condor_id)
        File "/home/vahi/SOFTWARE/install/pegasus/default/bin/pegasus-monitord", line 2173, in add
          wf.update_job_state(jobid, my_job_submit_seq, event, status, my_time)
        File "/home/vahi/SOFTWARE/install/pegasus/default/bin/pegasus-monitord", line 1752, in update_job_state
          self.db_send_job_state(my_job)
        File "/home/vahi/SOFTWARE/install/pegasus/default/bin/pegasus-monitord", line 1379, in db_send_job_state
          self.output_to_db("job.state", kwargs)
        File "/home/vahi/SOFTWARE/install/pegasus/default/bin/pegasus-monitord", line 717, in output_to_db
          self._sink.send(event, kwargs)
        File "/home/vahi/SOFTWARE/install/pegasus/default/bin/pegasus-monitord", line 284, in send
          self._db.notify(d)
        File "/home/vahi/SOFTWARE/install/pegasus/default/lib/python/netlogger/analysis/modules/_base.py", line 251, in notify
          raise ProcessException(str(err))
      netlogger.analysis.modules._base.ProcessException: New instance <Jobstate at 0x2b8841a5d0d0> with identity key (<class 'netlogger.analysis.schema.stampede_schema.Jobstate'>, (2798, 'EXECUTE', 1305277946.0, '2')) conflicts with persistent instance <Jobstate at 0x2b8841a5d6d0>
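
The exception says that a second Jobstate record arrived with the same identity key as one already held by the database layer. For illustration only, here is a sketch of how exact-key repeats could be filtered before they reach a sink. It is not how pegasus-monitord or netlogger work, and it is not a fix adopted for this issue. The class `DedupSink`, the wrapped `send` callable, and the key field names are all hypothetical.

```python
class DedupSink:
    """Forward each event once per key and drop exact-key repeats.

    Hypothetical wrapper, not part of pegasus-monitord or netlogger.
    """

    def __init__(self, send, key_fields):
        self._send = send              # callable that receives each event dict
        self._key_fields = key_fields  # fields assumed to form the identity key
        self._seen = set()

    def send(self, event):
        """Return True if the event was forwarded, False if it was dropped."""
        key = tuple(event.get(name) for name in self._key_fields)
        if key in self._seen:
            return False
        self._seen.add(key)
        self._send(event)
        return True
```

Silently dropping repeats can hide a real upstream problem, so a wrapper like this would probably also want to log each dropped event.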

            Assignee: Fabio Silva (fabio)
            Reporter: Karan Vahi (vahi)
            Watchers: 3

              Created:
              Updated:
              Resolved: