Prescript events in Condor recovery mode result in data integrity errors


    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major
    • Fix Version/s: master, 4.0
    • Affects Version/s: master, 3.1
    • Component/s: Monitord

      For one of the SCEC runs, we saw events of the following type in monitord.log:

      Directory on shock.usc.edu
      /home/scec-02/cybershk/runs/GLBT_PP_dax/dags/cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800

      2011-10-12 12:33:20,736:nllog.py:log:165: ERROR: ts=2011-10-12T19:33:20.736679Z event=netlogger.analysis.modules.stampede_loader.Analyzer.batch_flush level=Error msg="Integrity error on batch flush: (IntegrityError) (1062, \"Duplicate entry '787857--1' for key 'UNIQUE_INVOCATION'\") 'INSERT INTO invocation (job_instance_id, task_submit_seq, start_time, remote_duration, exitcode, transformation, executable, argv, abs_task_id, wf_id) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)' (787857L, '-1', 1315418427.0, '26', '0', 'dagman::pre', '/opt/pegasus/default/bin/pegasus-plan', ' -Dpegasus.log.*=/home/scec-02/cybershk/runs/GLBT_PP_dax/dags/cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/subdax_CyberShake_GLBT_76_dax_76.pre.log -Dpegasus.workflow.root.uuid=4f322a53-6b86-4b58-9610-c30d877c2c2e --conf /home/scec-02/cybershk/runs/GLBT_PP_dax/dags/cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/pegasus.760.properties --dir /home/scec-02/cybershk/runs/GLBT_PP_dax/dags --relative-dir cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/CyberShake_GLBT_76_dax_76 --relative-submit-dir cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/./CyberShake_GLBT_76_dax_76 --sites opensha,ranger,shock --cache /home/scec-02/cybershk/runs/GLBT_PP_dax/dags/cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/./CyberShake_GLBT_pre_preDAX/CyberShake_GLBT_pre-0.cache --inherited-rc-files /home/scec-02/cybershk/runs/GLBT_PP_dax/dags/cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/CyberShake_GLBT-0.replica.store --cluster horizontal --output shock --force --nocleanup --quiet --quiet --quiet --quiet --quiet --monitor --deferred --group pegasus --rescue 100 --dax /home/scec-02/cybershk/runs/GLBT_PP_dax/CyberShake_GLBT_76.dax ', None, 4615L) - batch will need to be committed per-event which will take longer"
      2011-10-12 12:33:28,552:nllog.py:log:165: ERROR: ts=2011-10-12T19:33:28.552493Z event=netlogger.analysis.modules.stampede_loader.Analyzer.individual_commit level=Error msg="Insert failed for event <class 'netlogger.analysis.schema.stampede_schema.Invocation'>:
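
      The failing key is the (job_instance_id, task_submit_seq) pair reported in the 1062 error, so the row that already occupies (787857, -1) can be pulled up for comparison. A minimal sketch, assuming the stampede database is MySQL and reachable via SQLAlchemy; the connection URL and credentials are placeholders, and the column list comes straight from the INSERT statement above:

      from sqlalchemy import create_engine, text

      # Placeholder URL -- point this at the actual stampede database.
      engine = create_engine("mysql://user:password@dbhost/stampede")

      # Column names taken from the INSERT statement in the error above.
      query = text("""
          SELECT job_instance_id, task_submit_seq, start_time,
                 remote_duration, exitcode, transformation
          FROM invocation
          WHERE job_instance_id = :jid AND task_submit_seq = :seq
      """)

      with engine.connect() as conn:
          for row in conn.execute(query, {"jid": 787857, "seq": -1}):
              print(row)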

      After running monitord in replay mode and creating a .bp file, we do see that there are duplicate events:
      bash-3.2$ cat CyberShake_GLBT-20110907T071342-0800.bp | grep 4f322a53-6b86-4b58-9610-c30d877c2c2e | grep inv.end | grep subdax_CyberShake_GLBT_76_dax_76
      ts=2011-09-07T15:35:57.000000Z event=stampede.inv.end level=Info xwf.id=4f322a53-6b86-4b58-9610-c30d877c2c2e executable=/opt/pegasus/default/bin/pegasus-plan job_inst.id=23 start_time=1315409668 job.id=subdax_CyberShake_GLBT_76_dax_76 argv=" -Dpegasus.log.*=/home/scec-02/cybershk/runs/GLBT_PP_dax/dags/cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/subdax_CyberShake_GLBT_76_dax_76.pre.log -Dpegasus.workflow.root.uuid=4f322a53-6b86-4b58-9610-c30d877c2c2e --conf /home/scec-02/cybershk/runs/GLBT_PP_dax/dags/cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/pegasus.760.properties --dir /home/scec-02/cybershk/runs/GLBT_PP_dax/dags --relative-dir cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/CyberShake_GLBT_76_dax_76 --relative-submit-dir cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/./CyberShake_GLBT_76_dax_76 --sites opensha,ranger,shock --cache /home/scec-02/cybershk/runs/GLBT_PP_dax/dags/cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/./CyberShake_GLBT_pre_preDAX/CyberShake_GLBT_pre-0.cache --inherited-rc-files /home/scec-02/cybershk/runs/GLBT_PP_dax/dags/cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/CyberShake_GLBT-0.replica.store --cluster horizontal --output shock --force --nocleanup --quiet --quiet --quiet --quiet --quiet --monitor --deferred --group pegasus --rescue 100 --dax /home/scec-02/cybershk/runs/GLBT_PP_dax/CyberShake_GLBT_76.dax " inv.id=-1 dur=89 transformation=dagman::pre exitcode=0
      ts=2011-09-07T18:00:53.000000Z event=stampede.inv.end level=Info xwf.id=4f322a53-6b86-4b58-9610-c30d877c2c2e executable=/opt/pegasus/default/bin/pegasus-plan job_inst.id=23 start_time=1315418427 job.id=subdax_CyberShake_GLBT_76_dax_76 argv=" -Dpegasus.log.*=/home/scec-02/cybershk/runs/GLBT_PP_dax/dags/cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/subdax_CyberShake_GLBT_76_dax_76.pre.log -Dpegasus.workflow.root.uuid=4f322a53-6b86-4b58-9610-c30d877c2c2e --conf /home/scec-02/cybershk/runs/GLBT_PP_dax/dags/cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/pegasus.760.properties --dir /home/scec-02/cybershk/runs/GLBT_PP_dax/dags --relative-dir cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/CyberShake_GLBT_76_dax_76 --relative-submit-dir cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/./CyberShake_GLBT_76_dax_76 --sites opensha,ranger,shock --cache /home/scec-02/cybershk/runs/GLBT_PP_dax/dags/cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/./CyberShake_GLBT_pre_preDAX/CyberShake_GLBT_pre-0.cache --inherited-rc-files /home/scec-02/cybershk/runs/GLBT_PP_dax/dags/cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/CyberShake_GLBT-0.replica.store --cluster horizontal --output shock --force --nocleanup --quiet --quiet --quiet --quiet --quiet --monitor --deferred --group pegasus --rescue 100 --dax /home/scec-02/cybershk/runs/GLBT_PP_dax/CyberShake_GLBT_76.dax " inv.id=-1 dur=26 transformation=dagman::pre exitcode=0
      ts=2011-09-09T01:32:24.000000Z event=stampede.inv.end level=Info xwf.id=4f322a53-6b86-4b58-9610-c30d877c2c2e executable=/usr/local/condor/default/bin/condor_dagman job_inst.id=23 start_time=1315420910 job.id=subdax_CyberShake_GLBT_76_dax_76 argv=" -f -l . -Debug 3 -Lockfile CyberShake_GLBT_76-76.dag.lock -Dag CyberShake_GLBT_76-76.dag -AllowVersionMismatch -AutoRescue 1 -DoRescueFrom 0 -MaxPre 2 -MaxIdle 200" task.id=dax_76 inv.id=1 dur=111034 transformation=condor::dagman exitcode=0
      -bash-3.2$
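
      The grep pipeline above can be turned into a general check for any inv.end event that repeats the same invocation key. A rough sketch, assuming the .bp file consists of whitespace-separated key=value pairs with double-quoted values (the filename is the one used in the replay above):

      import shlex
      from collections import Counter

      seen = Counter()
      with open("CyberShake_GLBT-20110907T071342-0800.bp") as bp:
          for line in bp:
              # key=value pairs; shlex keeps the quoted argv field together
              fields = dict(tok.split("=", 1) for tok in shlex.split(line) if "=" in tok)
              if fields.get("event") != "stampede.inv.end":
                  continue
              seen[(fields.get("xwf.id"), fields.get("job_inst.id"), fields.get("inv.id"))] += 1

      for (xwf, job_inst, inv), count in seen.items():
          if count > 1:
              print("duplicate inv.end: xwf.id=%s job_inst.id=%s inv.id=%s (%d events)"
                    % (xwf, job_inst, inv, count))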

      Looking at the dagman.out file, we saw that the prescript was run again for the subdax job subdax_CyberShake_GLBT_76_dax_76 when Condor DAGMan was in recovery mode.

      Normally monitord would assign this a new job instance id, but if the last recorded state was a PRE_SCRIPT state, monitord reuses the same job instance id. This causes the UNIQUE_INVOCATION integrity constraint on the invocation table to be violated in the netlogger code.

      monitord should be updated to avoid this case.
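
      One possible shape for such a fix, written here as a hypothetical standalone sketch rather than actual monitord code: when a PRE_SCRIPT event arrives for a job whose last recorded state was itself a PRE_SCRIPT state (as happens when DAGMan reruns the prescript in recovery mode), allocate a fresh job instance id instead of reusing the previous one, so the (job_instance_id, task_submit_seq) key stays unique.

      # Hypothetical sketch only; names and state strings are illustrative and
      # do not correspond to actual monitord internals.
      PRE_SCRIPT_STATES = ("PRE_SCRIPT_STARTED", "PRE_SCRIPT_SUCCESS", "PRE_SCRIPT_FAILURE")

      def assign_job_instance_id(job_id, new_state, known_instances, next_id):
          """Return (job_instance_id, next_id) for an incoming job state event.

          known_instances maps job_id -> (last_state, job_instance_id); it and
          the next_id counter stand in for monitord's bookkeeping.
          """
          previous = known_instances.get(job_id)
          if previous is None:
              # First event for this job: allocate a new instance.
              instance_id, next_id = next_id, next_id + 1
          else:
              last_state, instance_id = previous
              # A prescript starting again after a prescript state means the
              # job is being retried (e.g. DAGMan recovery mode): treat it as
              # a new job instance instead of reusing the old id.
              if new_state == "PRE_SCRIPT_STARTED" and last_state in PRE_SCRIPT_STATES:
                  instance_id, next_id = next_id, next_id + 1
          known_instances[job_id] = (new_state, instance_id)
          return instance_id, next_id

      With a guard along these lines, the second dagman::pre invocation for subdax_CyberShake_GLBT_76_dax_76 would land under its own job instance id, and the second inv.end event would load without colliding on UNIQUE_INVOCATION.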

            Assignee:
            Karan Vahi
            Reporter:
            Karan Vahi
            Watchers:
            6

              Created:
              Updated:
              Resolved: