For one of the SCEC runs, we saw errors of the following type in the monitord.log.

Run directory on shock.usc.edu:
/home/scec-02/cybershk/runs/GLBT_PP_dax/dags/cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800
2011-10-12 12:33:20,736:nllog.py:log:165: ERROR: ts=2011-10-12T19:33:20.736679Z event=netlogger.analysis.modules.stampede_loader.Analyzer.batch_flush level=Error msg="Integrity error on batch flush: (IntegrityError) (1062, \"Duplicate entry '787857--1' for key 'UNIQUE_INVOCATION'\") 'INSERT INTO invocation (job_instance_id, task_submit_seq, start_time, remote_duration, exitcode, transformation, executable, argv, abs_task_id, wf_id) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)' (787857L, '-1', 1315418427.0, '26', '0', 'dagman::pre', '/opt/pegasus/default/bin/pegasus-plan', ' -Dpegasus.log.*=/home/scec-02/cybershk/runs/GLBT_PP_dax/dags/cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/subdax_CyberShake_GLBT_76_dax_76.pre.log -Dpegasus.workflow.root.uuid=4f322a53-6b86-4b58-9610-c30d877c2c2e --conf /home/scec-02/cybershk/runs/GLBT_PP_dax/dags/cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/pegasus.760.properties --dir /home/scec-02/cybershk/runs/GLBT_PP_dax/dags --relative-dir cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/CyberShake_GLBT_76_dax_76 --relative-submit-dir cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/./CyberShake_GLBT_76_dax_76 --sites opensha,ranger,shock --cache /home/scec-02/cybershk/runs/GLBT_PP_dax/dags/cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/./CyberShake_GLBT_pre_preDAX/CyberShake_GLBT_pre-0.cache --inherited-rc-files /home/scec-02/cybershk/runs/GLBT_PP_dax/dags/cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/CyberShake_GLBT-0.replica.store --cluster horizontal --output shock --force --nocleanup --quiet --quiet --quiet --quiet --quiet --monitor --deferred --group pegasus --rescue 100 --dax /home/scec-02/cybershk/runs/GLBT_PP_dax/CyberShake_GLBT_76.dax ', None, 4615L) - batch will need to be committed per-event which will take longer"
2011-10-12 12:33:28,552:nllog.py:log:165: ERROR: ts=2011-10-12T19:33:28.552493Z event=netlogger.analysis.modules.stampede_loader.Analyzer.individual_commit level=Error msg="Insert failed for event <class 'netlogger.analysis.schema.stampede_schema.Invocation'>:
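The log message describes the loader's fallback strategy: it first tries to flush a whole batch of events in one transaction, and only when that fails with an integrity error does it fall back to committing one event at a time ("batch will need to be committed per-event which will take longer"), so that the single duplicate row can be isolated. A minimal sketch of that pattern, using an illustrative in-memory session rather than the real stampede_loader code:

```python
# Hedged sketch of the batch-flush-then-per-event-commit fallback implied by
# the log message above. FakeSession and its API are illustrative stand-ins
# for a database session with a unique key on (job_instance_id, task_submit_seq).

class IntegrityError(Exception):
    pass

class FakeSession:
    """Minimal in-memory stand-in for a DB session with a unique key."""
    def __init__(self):
        self.rows = set()       # committed unique keys
        self.pending = []       # keys queued for the next flush

    def add(self, key):
        self.pending.append(key)

    def flush_all(self):
        # All-or-nothing batch flush: fail if any pending key already exists.
        if any(k in self.rows for k in self.pending):
            raise IntegrityError("duplicate key in batch")
        self.rows.update(self.pending)
        self.pending = []

    def commit_one(self, key):
        if key in self.rows:
            raise IntegrityError("duplicate key %r" % (key,))
        self.rows.add(key)

def flush_batch(session):
    """Flush pending events; on integrity error, commit per-event instead."""
    try:
        session.flush_all()
        return []
    except IntegrityError:
        skipped = []
        pending, session.pending = session.pending, []
        for key in pending:
            try:
                session.commit_one(key)
            except IntegrityError:
                skipped.append(key)  # the duplicate invocation row
        return skipped

session = FakeSession()
session.commit_one((787857, -1))          # first inv.end event, already loaded
for key in [(787857, -1), (787858, 1)]:   # replayed batch contains a duplicate
    session.add(key)
print(flush_batch(session))               # → [(787857, -1)]
```

The per-event path is slower but lets the loader insert every non-duplicate row in the batch instead of rejecting all of them.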
After running monitord in replay mode and generating a bp file, we can confirm that there are duplicate events:
bash-3.2$ cat CyberShake_GLBT-20110907T071342-0800.bp | grep 4f322a53-6b86-4b58-9610-c30d877c2c2e | grep inv.end | grep subdax_CyberShake_GLBT_76_dax_76
ts=2011-09-07T15:35:57.000000Z event=stampede.inv.end level=Info xwf.id=4f322a53-6b86-4b58-9610-c30d877c2c2e executable=/opt/pegasus/default/bin/pegasus-plan job_inst.id=23 start_time=1315409668 job.id=subdax_CyberShake_GLBT_76_dax_76 argv=" -Dpegasus.log.*=/home/scec-02/cybershk/runs/GLBT_PP_dax/dags/cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/subdax_CyberShake_GLBT_76_dax_76.pre.log -Dpegasus.workflow.root.uuid=4f322a53-6b86-4b58-9610-c30d877c2c2e --conf /home/scec-02/cybershk/runs/GLBT_PP_dax/dags/cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/pegasus.760.properties --dir /home/scec-02/cybershk/runs/GLBT_PP_dax/dags --relative-dir cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/CyberShake_GLBT_76_dax_76 --relative-submit-dir cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/./CyberShake_GLBT_76_dax_76 --sites opensha,ranger,shock --cache /home/scec-02/cybershk/runs/GLBT_PP_dax/dags/cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/./CyberShake_GLBT_pre_preDAX/CyberShake_GLBT_pre-0.cache --inherited-rc-files /home/scec-02/cybershk/runs/GLBT_PP_dax/dags/cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/CyberShake_GLBT-0.replica.store --cluster horizontal --output shock --force --nocleanup --quiet --quiet --quiet --quiet --quiet --monitor --deferred --group pegasus --rescue 100 --dax /home/scec-02/cybershk/runs/GLBT_PP_dax/CyberShake_GLBT_76.dax " inv.id=-1 dur=89 transformation=dagman::pre exitcode=0
ts=2011-09-07T18:00:53.000000Z event=stampede.inv.end level=Info xwf.id=4f322a53-6b86-4b58-9610-c30d877c2c2e executable=/opt/pegasus/default/bin/pegasus-plan job_inst.id=23 start_time=1315418427 job.id=subdax_CyberShake_GLBT_76_dax_76 argv=" -Dpegasus.log.*=/home/scec-02/cybershk/runs/GLBT_PP_dax/dags/cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/subdax_CyberShake_GLBT_76_dax_76.pre.log -Dpegasus.workflow.root.uuid=4f322a53-6b86-4b58-9610-c30d877c2c2e --conf /home/scec-02/cybershk/runs/GLBT_PP_dax/dags/cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/pegasus.760.properties --dir /home/scec-02/cybershk/runs/GLBT_PP_dax/dags --relative-dir cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/CyberShake_GLBT_76_dax_76 --relative-submit-dir cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/./CyberShake_GLBT_76_dax_76 --sites opensha,ranger,shock --cache /home/scec-02/cybershk/runs/GLBT_PP_dax/dags/cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/./CyberShake_GLBT_pre_preDAX/CyberShake_GLBT_pre-0.cache --inherited-rc-files /home/scec-02/cybershk/runs/GLBT_PP_dax/dags/cybershk/pegasus/CyberShake_GLBT/20110907T071342-0800/CyberShake_GLBT-0.replica.store --cluster horizontal --output shock --force --nocleanup --quiet --quiet --quiet --quiet --quiet --monitor --deferred --group pegasus --rescue 100 --dax /home/scec-02/cybershk/runs/GLBT_PP_dax/CyberShake_GLBT_76.dax " inv.id=-1 dur=26 transformation=dagman::pre exitcode=0
ts=2011-09-09T01:32:24.000000Z event=stampede.inv.end level=Info xwf.id=4f322a53-6b86-4b58-9610-c30d877c2c2e executable=/usr/local/condor/default/bin/condor_dagman job_inst.id=23 start_time=1315420910 job.id=subdax_CyberShake_GLBT_76_dax_76 argv=" -f -l . -Debug 3 -Lockfile CyberShake_GLBT_76-76.dag.lock -Dag CyberShake_GLBT_76-76.dag -AllowVersionMismatch -AutoRescue 1 -DoRescueFrom 0 -MaxPre 2 -MaxIdle 200" task.id=dax_76 inv.id=1 dur=111034 transformation=condor::dagman exitcode=0
-bash-3.2$
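The grep pipeline above works for a single known job, but the same check can be done programmatically across the whole bp file: each line is a sequence of key=value pairs (values may be quoted), and two stampede.inv.end events sharing the same (xwf.id, job_inst.id, inv.id) triple will collide on the UNIQUE_INVOCATION key at load time. A sketch, using shortened sample lines with the colliding triple from the output above; the field names come from the bp output, the parsing code is illustrative:

```python
# Hedged sketch: find duplicate stampede.inv.end events in a netlogger BP
# file by grouping on the (xwf.id, job_inst.id, inv.id) triple that backs
# the UNIQUE_INVOCATION constraint.
import shlex
from collections import Counter

def parse_bp_line(line):
    """Split a BP line of key=value tokens (quoted values allowed) into a dict."""
    fields = {}
    for token in shlex.split(line):
        key, _, value = token.partition("=")
        fields[key] = value
    return fields

def duplicate_invocations(lines):
    """Count inv.end events per (xwf.id, job_inst.id, inv.id) triple."""
    counts = Counter()
    for line in lines:
        ev = parse_bp_line(line)
        if ev.get("event") == "stampede.inv.end":
            key = (ev.get("xwf.id"), ev.get("job_inst.id"), ev.get("inv.id"))
            counts[key] += 1
    return {k: n for k, n in counts.items() if n > 1}

# Shortened sample lines reproducing the collision shown above:
lines = [
    'ts=2011-09-07T15:35:57.000000Z event=stampede.inv.end xwf.id=4f322a53-6b86-4b58-9610-c30d877c2c2e job_inst.id=23 inv.id=-1 transformation=dagman::pre',
    'ts=2011-09-07T18:00:53.000000Z event=stampede.inv.end xwf.id=4f322a53-6b86-4b58-9610-c30d877c2c2e job_inst.id=23 inv.id=-1 transformation=dagman::pre',
    'ts=2011-09-09T01:32:24.000000Z event=stampede.inv.end xwf.id=4f322a53-6b86-4b58-9610-c30d877c2c2e job_inst.id=23 inv.id=1 transformation=condor::dagman',
]
print(duplicate_invocations(lines))  # the dagman::pre triple appears twice
```

Note that the condor::dagman event is not flagged: it shares job_inst.id=23 but has inv.id=1, so it does not violate the unique key.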
Looking at the dagman.out file, we saw that the prescript was run again for the subdax job subdax_CyberShake_GLBT_76_dax_76 when Condor DAGMan was in recovery mode.
Normally monitord would assign the re-run job a new job instance id, but for some reason, if the job's last recorded state was a PRE_SCRIPT state, monitord reuses the same job instance id. This causes the invocation integrity constraint (UNIQUE_INVOCATION) to be violated in the netlogger code.
monitord should be updated to handle this case.
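The shape of the fix can be sketched as follows. This is a hypothetical simplification, not the real monitord state machine: the JobTracker class, state names, and id-allocation scheme are illustrative. The point is that a repeated PRE_SCRIPT start for the same job must be treated as a retry and get a fresh job instance id, just like any other re-run:

```python
# Hedged sketch of the proposed monitord behavior: allocate a new job
# instance id when a job's prescript starts again (e.g. DAGMan recovery),
# even if the job's last recorded state was a PRE_SCRIPT state.

class JobTracker:
    def __init__(self):
        self._next_id = 0
        self._instances = {}   # job.id -> (job_instance_id, last_state)

    def job_instance(self, job_id, new_state):
        """Return the job instance id to use for this (job, state) event."""
        entry = self._instances.get(job_id)
        if entry is None:
            inst = self._new_instance()
        else:
            inst, _last_state = entry
            # The buggy behavior reused `inst` here when _last_state was a
            # PRE_SCRIPT state; the fix treats a repeated prescript start
            # as a retry and allocates a fresh instance id.
            if new_state == "PRE_SCRIPT_STARTED":
                inst = self._new_instance()
        self._instances[job_id] = (inst, new_state)
        return inst

    def _new_instance(self):
        self._next_id += 1
        return self._next_id

tracker = JobTracker()
a = tracker.job_instance("subdax_CyberShake_GLBT_76_dax_76", "PRE_SCRIPT_STARTED")
tracker.job_instance("subdax_CyberShake_GLBT_76_dax_76", "PRE_SCRIPT_SUCCESS")
# Recovery-mode prescript re-run gets a distinct instance id, so the
# (job_instance_id, task_submit_seq) pair in the invocation table stays unique.
b = tracker.job_instance("subdax_CyberShake_GLBT_76_dax_76", "PRE_SCRIPT_STARTED")
print(a != b)  # → True
```

With distinct instance ids, the second inv.end event for the prescript inserts a new invocation row instead of colliding with the existing one.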