-
Type: Bug
-
Resolution: Fixed
-
Priority: Major
-
Affects Version/s: master, 3.1
-
Component/s: Monitord
-
None
-
Environment: On shock, the run is in the submit directory /home/scec-02/cybershk/runs/PEDL_PP_dax/dags/cybershk/pegasus/CyberShake_PEDL/20110826T132625-0800
The tar file is available at
http://obelix.isi.edu/stampede-workflows/SCEC/cybershake-pedl-jira-pm-482.tgz
In the last CyberShake run, the workflow ran and the last sub workflow failed because of job errors in that sub workflow.
Looking at the logs, monitord tracked the workflow and populated the DB correctly:
-ed7d73da2b41 - CyberShake_PEDL_69-69
1314421838 - 2011-08-26T22:10:38+0000 - MONITORD_FINISHED - 2d124c22-f0d2-4a52-956f-7e9ac6a51a67 - CyberShake_PEDL_67-67
1314421838 - 2011-08-26T22:10:38+0000 - MONITORD_FINISHED - fea5c2a2-7aa7-46dc-82ca-bcce442ccde7 - CyberShake_PEDL_21-21
1314421838 - 2011-08-26T22:10:38+0000 - MONITORD_FINISHED - e12094b0-caa0-4cfd-9d78-84d49d7354c6 - CyberShake_PEDL_20-20
1314421874 - 2011-08-26T22:11:14+0000 - MONITORD_FINISHED - 75689e1f-275c-4687-a0f6-9cdcaf3c7bd1 - CyberShake_PEDL_23-23
1314421913 - 2011-08-26T22:11:53+0000 - MONITORD_FINISHED - 1453a8cb-6f7c-4a74-b923-9c02bb056d3e - CyberShake_PEDL_22-22
1314421916 - 2011-08-26T22:11:56+0000 - MONITORD_FINISHED - 115dc580-935f-448c-ba3b-728265950d95 - CyberShake_PEDL_70-70
1314421916 - 2011-08-26T22:11:56+0000 - MONITORD_FINISHED - 6866aa89-77a0-4de1-ab01-027c1d1e53c0 - CyberShake_PEDL_25-25
1314421916 - 2011-08-26T22:11:56+0000 - MONITORD_FINISHED - 86e7c3d8-4b52-47ed-b12d-cbd12b5f08fd - CyberShake_PEDL_24-24
1314421916 - 2011-08-26T22:11:56+0000 - MONITORD_FINISHED - a22e290f-4fbd-45fb-9588-b4f84bf66f16 - CyberShake_PEDL_72-72
1314421916 - 2011-08-26T22:11:56+0000 - MONITORD_FINISHED - c1dd0f54-70d4-435c-bd86-99e43e2d034a - CyberShake_PEDL_71-71
2011-08-26T22:11:56+0000 - pegasus-monitord ending -----------------------------
2011-08-26T22:11:56+0000 - pegasus-monitord - DB flushing beginning ------------
2011-08-26T22:12:08+0000 - pegasus-monitord - DB flushing ended ----------------
Scott fixed the errors that caused the sub workflow to fail and submitted the rescue DAG using pegasus-run. After that, however, monitord seems to get confused about the job submit sequence for a job:
2011-08-29T08:05:44+0000 - pegasus-monitord starting ---------------------------
1314630345 - 2011-08-29T08:05:45+0000 - MONITORD_STARTED - 14ee5543-8d28-4a8d-82b3-c425b33e7e9d - CyberShake_PEDL-0
1314630363 - 2011-08-29T08:06:03+0000 - MONITORD_STARTED - 6fadedd2-7ec2-4c54-87b0-0c73033d16d8 - CyberShake_PEDL_db_products-0
2011-08-29 08:06:05,373:pegasus-monitord:add:3365: WARNING: cannot find job_submit_seq for job: Load_Amps_CS_Products_Load_Amps_PEDL
2011-08-29 08:06:05,388:pegasus-monitord:add:3365: WARNING: cannot find job_submit_seq for job: Load_Amps_CS_Products_Load_Amps_PEDL
2011-08-29 08:06:05,389:pegasus-monitord:add:3365: WARNING: cannot find job_submit_seq for job: Load_Amps_CS_Products_Load_Amps_PEDL
2011-08-29 08:06:05,389:pegasus-monitord:add:3365: WARNING: cannot find job_submit_seq for job: Load_Amps_CS_Products_Load_Amps_PEDL
2011-08-29 08:06:05,389:pegasus-monitord:add:3365: WARNING: cannot find job_submit_seq for job: Load_Amps_CS_Products_Load_Amps_PEDL
1314630366 - 2011-08-29T08:06:06+0000 - MONITORD_FINISHED - 6fadedd2-7ec2-4c54-87b0-0c73033d16d8 - CyberShake_PEDL_db_products-0
1314631704 - 2011-08-29T08:28:24+0000 - MONITORD_STARTED - 6fadedd2-7ec2-4c54-87b0-0c73033d16d8 - CyberShake_PEDL_db_products-0
1314631708 - 2011-08-29T08:28:28+0000 - MONITORD_FINISHED - 6fadedd2-7ec2-4c54-87b0-0c73033d16d8 - CyberShake_PEDL_db_products-0
1314633074 - 2011-08-29T08:51:14+0000 - MONITORD_STARTED - 6fadedd2-7ec2-4c54-87b0-0c73033d16d8 - CyberShake_PEDL_db_products-0
1314633080 - 2011-08-29T08:51:20+0000 - MONITORD_FINISHED - 6fadedd2-7ec2-4c54-87b0-0c73033d16d8 - CyberShake_PEDL_db_products-0
1314634426 - 2011-08-29T09:13:46+0000 - MONITORD_STARTED - 6fadedd2-7ec2-4c54-87b0-0c73033d16d8 - CyberShake_PEDL_db_products-0
1314634431 - 2011-08-29T09:13:51+0000 - MONITORD_FINISHED - 6fadedd2-7ec2-4c54-87b0-0c73033d16d8 - CyberShake_PEDL_db_products-0
1314635597 - 2011-08-29T09:33:17+0000 - MONITORD_FINISHED - 14ee5543-8d28-4a8d-82b3-c425b33e7e9d - CyberShake_PEDL-0
2011-08-29T09:33:17+0000 - pegasus-monitord ending -----------------------------
2011-08-29T09:33:17+0000 - pegasus-monitord - DB flushing beginning ------------
2011-08-29T09:33:17+0000 - pegasus-monitord - DB flushing ended ----------------
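To see how often each workflow was restarted and which jobs triggered the job_submit_seq warning, the log excerpt above can be summarized with a short script. This is a rough sketch, not part of monitord: the regular expressions are inferred only from the lines quoted in this report, and summarize is a hypothetical helper name.

```python
import re
from collections import Counter

# Patterns inferred from the log lines quoted in this report (assumption:
# other monitord versions may format these lines differently).
EVENT_RE = re.compile(
    r"^(?P<epoch>\d+) - (?P<iso>\S+) - MONITORD_(?P<state>STARTED|FINISHED)"
    r" - (?P<wf_uuid>[0-9a-f-]{36}) - (?P<label>\S+)$"
)
WARNING_RE = re.compile(
    r"WARNING: cannot find job_submit_seq for job: (?P<job>\S+)$"
)


def summarize(lines):
    """Count STARTED/FINISHED events per workflow and warnings per job."""
    events = Counter()
    warnings = Counter()
    for raw in lines:
        line = raw.strip()
        match = EVENT_RE.match(line)
        if match:
            events[(match.group("label"), match.group("state"))] += 1
            continue
        match = WARNING_RE.search(line)
        if match:
            warnings[match.group("job")] += 1
    return events, warnings


# A few lines copied from the log above.
sample = [
    "1314630345 - 2011-08-29T08:05:45+0000 - MONITORD_STARTED - 14ee5543-8d28-4a8d-82b3-c425b33e7e9d - CyberShake_PEDL-0",
    "1314630363 - 2011-08-29T08:06:03+0000 - MONITORD_STARTED - 6fadedd2-7ec2-4c54-87b0-0c73033d16d8 - CyberShake_PEDL_db_products-0",
    "2011-08-29 08:06:05,373:pegasus-monitord:add:3365: WARNING: cannot find job_submit_seq for job: Load_Amps_CS_Products_Load_Amps_PEDL",
    "2011-08-29 08:06:05,388:pegasus-monitord:add:3365: WARNING: cannot find job_submit_seq for job: Load_Amps_CS_Products_Load_Amps_PEDL",
    "1314630366 - 2011-08-29T08:06:06+0000 - MONITORD_FINISHED - 6fadedd2-7ec2-4c54-87b0-0c73033d16d8 - CyberShake_PEDL_db_products-0",
    "1314631704 - 2011-08-29T08:28:24+0000 - MONITORD_STARTED - 6fadedd2-7ec2-4c54-87b0-0c73033d16d8 - CyberShake_PEDL_db_products-0",
    "1314631708 - 2011-08-29T08:28:28+0000 - MONITORD_FINISHED - 6fadedd2-7ec2-4c54-87b0-0c73033d16d8 - CyberShake_PEDL_db_products-0",
]

events, warnings = summarize(sample)
for (label, state), count in sorted(events.items()):
    print(label, state, count)
for job, count in sorted(warnings.items()):
    print("warning", job, count)
```

Running the same function over the full monitord log in the submit directory should show whether the warnings are confined to the restarted sub workflows.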
monitord is then started a couple more times, and we start seeing these errors:
1314646730 - 2011-08-29T12:38:50+0000 - MONITORD_STARTED - 14ee5543-8d28-4a8d-82b3-c425b33e7e9d - CyberShake_PEDL-0
2011-08-29 12:39:01,679:pegasus-monitord:add:3365: WARNING: cannot find job_submit_seq for job: subdax_CyberShake_PEDL_38_dax_38
2011-08-29 12:39:01,681:pegasus-monitord:add:3365: WARNING: cannot find job_submit_seq for job: subdax_CyberShake_PEDL_38_dax_38
2011-08-29 16:06:47,046:nllog.py:log:165: ERROR: ts=2011-08-29T23:06:47.046387Z event=netlogger.analysis.modules.stampede_loader.Analyzer.batch_flush level=Error msg="Connection problem during commit: (OperationalError) (1048, \"Column 'transformation' cannot be null\") 'INSERT INTO invocation (job_instance_id, task_submit_seq, start_time, remote_duration, exitcode, transformation, executable, argv, abs_task_id, wf_id) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)' (44934L, '1', 1314646724.0, '7315', '0', None, '', None, None, 89L) - reattempting batch"
2011-08-29 16:06:47,057:nllog.py:log:165: ERROR: ts=2011-08-29T23:06:47.057190Z event=netlogger.analysis.modules.stampede_loader.Analyzer.batch_flush level=Error msg="Connection problem during commit: (OperationalError) (1048, \"Column 'transformation' cannot be null\") 'INSERT INTO invocation (job_instance_id, task_submit_seq, start_time, remote_duration, exitcode, transformation, executable, argv, abs_task_id, wf_id) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)' (44934L, '1', 1314646724.0, '7315', '0', None, '', None, None, 89L) - reattempting batch"
2011-08-29 16:06:47,081:nllog.py:log:165: ERROR: ts=2011-08-29T23:06:47.081033Z event=netlogger.analysis.modules.stampede_loader.Analyzer.batch_flush level=Error msg="Connection problem during commit: (OperationalError) (1048, \"Column 'transformation' cannot be null\") 'INSERT INTO invocation (job_instance_id, task_submit_seq, start_time, remote_duration, exitcode, transformation, executable, argv, abs_task_id, wf_id) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)' (44934L, '1', 1314646724.0, '7315', '0', None, '', None, None, 89L) - reattempting batch"
2011-08-29 16:06:47,086:nllog.py:log:165: ERROR: ts=2011-08-29T23:06:47.086287Z event=netlogger.analysis.modules.stampede_loader.Analyzer.batch_flush level=Error msg="Connection problem during commit: (OperationalError) (1048, \"Column 'transformation' cannot be null\") 'INSERT INTO invocation (job_instance_id, task_submit_seq, start_time, remote_duration, exitcode, transformation, executable, argv, abs_task_id, wf_id) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)' (44934L, '1', 1314646724.0, '7315', '0', None, '', None, None, 89L) - reattempting batch"
2011-08-29 16:06:47,091:nllog.py:log:165: ERROR: ts=2011-08-29T23:06:47.091753Z event=netlogger.analysis.modules.stampede_loader.Analyzer.batch_flush level=Error msg="Connection problem during commit: (OperationalError) (1048, \"Column 'transformation' cannot be null\") 'INSERT INTO invocation (job_instance_id, task_submit_seq, start_time, remote_duration, exitcode, transformation, executable, argv, abs_task_id, wf_id) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)' (44934L, '1', 1314646724.0, '7315', '0', None, '', None, None, 89L) - reattempting batch"
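The INSERT in the errors above passes None for the transformation column, which the invocation table rejects. The following is a minimal sketch of that failure mode only. It assumes a simplified table containing just the columns named in the INSERT statement, and it uses Python's built-in sqlite3 as a stand-in for the MySQL database in this report, so the exact error text differs from MySQL error 1048. The schema and the try_insert helper are illustrative and are not the actual Stampede schema or loader code.

```python
import sqlite3

# Simplified stand-in for the "invocation" table (assumption: only the
# NOT NULL constraint on "transformation" matters for this illustration).
SCHEMA = """
CREATE TABLE invocation (
    job_instance_id INTEGER,
    task_submit_seq INTEGER,
    start_time      REAL,
    remote_duration REAL,
    exitcode        INTEGER,
    transformation  TEXT NOT NULL,
    executable      TEXT,
    argv            TEXT,
    abs_task_id     TEXT,
    wf_id           INTEGER
)
"""

INSERT = (
    "INSERT INTO invocation (job_instance_id, task_submit_seq, start_time, "
    "remote_duration, exitcode, transformation, executable, argv, "
    "abs_task_id, wf_id) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)"
)


def try_insert(conn, row):
    """Insert one row; return None on success or the error text on failure."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(INSERT, row)
    except sqlite3.IntegrityError as exc:
        return str(exc)
    return None


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# Same values as in the logged INSERT: transformation is None.
failing_row = (44934, 1, 1314646724.0, 7315, 0, None, "", None, None, 89)

first_error = try_insert(conn, failing_row)
retry_error = try_insert(conn, failing_row)  # retrying the same row fails again
print(first_error)
print(retry_error)
```

Because the row itself violates the constraint, resubmitting it unchanged cannot succeed, which would be consistent with the repeated "reattempting batch" messages a few milliseconds apart.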