monitord loses track when rescue dag is submitted


    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • master, 3.1, 4.0
    • Affects Version/s: master, 3.1
    • Component/s: Monitord
    • None
    • Environment:

      For the last CyberShake run, the workflow ran and the last sub-workflow failed because of job errors in that sub-workflow.

      Looking at the logs, monitord tracked the workflow and populated the DB correctly:
      -ed7d73da2b41 - CyberShake_PEDL_69-69
      1314421838 - 2011-08-26T22:10:38+0000 - MONITORD_FINISHED - 2d124c22-f0d2-4a52-956f-7e9ac6a51a67 - CyberShake_PEDL_67-67
      1314421838 - 2011-08-26T22:10:38+0000 - MONITORD_FINISHED - fea5c2a2-7aa7-46dc-82ca-bcce442ccde7 - CyberShake_PEDL_21-21
      1314421838 - 2011-08-26T22:10:38+0000 - MONITORD_FINISHED - e12094b0-caa0-4cfd-9d78-84d49d7354c6 - CyberShake_PEDL_20-20
      1314421874 - 2011-08-26T22:11:14+0000 - MONITORD_FINISHED - 75689e1f-275c-4687-a0f6-9cdcaf3c7bd1 - CyberShake_PEDL_23-23
      1314421913 - 2011-08-26T22:11:53+0000 - MONITORD_FINISHED - 1453a8cb-6f7c-4a74-b923-9c02bb056d3e - CyberShake_PEDL_22-22
      1314421916 - 2011-08-26T22:11:56+0000 - MONITORD_FINISHED - 115dc580-935f-448c-ba3b-728265950d95 - CyberShake_PEDL_70-70
      1314421916 - 2011-08-26T22:11:56+0000 - MONITORD_FINISHED - 6866aa89-77a0-4de1-ab01-027c1d1e53c0 - CyberShake_PEDL_25-25
      1314421916 - 2011-08-26T22:11:56+0000 - MONITORD_FINISHED - 86e7c3d8-4b52-47ed-b12d-cbd12b5f08fd - CyberShake_PEDL_24-24
      1314421916 - 2011-08-26T22:11:56+0000 - MONITORD_FINISHED - a22e290f-4fbd-45fb-9588-b4f84bf66f16 - CyberShake_PEDL_72-72
      1314421916 - 2011-08-26T22:11:56+0000 - MONITORD_FINISHED - c1dd0f54-70d4-435c-bd86-99e43e2d034a - CyberShake_PEDL_71-71

      2011-08-26T22:11:56+0000 - pegasus-monitord ending -----------------------------

      2011-08-26T22:11:56+0000 - pegasus-monitord - DB flushing beginning ------------
      2011-08-26T22:12:08+0000 - pegasus-monitord - DB flushing ended ----------------

      Scott fixed the errors that caused the sub-workflow to fail and submitted the rescue DAG using pegasus-run. After that, however, monitord seems to lose track of the job submit sequence for a job:

      2011-08-29T08:05:44+0000 - pegasus-monitord starting ---------------------------

      1314630345 - 2011-08-29T08:05:45+0000 - MONITORD_STARTED - 14ee5543-8d28-4a8d-82b3-c425b33e7e9d - CyberShake_PEDL-0
      1314630363 - 2011-08-29T08:06:03+0000 - MONITORD_STARTED - 6fadedd2-7ec2-4c54-87b0-0c73033d16d8 - CyberShake_PEDL_db_products-0
      2011-08-29 08:06:05,373:pegasus-monitord:add:3365: WARNING: cannot find job_submit_seq for job: Load_Amps_CS_Products_Load_Amps_PEDL
      2011-08-29 08:06:05,388:pegasus-monitord:add:3365: WARNING: cannot find job_submit_seq for job: Load_Amps_CS_Products_Load_Amps_PEDL
      2011-08-29 08:06:05,389:pegasus-monitord:add:3365: WARNING: cannot find job_submit_seq for job: Load_Amps_CS_Products_Load_Amps_PEDL
      2011-08-29 08:06:05,389:pegasus-monitord:add:3365: WARNING: cannot find job_submit_seq for job: Load_Amps_CS_Products_Load_Amps_PEDL
      2011-08-29 08:06:05,389:pegasus-monitord:add:3365: WARNING: cannot find job_submit_seq for job: Load_Amps_CS_Products_Load_Amps_PEDL
      1314630366 - 2011-08-29T08:06:06+0000 - MONITORD_FINISHED - 6fadedd2-7ec2-4c54-87b0-0c73033d16d8 - CyberShake_PEDL_db_products-0
      1314631704 - 2011-08-29T08:28:24+0000 - MONITORD_STARTED - 6fadedd2-7ec2-4c54-87b0-0c73033d16d8 - CyberShake_PEDL_db_products-0
      1314631708 - 2011-08-29T08:28:28+0000 - MONITORD_FINISHED - 6fadedd2-7ec2-4c54-87b0-0c73033d16d8 - CyberShake_PEDL_db_products-0
      1314633074 - 2011-08-29T08:51:14+0000 - MONITORD_STARTED - 6fadedd2-7ec2-4c54-87b0-0c73033d16d8 - CyberShake_PEDL_db_products-0
      1314633080 - 2011-08-29T08:51:20+0000 - MONITORD_FINISHED - 6fadedd2-7ec2-4c54-87b0-0c73033d16d8 - CyberShake_PEDL_db_products-0
      1314634426 - 2011-08-29T09:13:46+0000 - MONITORD_STARTED - 6fadedd2-7ec2-4c54-87b0-0c73033d16d8 - CyberShake_PEDL_db_products-0
      1314634431 - 2011-08-29T09:13:51+0000 - MONITORD_FINISHED - 6fadedd2-7ec2-4c54-87b0-0c73033d16d8 - CyberShake_PEDL_db_products-0
      1314635597 - 2011-08-29T09:33:17+0000 - MONITORD_FINISHED - 14ee5543-8d28-4a8d-82b3-c425b33e7e9d - CyberShake_PEDL-0

      2011-08-29T09:33:17+0000 - pegasus-monitord ending -----------------------------

      2011-08-29T09:33:17+0000 - pegasus-monitord - DB flushing beginning ------------
      2011-08-29T09:33:17+0000 - pegasus-monitord - DB flushing ended ----------------
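
To see at a glance which jobs are affected, the warnings in a monitord log can be tallied per job name. This is only a minimal sketch: the regular expression is written against the warning text shown above, and the helper name `count_missing_submit_seq` is hypothetical, not part of monitord.

```python
import re
from collections import Counter

# Matches the warning text as it appears in the log excerpt above.
WARNING_RE = re.compile(r"WARNING: cannot find job_submit_seq for job: (\S+)")


def count_missing_submit_seq(lines):
    """Return a Counter mapping job name -> number of warnings seen."""
    counts = Counter()
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Three lines copied from the excerpt above: two warnings and one
# ordinary MONITORD_FINISHED line that must not be counted.
sample = [
    "2011-08-29 08:06:05,373:pegasus-monitord:add:3365: WARNING: "
    "cannot find job_submit_seq for job: Load_Amps_CS_Products_Load_Amps_PEDL",
    "2011-08-29 08:06:05,388:pegasus-monitord:add:3365: WARNING: "
    "cannot find job_submit_seq for job: Load_Amps_CS_Products_Load_Amps_PEDL",
    "1314630366 - 2011-08-29T08:06:06+0000 - MONITORD_FINISHED - "
    "6fadedd2-7ec2-4c54-87b0-0c73033d16d8 - CyberShake_PEDL_db_products-0",
]

counts = count_missing_submit_seq(sample)
for job, n in counts.most_common():
    print(job, n)
```

For a real log, the same function can be fed an open file object instead of the `sample` list.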

      monitord is then started again a couple of times, and we start seeing these errors:

      1314646730 - 2011-08-29T12:38:50+0000 - MONITORD_STARTED - 14ee5543-8d28-4a8d-82b3-c425b33e7e9d - CyberShake_PEDL-0
      2011-08-29 12:39:01,679:pegasus-monitord:add:3365: WARNING: cannot find job_submit_seq for job: subdax_CyberShake_PEDL_38_dax_38
      2011-08-29 12:39:01,681:pegasus-monitord:add:3365: WARNING: cannot find job_submit_seq for job: subdax_CyberShake_PEDL_38_dax_38
      2011-08-29 16:06:47,046:nllog.py:log:165: ERROR: ts=2011-08-29T23:06:47.046387Z event=netlogger.analysis.modules.stampede_loader.Analyzer.batch_flush level=Error msg="Connection problem during commit: (OperationalError) (1048, \"Column 'transformation' cannot be null\") 'INSERT INTO invocation (job_instance_id, task_submit_seq, start_time, remote_duration, exitcode, transformation, executable, argv, abs_task_id, wf_id) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)' (44934L, '1', 1314646724.0, '7315', '0', None, '', None, None, 89L) - reattempting batch"
      2011-08-29 16:06:47,057:nllog.py:log:165: ERROR: ts=2011-08-29T23:06:47.057190Z event=netlogger.analysis.modules.stampede_loader.Analyzer.batch_flush level=Error msg="Connection problem during commit: (OperationalError) (1048, \"Column 'transformation' cannot be null\") 'INSERT INTO invocation (job_instance_id, task_submit_seq, start_time, remote_duration, exitcode, transformation, executable, argv, abs_task_id, wf_id) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)' (44934L, '1', 1314646724.0, '7315', '0', None, '', None, None, 89L) - reattempting batch"
      2011-08-29 16:06:47,081:nllog.py:log:165: ERROR: ts=2011-08-29T23:06:47.081033Z event=netlogger.analysis.modules.stampede_loader.Analyzer.batch_flush level=Error msg="Connection problem during commit: (OperationalError) (1048, \"Column 'transformation' cannot be null\") 'INSERT INTO invocation (job_instance_id, task_submit_seq, start_time, remote_duration, exitcode, transformation, executable, argv, abs_task_id, wf_id) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)' (44934L, '1', 1314646724.0, '7315', '0', None, '', None, None, 89L) - reattempting batch"
      2011-08-29 16:06:47,086:nllog.py:log:165: ERROR: ts=2011-08-29T23:06:47.086287Z event=netlogger.analysis.modules.stampede_loader.Analyzer.batch_flush level=Error msg="Connection problem during commit: (OperationalError) (1048, \"Column 'transformation' cannot be null\") 'INSERT INTO invocation (job_instance_id, task_submit_seq, start_time, remote_duration, exitcode, transformation, executable, argv, abs_task_id, wf_id) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)' (44934L, '1', 1314646724.0, '7315', '0', None, '', None, None, 89L) - reattempting batch"
      2011-08-29 16:06:47,091:nllog.py:log:165: ERROR: ts=2011-08-29T23:06:47.091753Z event=netlogger.analysis.modules.stampede_loader.Analyzer.batch_flush level=Error msg="Connection problem during commit: (OperationalError) (1048, \"Column 'transformation' cannot be null\") 'INSERT INTO invocation (job_instance_id, task_submit_seq, start_time, remote_duration, exitcode, transformation, executable, argv, abs_task_id, wf_id) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)' (44934L, '1', 1314646724.0, '7315', '0', None, '', None, None, 89L) - reattempting batch"
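
The errors above show the loader attempting to insert an invocation row whose transformation value is None into a column that does not accept NULL. The sketch below reproduces that class of failure in isolation. It makes several assumptions: it uses an in-memory SQLite database rather than the MySQL server from the log, the `invocation` table is cut down to a hypothetical four-column version, and `try_insert` is an illustrative helper, not loader code.

```python
import sqlite3

# In-memory stand-in for the real database (assumption: SQLite, not MySQL).
conn = sqlite3.connect(":memory:")

# Hypothetical, reduced version of the invocation table. Only the
# NOT NULL constraint on `transformation` matters for this sketch.
conn.execute(
    "CREATE TABLE invocation ("
    " job_instance_id INTEGER NOT NULL,"
    " task_submit_seq INTEGER NOT NULL,"
    " transformation TEXT NOT NULL,"
    " wf_id INTEGER NOT NULL)"
)


def try_insert(row):
    """Insert one row; return 'ok' or the database error message."""
    try:
        conn.execute(
            "INSERT INTO invocation"
            " (job_instance_id, task_submit_seq, transformation, wf_id)"
            " VALUES (?, ?, ?, ?)",
            row,
        )
        return "ok"
    except sqlite3.IntegrityError as exc:
        return str(exc)


# Same shape as the failing row in the log: transformation is None.
result_bad = try_insert((44934, 1, None, 89))

# The same row with a placeholder transformation name is accepted.
result_good = try_insert((44934, 1, "placeholder_transformation", 89))

print("None transformation:", result_bad)
print("non-null transformation:", result_good)
```

Retrying the same batch cannot succeed while the row still carries a None transformation, which is consistent with the repeated "reattempting batch" messages.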

            Assignee:
            Fabio Silva
            Reporter:
            Karan Vahi
            Watchers:
            5