Uploaded image for project: 'Pegasus'
  1. Pegasus
  2. PM-1181

monitord fails to exit if database is locked

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 4.7.4
    • Fix Version/s: master, 4.8.0, 4.7.5
    • Component/s: Monitord
    • Labels:
      None

      Description

      A user had a bad master database (looks like a lock from a crashed program). When he tried to rebuild the database for a workflow, monitord hit a lock problem, but failed to exit properly and print a useful error message. I had to kill it with a ctrl-c.

      [marton.tapai@sugwg-osg ~]$ pegasus-monitord --replay /usr1/marton.tapai/pycbc-tmp.UgSWo5nKua/work/analysis8_full_run_01-0.dag.dagman.out
      2017-03-30 11:21:19,178:INFO:pegasus-monitord(68): pegasus-monitord starting - pid 859844
      2017-03-30 11:21:19,184:INFO:pegasus-monitord(319): Final Command line options are: ['--replay', '/usr1/marton.tapai/pycbc-tmp.UgSWo5nKua/work/analysis8_full_run_01-0.dag.dagman.out']
      2017-03-30 11:21:19,186:INFO:pegasus-monitord(960): Rotating sqlite db file /usr1/marton.tapai/pycbc-tmp.UgSWo5nKua/work/analysis8_full_run_01-0.stampede.db
      Created Pegasus database in: sqlite:////usr1/marton.tapai/pycbc-tmp.UgSWo5nKua/work/analysis8_full_run_01-0.stampede.db
      2017-03-30 11:21:19,349:INFO:Pegasus.monitoring.event_output(315): output type=DB namespace=stampede. name=sqlite:////usr1/marton.tapai/pycbc-tmp.UgSWo5nKua/work/analysis8_full_run_01-0.stampede.db
      2017-03-30 11:21:19,463:INFO:Pegasus.monitoring.event_output(315): output type=DB namespace=dashboard. name=sqlite:////home/marton.tapai/.pegasus/workflow.db
      2017-03-30 11:21:19,463:INFO:Pegasus.db.expunge(62): Expunging 1772374b-a238-49da-8d22-0b563ed6269e from dashboard database
      2017-03-30 11:21:24,628:ERROR:pegasus-monitord(998): (OperationalError) database is locked None None
      Traceback (most recent call last):
        File "/bin/pegasus-monitord", line 996, in <module>
          eo.purge_wf_uuid_from_dashboard_database(run, dashboard_event_dest)
        File "/usr/lib64/python2.7/site-packages/Pegasus/monitoring/event_output.py", line 83, in purge_wf_uuid_from_dashboard_database
          expunge.delete_dashboard_workflow(output_db, wf_uuid)
        File "/usr/lib64/python2.7/site-packages/Pegasus/db/expunge.py", line 77, in delete_dashboard_workflow
          session.commit()
        File "/usr/lib64/pegasus/externals/python/sqlalchemy/orm/scoping.py", line 149, in do
          return getattr(self.registry(), name)(*args, **kwargs)
        File "/usr/lib64/pegasus/externals/python/sqlalchemy/orm/session.py", line 719, in commit
          self.transaction.commit()
        File "/usr/lib64/pegasus/externals/python/sqlalchemy/orm/session.py", line 356, in commit
          t[1].commit()
        File "/usr/lib64/pegasus/externals/python/sqlalchemy/engine/base.py", line 1203, in commit
          self._do_commit()
        File "/usr/lib64/pegasus/externals/python/sqlalchemy/engine/base.py", line 1234, in _do_commit
          self.connection._commit_impl()
        File "/usr/lib64/pegasus/externals/python/sqlalchemy/engine/base.py", line 492, in _commit_impl
          self._handle_dbapi_exception(e, None, None, None, None)
        File "/usr/lib64/pegasus/externals/python/sqlalchemy/engine/base.py", line 489, in _commit_impl
          self.engine.dialect.do_commit(self.connection)
        File "/usr/lib64/pegasus/externals/python/sqlalchemy/engine/default.py", line 293, in do_commit
          dbapi_connection.commit()
      OperationalError: (OperationalError) database is locked None None
      2017-03-30 11:21:24,630:ERROR:pegasus-monitord(999): error flushing previous wf_uuid from dashboard database... continuing...
      2017-03-30 11:21:24,630:ERROR:pegasus-monitord(1000): cannot create events output... disabling event output!
      2017-03-30 11:21:24,630:ERROR:pegasus-monitord(1004): cannot create dashboard events output... disabling dashboard event output!
      2017-03-30 11:21:24,630:INFO:pegasus-monitord(1010): monitord started in fast start mode
      2017-03-30 11:21:24,636:INFO:Pegasus.monitoring.workflow(953): Rotating jobstate.log replay_mode 1 previous_processed_line 0
      2017-03-30 11:21:24,653:ERROR:Pegasus.db.dashboard_loader.DashboardLoader(82): Insert failed for event "dashboard.wf.plan" : (IntegrityError) column wf_uuid is not unique u'INSERT INTO master_workflow (wf_uuid, dax_label, dax_version, dax_file, dag_file_name, timestamp, submit_hostname, submit_dir, planner_arguments, user, grid_dn, planner_version, db_url, archived) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)' ('1772374b-a238-49da-8d22-0b563ed6269e', 'analysis8_full_run_01', '3.6', '/home/marton.tapai/analysis/O2/analysis8/full_run_01/fullrun01/analysis8_full_run_01.dax', 'analysis8_full_run_01-0.dag', 1489251422.0, 'sugwg-osg.phy.syr.edu', '/usr1/marton.tapai/pycbc-tmp.UgSWo5nKua/work', None, 'marton.tapai', '/DC=org/DC=cilogon/C=US/O=LIGO/CN=Marton Tapai marton.tapai@ligo.org', '4.7.4', 'sqlite:////usr1/marton.tapai/pycbc-tmp.UgSWo5nKua/work/analysis8_full_run_01-0.stampede.db&#39;, 0)
      2017-03-30 11:21:24,679:INFO:pegasus-monitord(773): Using DAGMan version 8.6.0 80600
      2017-03-30 11:21:24,683:INFO:pegasus-monitord(759): using dag analysis8_full_run_01-0.dag
      2017-03-30 11:21:24,683:INFO:Pegasus.monitoring.workflow(675): DAGMan starting with condor id 7976151.0
      2017-03-30 11:21:24,691:ERROR:Pegasus.db.dashboard_loader.DashboardLoader(184): Integrity error on batch flush: (IntegrityError) columns wf_id, state, timestamp are not unique u'INSERT INTO master_workflowstate (wf_id, state, timestamp, restart_count, status, reason) VALUES (?, ?, ?, ?, ?, ?)' (4, 'WORKFLOW_STARTED', 1489251425.0, 0, None, None) - batch will need to be committed per-event which will take longer
      2017-03-30 11:21:24,697:ERROR:Pegasus.db.dashboard_loader.DashboardLoader(107): Insert failed for event <class 'Pegasus.db.schema.DashboardWorkflowstate'>:
        * wf_id : 4
        * state : WORKFLOW_STARTED
        * wf_uuid : 1772374b-a238-49da-8d22-0b563ed6269e
        * timestamp : 1489251425.0
        * restart_count : 0
        * ts : 1489251425.0
        * reason : None
        * event : dashboard.xwf.start
       : (IntegrityError) columns wf_id, state, timestamp are not unique u'INSERT INTO master_workflowstate (wf_id, state, timestamp, restart_count, status, reason) VALUES (?, ?, ?, ?, ?, ?)' (4, 'WORKFLOW_STARTED', 1489251425.0, 0, None, None)
      2017-03-30 11:21:24,722:INFO:pegasus-monitord(1287): found new workflow to track: /usr1/marton.tapai/pycbc-tmp.UgSWo5nKua/work/analysis8_full_run_01-main_ID0000001.000/analysis8_full_run_01-main-0.dag.dagman.out
      2017-03-30 11:21:24,723:INFO:Pegasus.monitoring.workflow(953): Rotating jobstate.log replay_mode 1 previous_processed_line 0
      ^C2017-03-30 11:21:26,303:WARNING:pegasus-monitord(873): graceful exit on signal 2
      2017-03-30 11:21:26,304:WARNING:Pegasus.monitoring.workflow(1113): unable to remove recover file: /usr1/marton.tapai/pycbc-tmp.UgSWo5nKua/work/monitord.recover
      2017-03-30 11:21:26,304:CRITICAL:Pegasus.monitoring.workflow(1047): error processing static bp file /usr1/marton.tapai/pycbc-tmp.UgSWo5nKua/work/analysis8_full_run_01-main_ID0000001.000/analysis8_full_run_01-main-0.static.bp, exiting...
      2017-03-30 11:21:26,305:CRITICAL:Pegasus.monitoring.workflow(1048): Traceback (most recent call last):
        File "/usr/lib64/python2.7/site-packages/Pegasus/monitoring/workflow.py", line 1031, in __init__
          my_keys = my_bp_parser.parseLine(my_line)
        File "/usr/lib64/python2.7/site-packages/Pegasus/netlogger/parsers/base.py", line 593, in parseLine
          fields = _bp_extract(line, validate=self.verify)
        File "/usr/lib64/python2.7/site-packages/Pegasus/netlogger/parsers/base.py", line 180, in _bp_extract
          for n, v, vq in BP_EXPR.findall(s):
        File "/bin/pegasus-monitord", line 883, in prog_sigint_handler
          sys.exit(1)
      SystemExit: 1

      2017-03-30 11:21:26,307:INFO:pegasus-monitord(182): DB flushing beginning
      2017-03-30 11:21:26,307:INFO:Pegasus.db.workflow_loader.WorkflowLoader(1052): Executing final flush
      2017-03-30 11:21:26,396:INFO:Pegasus.db.dashboard_loader.DashboardLoader(343): Executing final flush
      2017-03-30 11:21:26,397:INFO:pegasus-monitord(196): DB flushing ended

        Attachments

          Activity

            People

            • Assignee:
              vahi Karan Vahi
              Reporter:
              dbrown Duncan Brown
            • Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: