monitord fails to exit if database is locked

XMLWordPrintable

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • master, 4.8.0, 4.7.5
    • Affects Version/s: 4.7.4
    • Component/s: Monitord
    • None

      A user had a bad master database (looks like a lock from a crashed program). When he tried to rebuild the database for a workflow, monitord hit a lock problem, but failed to exit properly and print a useful error message. I had to kill it with a ctrl-c.

      [marton.tapai@sugwg-osg ~]$ pegasus-monitord --replay /usr1/marton.tapai/pycbc-tmp.UgSWo5nKua/work/analysis8_full_run_01-0.dag.dagman.out
      2017-03-30 11:21:19,178:INFO:pegasus-monitord(68): pegasus-monitord starting - pid 859844
      2017-03-30 11:21:19,184:INFO:pegasus-monitord(319): Final Command line options are: ['--replay', '/usr1/marton.tapai/pycbc-tmp.UgSWo5nKua/work/analysis8_full_run_01-0.dag.dagman.out']
      2017-03-30 11:21:19,186:INFO:pegasus-monitord(960): Rotating sqlite db file /usr1/marton.tapai/pycbc-tmp.UgSWo5nKua/work/analysis8_full_run_01-0.stampede.db
      Created Pegasus database in: sqlite:////usr1/marton.tapai/pycbc-tmp.UgSWo5nKua/work/analysis8_full_run_01-0.stampede.db
      2017-03-30 11:21:19,349:INFO:Pegasus.monitoring.event_output(315): output type=DB namespace=stampede. name=sqlite:////usr1/marton.tapai/pycbc-tmp.UgSWo5nKua/work/analysis8_full_run_01-0.stampede.db
      2017-03-30 11:21:19,463:INFO:Pegasus.monitoring.event_output(315): output type=DB namespace=dashboard. name=sqlite:////home/marton.tapai/.pegasus/workflow.db
      2017-03-30 11:21:19,463:INFO:Pegasus.db.expunge(62): Expunging 1772374b-a238-49da-8d22-0b563ed6269e from dashboard database
      2017-03-30 11:21:24,628:ERROR:pegasus-monitord(998): (OperationalError) database is locked None None
      Traceback (most recent call last):
      File "/bin/pegasus-monitord", line 996, in <module>
      eo.purge_wf_uuid_from_dashboard_database(run, dashboard_event_dest)
      File "/usr/lib64/python2.7/site-packages/Pegasus/monitoring/event_output.py", line 83, in purge_wf_uuid_from_dashboard_database
      expunge.delete_dashboard_workflow(output_db, wf_uuid)
      File "/usr/lib64/python2.7/site-packages/Pegasus/db/expunge.py", line 77, in delete_dashboard_workflow
      session.commit()
      File "/usr/lib64/pegasus/externals/python/sqlalchemy/orm/scoping.py", line 149, in do
      return getattr(self.registry(), name)(*args, **kwargs)
      File "/usr/lib64/pegasus/externals/python/sqlalchemy/orm/session.py", line 719, in commit
      self.transaction.commit()
      File "/usr/lib64/pegasus/externals/python/sqlalchemy/orm/session.py", line 356, in commit
      t[1].commit()
      File "/usr/lib64/pegasus/externals/python/sqlalchemy/engine/base.py", line 1203, in commit
      self._do_commit()
      File "/usr/lib64/pegasus/externals/python/sqlalchemy/engine/base.py", line 1234, in _do_commit
      self.connection._commit_impl()
      File "/usr/lib64/pegasus/externals/python/sqlalchemy/engine/base.py", line 492, in _commit_impl
      self._handle_dbapi_exception(e, None, None, None, None)
      File "/usr/lib64/pegasus/externals/python/sqlalchemy/engine/base.py", line 489, in _commit_impl
      self.engine.dialect.do_commit(self.connection)
      File "/usr/lib64/pegasus/externals/python/sqlalchemy/engine/default.py", line 293, in do_commit
      dbapi_connection.commit()
      OperationalError: (OperationalError) database is locked None None
      2017-03-30 11:21:24,630:ERROR:pegasus-monitord(999): error flushing previous wf_uuid from dashboard database... continuing...
      2017-03-30 11:21:24,630:ERROR:pegasus-monitord(1000): cannot create events output... disabling event output!
      2017-03-30 11:21:24,630:ERROR:pegasus-monitord(1004): cannot create dashboard events output... disabling dashboard event output!
      2017-03-30 11:21:24,630:INFO:pegasus-monitord(1010): monitord started in fast start mode
      2017-03-30 11:21:24,636:INFO:Pegasus.monitoring.workflow(953): Rotating jobstate.log replay_mode 1 previous_processed_line 0
      2017-03-30 11:21:24,653:ERROR:Pegasus.db.dashboard_loader.DashboardLoader(82): Insert failed for event "dashboard.wf.plan" : (IntegrityError) column wf_uuid is not unique u'INSERT INTO master_workflow (wf_uuid, dax_label, dax_version, dax_file, dag_file_name, timestamp, submit_hostname, submit_dir, planner_arguments, user, grid_dn, planner_version, db_url, archived) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)' ('1772374b-a238-49da-8d22-0b563ed6269e', 'analysis8_full_run_01', '3.6', '/home/marton.tapai/analysis/O2/analysis8/full_run_01/fullrun01/analysis8_full_run_01.dax', 'analysis8_full_run_01-0.dag', 1489251422.0, 'sugwg-osg.phy.syr.edu', '/usr1/marton.tapai/pycbc-tmp.UgSWo5nKua/work', None, 'marton.tapai', '/DC=org/DC=cilogon/C=US/O=LIGO/CN=Marton Tapai marton.tapai@ligo.org', '4.7.4', 'sqlite:////usr1/marton.tapai/pycbc-tmp.UgSWo5nKua/work/analysis8_full_run_01-0.stampede.db', 0)
      2017-03-30 11:21:24,679:INFO:pegasus-monitord(773): Using DAGMan version 8.6.0 80600
      2017-03-30 11:21:24,683:INFO:pegasus-monitord(759): using dag analysis8_full_run_01-0.dag
      2017-03-30 11:21:24,683:INFO:Pegasus.monitoring.workflow(675): DAGMan starting with condor id 7976151.0
      2017-03-30 11:21:24,691:ERROR:Pegasus.db.dashboard_loader.DashboardLoader(184): Integrity error on batch flush: (IntegrityError) columns wf_id, state, timestamp are not unique u'INSERT INTO master_workflowstate (wf_id, state, timestamp, restart_count, status, reason) VALUES (?, ?, ?, ?, ?, ?)' (4, 'WORKFLOW_STARTED', 1489251425.0, 0, None, None) - batch will need to be committed per-event which will take longer
      2017-03-30 11:21:24,697:ERROR:Pegasus.db.dashboard_loader.DashboardLoader(107): Insert failed for event <class 'Pegasus.db.schema.DashboardWorkflowstate'>:

      • wf_id : 4
      • state : WORKFLOW_STARTED
      • wf_uuid : 1772374b-a238-49da-8d22-0b563ed6269e
      • timestamp : 1489251425.0
      • restart_count : 0
      • ts : 1489251425.0
      • reason : None
      • event : dashboard.xwf.start
        : (IntegrityError) columns wf_id, state, timestamp are not unique u'INSERT INTO master_workflowstate (wf_id, state, timestamp, restart_count, status, reason) VALUES (?, ?, ?, ?, ?, ?)' (4, 'WORKFLOW_STARTED', 1489251425.0, 0, None, None)
        2017-03-30 11:21:24,722:INFO:pegasus-monitord(1287): found new workflow to track: /usr1/marton.tapai/pycbc-tmp.UgSWo5nKua/work/analysis8_full_run_01-main_ID0000001.000/analysis8_full_run_01-main-0.dag.dagman.out
        2017-03-30 11:21:24,723:INFO:Pegasus.monitoring.workflow(953): Rotating jobstate.log replay_mode 1 previous_processed_line 0
        ^C2017-03-30 11:21:26,303:WARNING:pegasus-monitord(873): graceful exit on signal 2
        2017-03-30 11:21:26,304:WARNING:Pegasus.monitoring.workflow(1113): unable to remove recover file: /usr1/marton.tapai/pycbc-tmp.UgSWo5nKua/work/monitord.recover
        2017-03-30 11:21:26,304:CRITICAL:Pegasus.monitoring.workflow(1047): error processing static bp file /usr1/marton.tapai/pycbc-tmp.UgSWo5nKua/work/analysis8_full_run_01-main_ID0000001.000/analysis8_full_run_01-main-0.static.bp, exiting...
        2017-03-30 11:21:26,305:CRITICAL:Pegasus.monitoring.workflow(1048): Traceback (most recent call last):
        File "/usr/lib64/python2.7/site-packages/Pegasus/monitoring/workflow.py", line 1031, in _init_
        my_keys = my_bp_parser.parseLine(my_line)
        File "/usr/lib64/python2.7/site-packages/Pegasus/netlogger/parsers/base.py", line 593, in parseLine
        fields = _bp_extract(line, validate=self.verify)
        File "/usr/lib64/python2.7/site-packages/Pegasus/netlogger/parsers/base.py", line 180, in _bp_extract
        for n, v, vq in BP_EXPR.findall(s):
        File "/bin/pegasus-monitord", line 883, in prog_sigint_handler
        sys.exit(1)
        SystemExit: 1

      2017-03-30 11:21:26,307:INFO:pegasus-monitord(182): DB flushing beginning
      2017-03-30 11:21:26,307:INFO:Pegasus.db.workflow_loader.WorkflowLoader(1052): Executing final flush
      2017-03-30 11:21:26,396:INFO:Pegasus.db.dashboard_loader.DashboardLoader(343): Executing final flush
      2017-03-30 11:21:26,397:INFO:pegasus-monitord(196): DB flushing ended

            Assignee:
            Karan Vahi
            Reporter:
            Duncan Brown
            Watchers:
            2 Start watching this issue

              Created:
              Updated:
              Resolved: