A user had a bad master database (looks like a lock from a crashed program). When he tried to rebuild the database for a workflow, monitord hit a lock problem, but failed to exit properly and print a useful error message. I had to kill it with a ctrl-c.
[marton.tapai@sugwg-osg ~]$ pegasus-monitord --replay /usr1/marton.tapai/pycbc-tmp.UgSWo5nKua/work/analysis8_full_run_01-0.dag.dagman.out
2017-03-30 11:21:19,178:INFO:pegasus-monitord(68): pegasus-monitord starting - pid 859844
2017-03-30 11:21:19,184:INFO:pegasus-monitord(319): Final Command line options are: ['--replay', '/usr1/marton.tapai/pycbc-tmp.UgSWo5nKua/work/analysis8_full_run_01-0.dag.dagman.out']
2017-03-30 11:21:19,186:INFO:pegasus-monitord(960): Rotating sqlite db file /usr1/marton.tapai/pycbc-tmp.UgSWo5nKua/work/analysis8_full_run_01-0.stampede.db
Created Pegasus database in: sqlite:////usr1/marton.tapai/pycbc-tmp.UgSWo5nKua/work/analysis8_full_run_01-0.stampede.db
2017-03-30 11:21:19,349:INFO:Pegasus.monitoring.event_output(315): output type=DB namespace=stampede. name=sqlite:////usr1/marton.tapai/pycbc-tmp.UgSWo5nKua/work/analysis8_full_run_01-0.stampede.db
2017-03-30 11:21:19,463:INFO:Pegasus.monitoring.event_output(315): output type=DB namespace=dashboard. name=sqlite:////home/marton.tapai/.pegasus/workflow.db
2017-03-30 11:21:19,463:INFO:Pegasus.db.expunge(62): Expunging 1772374b-a238-49da-8d22-0b563ed6269e from dashboard database
2017-03-30 11:21:24,628:ERROR:pegasus-monitord(998): (OperationalError) database is locked None None
Traceback (most recent call last):
File "/bin/pegasus-monitord", line 996, in <module>
eo.purge_wf_uuid_from_dashboard_database(run, dashboard_event_dest)
File "/usr/lib64/python2.7/site-packages/Pegasus/monitoring/event_output.py", line 83, in purge_wf_uuid_from_dashboard_database
expunge.delete_dashboard_workflow(output_db, wf_uuid)
File "/usr/lib64/python2.7/site-packages/Pegasus/db/expunge.py", line 77, in delete_dashboard_workflow
session.commit()
File "/usr/lib64/pegasus/externals/python/sqlalchemy/orm/scoping.py", line 149, in do
return getattr(self.registry(), name)(*args, **kwargs)
File "/usr/lib64/pegasus/externals/python/sqlalchemy/orm/session.py", line 719, in commit
self.transaction.commit()
File "/usr/lib64/pegasus/externals/python/sqlalchemy/orm/session.py", line 356, in commit
t[1].commit()
File "/usr/lib64/pegasus/externals/python/sqlalchemy/engine/base.py", line 1203, in commit
self._do_commit()
File "/usr/lib64/pegasus/externals/python/sqlalchemy/engine/base.py", line 1234, in _do_commit
self.connection._commit_impl()
File "/usr/lib64/pegasus/externals/python/sqlalchemy/engine/base.py", line 492, in _commit_impl
self._handle_dbapi_exception(e, None, None, None, None)
File "/usr/lib64/pegasus/externals/python/sqlalchemy/engine/base.py", line 489, in _commit_impl
self.engine.dialect.do_commit(self.connection)
File "/usr/lib64/pegasus/externals/python/sqlalchemy/engine/default.py", line 293, in do_commit
dbapi_connection.commit()
OperationalError: (OperationalError) database is locked None None
2017-03-30 11:21:24,630:ERROR:pegasus-monitord(999): error flushing previous wf_uuid from dashboard database... continuing...
2017-03-30 11:21:24,630:ERROR:pegasus-monitord(1000): cannot create events output... disabling event output!
2017-03-30 11:21:24,630:ERROR:pegasus-monitord(1004): cannot create dashboard events output... disabling dashboard event output!
2017-03-30 11:21:24,630:INFO:pegasus-monitord(1010): monitord started in fast start mode
2017-03-30 11:21:24,636:INFO:Pegasus.monitoring.workflow(953): Rotating jobstate.log replay_mode 1 previous_processed_line 0
2017-03-30 11:21:24,653:ERROR:Pegasus.db.dashboard_loader.DashboardLoader(82): Insert failed for event "dashboard.wf.plan" : (IntegrityError) column wf_uuid is not unique u'INSERT INTO master_workflow (wf_uuid, dax_label, dax_version, dax_file, dag_file_name, timestamp, submit_hostname, submit_dir, planner_arguments, user, grid_dn, planner_version, db_url, archived) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)' ('1772374b-a238-49da-8d22-0b563ed6269e', 'analysis8_full_run_01', '3.6', '/home/marton.tapai/analysis/O2/analysis8/full_run_01/fullrun01/analysis8_full_run_01.dax', 'analysis8_full_run_01-0.dag', 1489251422.0, 'sugwg-osg.phy.syr.edu', '/usr1/marton.tapai/pycbc-tmp.UgSWo5nKua/work', None, 'marton.tapai', '/DC=org/DC=cilogon/C=US/O=LIGO/CN=Marton Tapai marton.tapai@ligo.org', '4.7.4', 'sqlite:////usr1/marton.tapai/pycbc-tmp.UgSWo5nKua/work/analysis8_full_run_01-0.stampede.db', 0)
2017-03-30 11:21:24,679:INFO:pegasus-monitord(773): Using DAGMan version 8.6.0 80600
2017-03-30 11:21:24,683:INFO:pegasus-monitord(759): using dag analysis8_full_run_01-0.dag
2017-03-30 11:21:24,683:INFO:Pegasus.monitoring.workflow(675): DAGMan starting with condor id 7976151.0
2017-03-30 11:21:24,691:ERROR:Pegasus.db.dashboard_loader.DashboardLoader(184): Integrity error on batch flush: (IntegrityError) columns wf_id, state, timestamp are not unique u'INSERT INTO master_workflowstate (wf_id, state, timestamp, restart_count, status, reason) VALUES (?, ?, ?, ?, ?, ?)' (4, 'WORKFLOW_STARTED', 1489251425.0, 0, None, None) - batch will need to be committed per-event which will take longer
2017-03-30 11:21:24,697:ERROR:Pegasus.db.dashboard_loader.DashboardLoader(107): Insert failed for event <class 'Pegasus.db.schema.DashboardWorkflowstate'>:
- wf_id : 4
- state : WORKFLOW_STARTED
- wf_uuid : 1772374b-a238-49da-8d22-0b563ed6269e
- timestamp : 1489251425.0
- restart_count : 0
- ts : 1489251425.0
- reason : None
- event : dashboard.xwf.start
: (IntegrityError) columns wf_id, state, timestamp are not unique u'INSERT INTO master_workflowstate (wf_id, state, timestamp, restart_count, status, reason) VALUES (?, ?, ?, ?, ?, ?)' (4, 'WORKFLOW_STARTED', 1489251425.0, 0, None, None)
2017-03-30 11:21:24,722:INFO:pegasus-monitord(1287): found new workflow to track: /usr1/marton.tapai/pycbc-tmp.UgSWo5nKua/work/analysis8_full_run_01-main_ID0000001.000/analysis8_full_run_01-main-0.dag.dagman.out
2017-03-30 11:21:24,723:INFO:Pegasus.monitoring.workflow(953): Rotating jobstate.log replay_mode 1 previous_processed_line 0
^C2017-03-30 11:21:26,303:WARNING:pegasus-monitord(873): graceful exit on signal 2
2017-03-30 11:21:26,304:WARNING:Pegasus.monitoring.workflow(1113): unable to remove recover file: /usr1/marton.tapai/pycbc-tmp.UgSWo5nKua/work/monitord.recover
2017-03-30 11:21:26,304:CRITICAL:Pegasus.monitoring.workflow(1047): error processing static bp file /usr1/marton.tapai/pycbc-tmp.UgSWo5nKua/work/analysis8_full_run_01-main_ID0000001.000/analysis8_full_run_01-main-0.static.bp, exiting...
2017-03-30 11:21:26,305:CRITICAL:Pegasus.monitoring.workflow(1048): Traceback (most recent call last):
File "/usr/lib64/python2.7/site-packages/Pegasus/monitoring/workflow.py", line 1031, in _init_
my_keys = my_bp_parser.parseLine(my_line)
File "/usr/lib64/python2.7/site-packages/Pegasus/netlogger/parsers/base.py", line 593, in parseLine
fields = _bp_extract(line, validate=self.verify)
File "/usr/lib64/python2.7/site-packages/Pegasus/netlogger/parsers/base.py", line 180, in _bp_extract
for n, v, vq in BP_EXPR.findall(s):
File "/bin/pegasus-monitord", line 883, in prog_sigint_handler
sys.exit(1)
SystemExit: 1
2017-03-30 11:21:26,307:INFO:pegasus-monitord(182): DB flushing beginning
2017-03-30 11:21:26,307:INFO:Pegasus.db.workflow_loader.WorkflowLoader(1052): Executing final flush
2017-03-30 11:21:26,396:INFO:Pegasus.db.dashboard_loader.DashboardLoader(343): Executing final flush
2017-03-30 11:21:26,397:INFO:pegasus-monitord(196): DB flushing ended