Type: New Feature
Resolution: Fixed
Priority: Major
Affects Version/s: master, 4.9.2
Component/s: CLI: pegasus-db-admin
Andrew Williamson, a LIGO user, reported this on pycbc slack:
Hi all, I’ve run into an issue on the CIT cluster trying to submit a pygrb workflow, and it seems to boil down to pegasus-db-admin hanging when connecting to my ${HOME}/.pegasus/workflow.db database. Does anybody have any ideas what’s wrong here? Thanks in advance.
pegasus-plan --conf ./pegasus-properties.conf -d pygrb_offline.dax --sites local -o local --dir /local/andrew.williamson/pycbc-tmp.7pHaeqhUz8 --cleanup inplace --relative-dir work --cluster label --submit -vvv
2021.02.17 07:33:41.350 PST: [INFO] Planner invoked with following arguments --conf ./pegasus-properties.conf -d pygrb_offline.dax --sites local -o local --dir /local/andrew.williamson/pycbc-tmp.7pHaeqhUz8 --cleanup inplace --relative-dir work --cluster label --submit -vvv
:
:
:
2021.02.17 07:35:12.466 PST: -----------------------------------------------------------------------
2021.02.17 07:35:12.471 PST: File for submitting this DAG to HTCondor : pygrb_offline-0.dag.condor.sub
2021.02.17 07:35:12.476 PST: Log of DAGMan debugging messages : pygrb_offline-0.dag.dagman.out
2021.02.17 07:35:12.482 PST: Log of HTCondor library output : pygrb_offline-0.dag.lib.out
2021.02.17 07:35:12.487 PST: Log of HTCondor library error messages : pygrb_offline-0.dag.lib.err
2021.02.17 07:35:12.492 PST: Log of the life of condor_dagman itself : pygrb_offline-0.dag.dagman.log
2021.02.17 07:35:12.497 PST:
2021.02.17 07:35:12.502 PST: -no_submit given, not submitting DAG to HTCondor. You can do this with:
2021.02.17 07:35:12.512 PST: -----------------------------------------------------------------------
2021.02.17 07:35:12.518 PST: [DEBUG] condor_submit_dag exited with status 0
2021.02.17 07:35:12.522 PST: [DEBUG] Updated environment for dagman is environment = _CONDOR_SCHEDD_ADDRESS_FILE=/var/lib/condor/spool/.schedd_address;_CONDOR_MAX_DAGMAN_LOG=0;_CONDOR_SCHEDD_DAEMON_AD_FILE=/var/lib/condor/spool/.schedd_classad;_CONDOR_DAGMAN_LOG=pygrb_offline-0.dag.dagman.out;PEGASUS_METRICS=true;
2021.02.17 07:35:12.523 PST: [INFO] event.pegasus.code.generation dax.id pygrb_offline_0 (29.309 seconds) - FINISHED
2021.02.17 07:35:12.527 PST: [DEBUG] Executing /usr/bin/pegasus-db-admin update -t master -c /local/andrew.williamson/pycbc-tmp.7pHaeqhUz8/work/pegasus.5084372425070655109.properties
It doesn’t get past this point, and if I try to run the pegasus-db-admin command myself in verbose mode I just see:
$ /usr/bin/pegasus-db-admin update -t master -c /local/andrew.williamson/pycbc-tmp.7pHaeqhUz8/work/pegasus.3380883461133950293.properties -vvv
2021-02-17 07:30:41,930:DEBUG:Pegasus.tools.properties(237): processing properties file /local/andrew.williamson/pycbc-tmp.7pHaeqhUz8/work/pegasus.3380883461133950293.properties...
2021-02-17 07:30:41,930:DEBUG:Pegasus.tools.properties(140): # parsing properties in <open file '/local/andrew.williamson/pycbc-tmp.7pHaeqhUz8/work/pegasus.3380883461133950293.properties', mode 'r' at 0x7f38a732b6f0>...
2021-02-17 07:30:41,931:DEBUG:Pegasus.db.connection(244): Using database: sqlite:////home/andrew.williamson/.pegasus/workflow.db
2021-02-17 07:30:41,931:DEBUG:Pegasus.db.connection(114): Connecting to: sqlite:////home/andrew.williamson/.pegasus/workflow.db with connection params as None
After that it just hangs, and top shows the process in uninterruptible sleep (D). It never goes anywhere. Not sure what’s happening here, but it looks like an I/O issue?
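Since pegasus-db-admin blocks right after logging the SQLite connection string and the process sits in uninterruptible sleep, one way to separate Pegasus from the filesystem is to open the same database file directly with Python's sqlite3 module. The path below is copied from the log above; everything else in the snippet is a hypothetical diagnostic sketch, not part of pegasus-db-admin:

#!/usr/bin/env python3
# Hypothetical diagnostic sketch: try opening the same SQLite file that
# pegasus-db-admin is connecting to. If this also blocks in uninterruptible
# sleep, the hang is at the filesystem/locking layer under $HOME rather
# than in the Pegasus database code.
import sqlite3

DB_PATH = "/home/andrew.williamson/.pegasus/workflow.db"  # taken from the log above

# The timeout only bounds waits on SQLite's own locks; a hang inside the
# filesystem (D state) will still block here, which is the signal we want.
conn = sqlite3.connect(DB_PATH, timeout=10)
try:
    cur = conn.execute("PRAGMA integrity_check;")
    print(cur.fetchone())
finally:
    conn.close()

If this direct open stalls the same way, the problem is most likely the filesystem serving the home directory (for example, file locking on the .db file over a network filesystem) rather than pegasus-db-admin itself.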