pegasus-db-admin hangs on updating workflow db


    • Type: New Feature
    • Resolution: Fixed
    • Priority: Major
    • Affects Version/s: master, 4.9.2
    • Component/s: CLI: pegasus-db-admin

      Andrew Williamson, a LIGO user, reported this on the PyCBC Slack:

      Hi all, I’ve run into an issue on the CIT cluster trying to submit a pygrb workflow, and it seems to boil down to pegasus-db-admin hanging when connecting to my ${HOME}/.pegasus/workflow.db database. Does anybody have any ideas what’s wrong here? Thanks in advance.

      pegasus-plan --conf ./pegasus-properties.conf -d pygrb_offline.dax --sites local -o local --dir /local/andrew.williamson/pycbc-tmp.7pHaeqhUz8 --cleanup inplace --relative-dir work --cluster label --submit -vvv

      2021.02.17 07:33:41.350 PST: [INFO] Planner invoked with following arguments --conf ./pegasus-properties.conf -d pygrb_offline.dax --sites local -o local --dir /local/andrew.williamson/pycbc-tmp.7pHaeqhUz8 --cleanup inplace --relative-dir work --cluster label --submit -vvv
      :
      :
      :
      2021.02.17 07:35:12.466 PST: -----------------------------------------------------------------------
      2021.02.17 07:35:12.471 PST: File for submitting this DAG to HTCondor : pygrb_offline-0.dag.condor.sub
      2021.02.17 07:35:12.476 PST: Log of DAGMan debugging messages : pygrb_offline-0.dag.dagman.out
      2021.02.17 07:35:12.482 PST: Log of HTCondor library output : pygrb_offline-0.dag.lib.out
      2021.02.17 07:35:12.487 PST: Log of HTCondor library error messages : pygrb_offline-0.dag.lib.err
      2021.02.17 07:35:12.492 PST: Log of the life of condor_dagman itself : pygrb_offline-0.dag.dagman.log
      2021.02.17 07:35:12.497 PST:
      2021.02.17 07:35:12.502 PST: -no_submit given, not submitting DAG to HTCondor. You can do this with:
      2021.02.17 07:35:12.512 PST: -----------------------------------------------------------------------
      2021.02.17 07:35:12.518 PST: [DEBUG] condor_submit_dag exited with status 0
      2021.02.17 07:35:12.522 PST: [DEBUG] Updated environment for dagman is environment = _CONDOR_SCHEDD_ADDRESS_FILE=/var/lib/condor/spool/.schedd_address;_CONDOR_MAX_DAGMAN_LOG=0;_CONDOR_SCHEDD_DAEMON_AD_FILE=/var/lib/condor/spool/.schedd_classad;_CONDOR_DAGMAN_LOG=pygrb_offline-0.dag.dagman.out;PEGASUS_METRICS=true;
      2021.02.17 07:35:12.523 PST: [INFO] event.pegasus.code.generation dax.id pygrb_offline_0 (29.309 seconds) - FINISHED
      2021.02.17 07:35:12.527 PST: [DEBUG] Executing /usr/bin/pegasus-db-admin update -t master -c /local/andrew.williamson/pycbc-tmp.7pHaeqhUz8/work/pegasus.5084372425070655109.properties
      It doesn’t get past this point, and if I try to run the pegasus-db-admin command myself in verbose mode, all I see is:
      $ /usr/bin/pegasus-db-admin update -t master -c /local/andrew.williamson/pycbc-tmp.7pHaeqhUz8/work/pegasus.3380883461133950293.properties -vvv

      2021-02-17 07:30:41,930:DEBUG:Pegasus.tools.properties(237): processing properties file /local/andrew.williamson/pycbc-tmp.7pHaeqhUz8/work/pegasus.3380883461133950293.properties...
      2021-02-17 07:30:41,930:DEBUG:Pegasus.tools.properties(140): # parsing properties in <open file '/local/andrew.williamson/pycbc-tmp.7pHaeqhUz8/work/pegasus.3380883461133950293.properties', mode 'r' at 0x7f38a732b6f0>...
      2021-02-17 07:30:41,931:DEBUG:Pegasus.db.connection(244): Using database: sqlite:////home/andrew.williamson/.pegasus/workflow.db
      2021-02-17 07:30:41,931:DEBUG:Pegasus.db.connection(114): Connecting to: sqlite:////home/andrew.williamson/.pegasus/workflow.db with connection params as None

      After that it just hangs, and top shows the process in uninterruptible sleep (D). It never goes anywhere. I’m not sure what’s happening here, but it looks like an I/O issue?
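
      A quick way to narrow this down (a triage note, not part of the report above): a D-state hang while pegasus-db-admin opens workflow.db usually means the kernel is stuck in an I/O or file-locking call, most often against a stale or misbehaving NFS mount of the home directory, since SQLite takes POSIX advisory locks on the database file. The sketch below is a minimal, hypothetical probe of that locking, run in a child process so the parent can report a hang instead of hanging itself; the database path and the 10-second timeout are assumptions taken from the log above, not anything Pegasus ships.

      #!/usr/bin/env python3
      # Hypothetical diagnostic, not part of Pegasus: probe the POSIX advisory
      # locking that SQLite relies on for ~/.pegasus/workflow.db.
      import fcntl
      import multiprocessing
      import os
      import sys

      DB_PATH = os.path.expanduser("~/.pegasus/workflow.db")  # path from the log above

      def try_lock(path):
          # Take and release a shared advisory lock, the same style of lock
          # SQLite uses on its database file. On a hung mount this blocks in
          # the kernel, reproducing the D-state sleep seen with pegasus-db-admin.
          with open(path, "rb") as fh:
              fcntl.lockf(fh, fcntl.LOCK_SH)
              fcntl.lockf(fh, fcntl.LOCK_UN)

      if __name__ == "__main__":
          child = multiprocessing.Process(target=try_lock, args=(DB_PATH,))
          child.start()
          child.join(timeout=10)  # arbitrary timeout
          if child.is_alive():
              # terminate() cannot interrupt an uninterruptible sleep; the child
              # may linger in D state, but the parent can still report the hang.
              child.terminate()
              print("Lock test hung: the filesystem holding %s is not responding" % DB_PATH)
              sys.exit(1)
          print("Locking on %s works; the hang is probably not filesystem locking" % DB_PATH)

      If the probe hangs, the problem is on the filesystem side (for example a stale NFS mount of the home directory) rather than in pegasus-db-admin itself.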

            Assignee: Rafael Ferreira Da Silva (Inactive)
            Reporter: Karan Vahi
            Watchers: 2
