Pegasus / PM-949

Rescue DAGs are not submitted correctly for DAG jobs

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: master, 4.5.0
    • Fix Version/s: master, 4.6.0, 4.5.1
    • Component/s: Pegasus Planner
    • Labels:
      None

      Description

      Thanks for looking at this. I think I have tracked down the cause of this problem:

      - First, pipedown is not idempotent, since it overwrites its own input files. In ahope this has gone away, so it should not be a problem once pipedown is fully retired.

      - Second, there appears to be a bug (or a change of behavior) in either Pegasus or Condor in how sub-dags in workflows are handled. The version of the code that Sam, Marcel, and I have been running runs pipedown as a sub-dag of the top-level workflow with three retries. The top-level DAG generated by Pegasus contains:

      SUBDAG EXTERNAL subdag_pipedown_ID0141011 /home/dbrown/projects/students/samantha_usman/ahope-comparison-paper/ahope-same-harm-exact-nomax-nosubbank/962582415-963187215/pipedown/pipedown.dag DIR /home/dbrown/projects/students/samantha_usman/ahope-comparison-paper/ahope-same-harm-exact-nomax-nosubbank/962582415-963187215/pipedown
      RETRY subdag_pipedown_ID0141011 3
      CATEGORY subdag_pipedown_ID0141011 subwf

      Each of these retries correctly uses the rescue sub-dag, as the log shows:

      06/13/15 21:02:35 Found rescue DAG number 1; running /home/dbrown/projects/students/samantha_usman/ahope-comparison-paper/ahope-same-harm-exact-nomax-nosubbank/962582415-963187215/pipedown/pipedown.dag.rescue001 in combination with normal DAG file

      And the last of these three retries writes out a rescue dag correctly:

      06/14/15 19:05:04 Writing Rescue DAG to /home/dbrown/projects/students/samantha_usman/ahope-comparison-paper/ahope-same-harm-exact-nomax-nosubbank/962582415-963187215/pipedown/pipedown.dag.rescue004...
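The rescue file naming seen in these log lines (pipedown.dag.rescue001 through pipedown.dag.rescue004) can be checked before resubmitting. The following is a minimal sketch, assuming the rescue files sit next to the DAG file and carry a numeric suffix as in the logs above. `latest_rescue_dag` is a hypothetical helper written for illustration; it is not part of Pegasus or Condor.

```python
import glob
import re
from pathlib import Path
from typing import Optional

# Matches the numeric suffix in names like "pipedown.dag.rescue004".
_RESCUE_SUFFIX = re.compile(r"\.rescue(\d+)$")


def latest_rescue_dag(dag_file: str) -> Optional[Path]:
    """Return the highest-numbered rescue file next to dag_file, or None.

    Hypothetical helper: it only inspects file names on disk and does not
    talk to Condor or Pegasus.
    """
    dag = Path(dag_file)
    best_num = -1
    best_path = None
    # Escape the DAG name so glob metacharacters in it are taken literally.
    pattern = glob.escape(dag.name) + ".rescue*"
    for candidate in dag.parent.glob(pattern):
        match = _RESCUE_SUFFIX.search(candidate.name)
        if match and int(match.group(1)) > best_num:
            best_num = int(match.group(1))
            best_path = candidate
    return best_path
```

If this returns a path before the top-level DAG is resubmitted and None afterwards, the rescue files were discarded by the resubmission.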

      Since all three retries failed, this is reported to the top-level DAG as a failed sub-dag and appears in the top-level rescue DAG. However, when I resubmit the top-level DAG with pegasus-run, the sub-dag is run with

      06/15/15 15:41:40 argv[12] == "-Force"

      which erases the old rescue DAGs and starts pipedown again from scratch. Then it hits the pipedown-is-not-idempotent problem and can add injections to the database again, depending on where and how the top-level DAGs failed.
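One way to confirm whether a given sub-dag run was started with this flag is to scan its log for the argv lines. The sketch below assumes the log prints each argument in the form shown above (argv[N] == "value"). `submitted_with_force` is a hypothetical helper, not a Pegasus or Condor API.

```python
import re
from typing import Iterable

# Matches lines such as: 06/15/15 15:41:40 argv[12] == "-Force"
_ARGV_LINE = re.compile(r'argv\[\d+\] == "(?P<arg>[^"]*)"')


def submitted_with_force(log_lines: Iterable[str]) -> bool:
    """Return True if any argv line in the log carries the -Force flag.

    Hypothetical helper; the line format is assumed from the log excerpt
    in this report. The comparison ignores case.
    """
    for line in log_lines:
        match = _ARGV_LINE.search(line)
        if match and match.group("arg").lower() == "-force":
            return True
    return False


# Usage with the line quoted in this report:
sample = ['06/15/15 15:41:40 argv[12] == "-Force"']
print(submitted_with_force(sample))
```

In practice the lines would come from the sub-dag's log file, read with open(...) and passed in directly.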

            People

            • Assignee: Karan Vahi (vahi)
            • Reporter: Duncan Brown (dbrown)