- Type: Bug
- Resolution: Fixed
- Priority: Major
- Affects Version/s: master, 4.5.0
- Component/s: Pegasus Planner
Thanks for looking at this. I think I have tracked down the cause of this problem:
- First, pipedown is not idempotent, since it overwrites its own input files. In ahope this has gone away, so it should not be a problem once pipedown is fully retired.
- Second, it looks like there is a bug (or change of behavior) in either Pegasus or Condor with sub-DAGs in workflows. The version of the code that Sam, Marcel, and I have been running runs pipedown as a sub-DAG from the top-level workflow with three retries. The top-level DAG generated by Pegasus has:
SUBDAG EXTERNAL subdag_pipedown_ID0141011 /home/dbrown/projects/students/samantha_usman/ahope-comparison-paper/ahope-same-harm-exact-nomax-nosubbank/962582415-963187215/pipedown/pipedown.dag DIR /home/dbrown/projects/students/samantha_usman/ahope-comparison-paper/ahope-same-harm-exact-nomax-nosubbank/962582415-963187215/pipedown
RETRY subdag_pipedown_ID0141011 3
CATEGORY subdag_pipedown_ID0141011 subwf
Each of these retries properly uses the rescue sub-DAG, as the log contains:
06/13/15 21:02:35 Found rescue DAG number 1; running /home/dbrown/projects/students/samantha_usman/ahope-comparison-paper/ahope-same-harm-exact-nomax-nosubbank/962582415-963187215/pipedown/pipedown.dag.rescue001 in combination with normal DAG file
And the last of these three retries writes out a rescue dag correctly:
06/14/15 19:05:04 Writing Rescue DAG to /home/dbrown/projects/students/samantha_usman/ahope-comparison-paper/ahope-same-harm-exact-nomax-nosubbank/962582415-963187215/pipedown/pipedown.dag.rescue004...
Since the three retries failed, this gets reported to the top-level DAG as a failed sub-DAG and appears in the top-level rescue DAG. However, when I resubmit the top-level DAG with pegasus-run, the sub-DAG gets run with
06/15/15 15:41:40 argv[12] == "-Force"
which erases the old rescue DAGs and starts pipedown again from scratch. Then it hits the pipedown-is-not-idempotent problem and can add injections to the database again, depending on where and how the top-level DAGs failed.
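To check whether other runs were affected, a small script along these lines could scan a sub-DAG's DAGMan log for the two messages quoted above. This is only a sketch: the function name `scan_dagman_log` is made up for illustration, and the patterns assume the log lines look exactly like the ones in this report, which may not hold for other Condor versions.

```python
import re

# Patterns copied from the log lines quoted in this report (assumption:
# other DAGMan versions may word or format these messages differently).
FORCE_RE = re.compile(r'argv\[\d+\] == "-Force"')
RESCUE_RE = re.compile(r"Found rescue DAG number (\d+)")


def scan_dagman_log(lines):
    """Return (force_seen, rescue_numbers) for an iterable of log lines.

    force_seen is True if any line shows DAGMan was started with -Force.
    rescue_numbers lists the rescue DAG numbers DAGMan reported picking up.
    """
    force_seen = False
    rescue_numbers = []
    for line in lines:
        if FORCE_RE.search(line):
            force_seen = True
        match = RESCUE_RE.search(line)
        if match:
            rescue_numbers.append(int(match.group(1)))
    return force_seen, rescue_numbers


if __name__ == "__main__":
    # Abridged sample lines; the real log entries carry full paths.
    sample = [
        "06/13/15 21:02:35 Found rescue DAG number 1; running pipedown.dag.rescue001",
        '06/15/15 15:41:40 argv[12] == "-Force"',
    ]
    print(scan_dagman_log(sample))
```

A run where `force_seen` is true after earlier rescue DAGs were picked up would match the sequence described here, where the resubmission discards the rescue state and restarts pipedown from scratch.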