Incorrect (malformed) rescue DAG gets submitted if the planner dies because of a memory-related issue

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: master, 4.6.0, 4.5.1
    • Affects Version/s: master, 4.5.0
    • Component/s: Pegasus Planner
    • None

      Hi Larne,

      The dag file:

      /local/user/lppekows/pycbc-tmp.dRq2USWe1Y/work/main_ID0000001/main-0.dag

      on atlas2 is invalid. If you look at the end of the file you can see a
      partial entry, which suggests that the process writing the dag
      terminated partway through. As the PARENT...CHILD entries are written
      last, this file has none of them. If you look in:

      /local/user/lppekows/pycbc-tmp.dRq2USWe1Y/work/subdax_main_ID0000001.pre.log.000

      this theory seems to be confirmed: A failure message for the dag
      writing process is given (out of memory!).

      However, when this runs a second time in:

      /local/user/lppekows/pycbc-tmp.dRq2USWe1Y/work/subdax_main_ID0000001.pre.log.001

      it sees the existing .dag file and just tries to submit it. That looks
      like a bug in Pegasus.

      Cheers
      Ian
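
The failure mode described above (a writer killed partway through, followed by a retry that trusts whatever file it finds) can be guarded against in two ways: write the file to a temporary name and rename it only once it is complete, and sanity-check an existing file before reusing it. The following is a minimal Python sketch of both ideas, not the Pegasus planner's actual code or the fix that was applied. The function names `write_dag_atomically` and `dag_looks_complete` are hypothetical, and the completeness check is a heuristic that assumes, as in this report, that the workflow has dependencies and that the PARENT...CHILD entries are written last.

```python
import os
import tempfile


def write_dag_atomically(path, lines):
    """Write lines to a temp file next to path, then rename it into place.

    Hypothetical helper: if the writer dies midway (for example, out of
    memory), no file ever appears under the final name.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            for line in lines:
                handle.write(line + "\n")
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temp file, then let the error propagate.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def dag_looks_complete(path):
    """Heuristic check that a .dag file was not truncated.

    Assumption: the workflow has at least one dependency, so a complete
    file ends with a newline and contains a well-formed PARENT...CHILD
    line. A DAG of fully independent jobs would fail this check.
    """
    try:
        with open(path, "r") as handle:
            text = handle.read()
    except OSError:
        return False
    if not text.endswith("\n"):
        return False  # last entry was cut off mid-line
    has_edges = False
    for line in text.splitlines():
        fields = [field.upper() for field in line.split()]
        if fields and fields[0] == "PARENT":
            if "CHILD" not in fields[1:]:
                return False  # malformed dependency line
            has_edges = True
    return has_edges
```

A retry step could call `dag_looks_complete` on an existing file and regenerate it when the check fails, rather than submitting it as-is.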

      On 24 July 2015 at 04:06, Larne Pekowsky <lppekows@syr.edu> wrote:
      Hi all,

      I have a workflow on atlas, started from

      /home/lppekows/projects/cbc/pycbc1.1_review/analysis8_ahope-same-harm-exact-nomax-nosubbank/962582415-963187215

      and running in

      /local/user/lppekows/pycbc-tmp.dRq2USWe1Y/work

      It looks like none of the inspiral jobs were scheduled. They’re in
      main-0.dag, but there are no inspiral*out* or inspiral*err* files; the
      workflow seems to have jumped directly to the llwadd jobs.

      Has anyone seen anything like this before?

      Thanks,

      Larne

            Assignee: Karan Vahi
            Reporter: Duncan Brown