Incorrect (malformed) rescue DAG gets submitted when the planner dies because of a memory-related issue

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: master, 4.6.0, 4.5.1
    • Affects Version/s: master, 4.5.0
    • Component/s: Pegasus Planner
    • Labels: None

      Hi Larne,

      The dag file:

      /local/user/lppekows/pycbc-tmp.dRq2USWe1Y/work/main_ID0000001/main-0.dag

      on atlas2 is invalid. If you look at the end of the file you can see a
      partial entry, which suggests that the process writing the DAG
      terminated partway through. As the PARENT...CHILD entries are written
      last, this file has none of them. If you look in:

      /local/user/lppekows/pycbc-tmp.dRq2USWe1Y/work/subdax_main_ID0000001.pre.log.000

      this theory seems to be confirmed: A failure message for the dag
      writing process is given (out of memory!).

      However when this runs a second time in:

      /local/user/lppekows/pycbc-tmp.dRq2USWe1Y/work/subdax_main_ID0000001.pre.log.001

      it sees the existing .dag file and just tries to submit it, without
      checking that the file is complete. That looks like a bug in Pegasus.

      Cheers
      Ian
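
The failure mode described above (a .dag file truncated before its PARENT...CHILD section, then picked up and submitted on the retry) could be caught with a sanity check before an existing file is reused. The following is a minimal sketch, not the fix that went into Pegasus: the helper name dag_looks_complete is hypothetical, and the rule that a DAG with more than one JOB entry must contain at least one PARENT...CHILD entry is an assumption that happens to hold for this workflow but not for every valid DAG.

```python
from pathlib import Path


def dag_looks_complete(dag_path):
    """Heuristically decide whether a DAGMan file was written out in full.

    Hypothetical helper for illustration; not part of Pegasus or HTCondor.
    Returns a (ok, reason) tuple.
    """
    text = Path(dag_path).read_text()
    if not text.strip():
        return False, "file is empty"

    # A writer killed mid-line typically leaves a final line without a newline.
    if not text.endswith("\n"):
        return False, "last line is not newline-terminated (partial entry?)"

    jobs = set()
    edges = 0
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or tokens[0].startswith("#"):
            continue  # skip blank lines and comments

        keyword = tokens[0].upper()
        if keyword == "JOB":
            # Expect at least: JOB <name> <submit file>
            if len(tokens) < 3:
                return False, f"incomplete JOB entry: {line!r}"
            jobs.add(tokens[1])
        elif keyword == "PARENT":
            # Expect: PARENT <p1> [<p2> ...] CHILD <c1> [<c2> ...]
            upper = [t.upper() for t in tokens]
            if "CHILD" not in upper[2:-1]:
                return False, f"incomplete PARENT...CHILD entry: {line!r}"
            edges += 1

    # Assumption: a multi-job workflow has at least one dependency.
    if len(jobs) > 1 and edges == 0:
        return False, "multiple JOB entries but no PARENT...CHILD entries"

    return True, "ok"
```

Called on a file like the main-0.dag described above, a check of this kind would return False and a reason, and the caller could then regenerate the DAG rather than submit it. Writing the DAG to a temporary file and renaming it into place only after the last entry is written would avoid leaving a partial file behind at all.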

      On 24 July 2015 at 04:06, Larne Pekowsky <lppekows@syr.edu> wrote:
      Hi all,

      I have a workflow on atlas, started from

      /home/lppekows/projects/cbc/pycbc1.1_review/analysis8_ahope-same-harm-exact-nomax-nosubbank/962582415-963187215

      and running in

      /local/user/lppekows/pycbc-tmp.dRq2USWe1Y/work

      It looks like none of the inspiral jobs were scheduled. They’re in
      main-0.dag, but there are no inspiral*out* or inspiral*err* files; the
      workflow seems to have just jumped directly to the llwadd jobs.

      Has anyone seen anything like this before?

      Thanks,

      Larne

            Assignee: Karan Vahi
            Reporter: Duncan Brown
            Archiver: Rajiv Mayani
