pegasus-analyzer should detect and report on failed job submissions



      When DAGMan is unable to submit a job to Condor, the workflow fails, but pegasus-analyzer does not report anything useful about the cause:

      $ pegasus-analyzer .

      ***********************************Summary************************************

      Submit Directory : .
      Total jobs : 13 (100.00%)

      1. jobs succeeded : 3 (23.08%)
      2. jobs failed : 1 (7.69%)
      3. jobs unsubmitted : 9 (69.23%)

      *****************************Failed jobs' details*****************************

      ==============================preprocess_ID0000001==============================

      last state: -
      site: -
      submit file: preprocess_ID0000001.sub
      output file: -
      error file: -

      ------------------------------Task #1 - Summary-------------------------------

      site : -
      hostname : -
      executable : /tmp/tutorial/submit/juve/pegasus/diamond/run0001/preprocess_ID0000001.sh
      arguments : -
      exitcode : -1
      working dir : -

      In this case, the cause (a "No such directory" error from condor_submit) appears only in the dagman.out file:

      02/18/16 08:03:00 Submitting Condor Node preprocess_ID0000001 job(s)...
      02/18/16 08:03:00 Adding a DAGMan workflow log /private/tmp/tutorial/submit/juve/pegasus/diamond/run0001/./diamond-0.dag.nodes.log
      02/18/16 08:03:00 Masking the events recorded in the DAGMAN workflow log
      02/18/16 08:03:00 Mask for workflow log is 0,1,2,4,5,7,9,10,11,12,13,16,17,24,27
      02/18/16 08:03:00 submitting: /usr/local/bin/condor_submit -a dag_node_name' '=' 'preprocess_ID0000001 -a +DAGManJobId' '=' '46 -a DAGManJobId' '=' '46 -a submit_event_notes' '=' 'DAG' 'Node:' 'preprocess_ID0000001 -a dagman_log' '=' '/private/tmp/tutorial/submit/juve/pegasus/diamond/run0001/./diamond-0.dag.nodes.log -a +DAGManNodesMask' '=' '"0,1,2,4,5,7,9,10,11,12,13,16,17,24,27" -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"stage_in_remote_local_0_0" -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never preprocess_ID0000001.sub
      02/18/16 08:03:00 From submit: Submitting job(s)
      02/18/16 08:03:00 From submit: ERROR: No such directory: /tmp/tutorial/juve/pegasus/diamond/run0001
      02/18/16 08:03:00 failed while reading from pipe.
      02/18/16 08:03:00 Read so far: Submitting job(s)ERROR: No such directory: /tmp/tutorial/juve/pegasus/diamond/run0001
      02/18/16 08:03:00 ERROR: submit attempt failed
      02/18/16 08:03:00 submit command was: /usr/local/bin/condor_submit -a dag_node_name' '=' 'preprocess_ID0000001 -a +DAGManJobId' '=' '46 -a DAGManJobId' '=' '46 -a submit_event_notes' '=' 'DAG' 'Node:' 'preprocess_ID0000001 -a dagman_log' '=' '/private/tmp/tutorial/submit/juve/pegasus/diamond/run0001/./diamond-0.dag.nodes.log -a +DAGManNodesMask' '=' '"0,1,2,4,5,7,9,10,11,12,13,16,17,24,27" -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"stage_in_remote_local_0_0" -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never preprocess_ID0000001.sub
      02/18/16 08:03:00 Job submit try 1/6 failed, will try again in >= 1 second.
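
As a rough illustration of the kind of detection requested here, the sketch below scans dagman.out lines for failed submit attempts and collects the "From submit: ERROR" text for each node. It is not pegasus-analyzer's implementation: the function name find_submit_failures and the SAMPLE string are hypothetical, and the regular expressions are derived only from the log excerpt above, so other DAGMan versions may word these messages differently.

```python
import re
from collections import defaultdict

# Patterns derived from the dagman.out excerpt in this report (assumption:
# other DAGMan versions may phrase these messages differently).
SUBMITTING_RE = re.compile(r"Submitting (?:Condor|HTCondor) Node (\S+) job\(s\)")
FROM_SUBMIT_ERROR_RE = re.compile(r"From submit: (ERROR: .*)$")
SUBMIT_FAILED_RE = re.compile(r"ERROR: submit attempt failed")


def find_submit_failures(lines):
    """Return {node_name: [error messages]} for nodes whose submit attempt failed.

    Hypothetical helper, not part of Pegasus.
    """
    failures = defaultdict(list)
    current_node = None   # node named by the most recent "Submitting ... Node" line
    pending_errors = []   # "From submit: ERROR" lines seen since that line
    for line in lines:
        line = line.rstrip("\n")
        match = SUBMITTING_RE.search(line)
        if match:
            current_node = match.group(1)
            pending_errors = []
            continue
        match = FROM_SUBMIT_ERROR_RE.search(line)
        if match and current_node is not None:
            pending_errors.append(match.group(1))
            continue
        if SUBMIT_FAILED_RE.search(line) and current_node is not None:
            # Record each distinct message once, even across retries.
            for message in pending_errors or ["(no error text captured)"]:
                if message not in failures[current_node]:
                    failures[current_node].append(message)
            pending_errors = []
    return dict(failures)


# Shortened copy of the excerpt above, used only to exercise the sketch.
SAMPLE = """\
02/18/16 08:03:00 Submitting Condor Node preprocess_ID0000001 job(s)...
02/18/16 08:03:00 From submit: Submitting job(s)
02/18/16 08:03:00 From submit: ERROR: No such directory: /tmp/tutorial/juve/pegasus/diamond/run0001
02/18/16 08:03:00 failed while reading from pipe.
02/18/16 08:03:00 ERROR: submit attempt failed
02/18/16 08:03:00 Job submit try 1/6 failed, will try again in >= 1 second.
"""

if __name__ == "__main__":
    for node, messages in find_submit_failures(SAMPLE.splitlines()).items():
        print(f"{node}: submit failed")
        for message in messages:
            print(f"  {message}")
```

A report built this way could be shown in the failed job's details in place of the "-" placeholders, so the user would not have to open dagman.out by hand.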

            Assignee:
            Karan Vahi
            Reporter:
            Gideon Juve
            Archiver:
            Rajiv Mayani
