Pegasus / PM-1061

pegasus-analyzer should detect and report on failed job submissions

      Description

      When DAGMan is unable to submit a job to Condor, the workflow fails, but pegasus-analyzer reports nothing useful about the cause:

      $ pegasus-analyzer .

      ************************************Summary*************************************

       Submit Directory : .
       Total jobs : 13 (100.00%)
       # jobs succeeded : 3 (23.08%)
       # jobs failed : 1 (7.69%)
       # jobs unsubmitted : 9 (69.23%)

      ******************************Failed jobs' details******************************

      ==============================preprocess_ID0000001==============================

       last state: -
             site: -
      submit file: preprocess_ID0000001.sub
      output file: -
       error file: -

      -------------------------------Task #1 - Summary--------------------------------

      site : -
      hostname : -
      executable : /tmp/tutorial/submit/juve/pegasus/diamond/run0001/preprocess_ID0000001.sh
      arguments : -
      exitcode : -1
      working dir : -

      In this case, the cause appears only in the dagman.out file, where condor_submit reports a missing directory:

      02/18/16 08:03:00 Submitting Condor Node preprocess_ID0000001 job(s)...
      02/18/16 08:03:00 Adding a DAGMan workflow log /private/tmp/tutorial/submit/juve/pegasus/diamond/run0001/./diamond-0.dag.nodes.log
      02/18/16 08:03:00 Masking the events recorded in the DAGMAN workflow log
      02/18/16 08:03:00 Mask for workflow log is 0,1,2,4,5,7,9,10,11,12,13,16,17,24,27
      02/18/16 08:03:00 submitting: /usr/local/bin/condor_submit -a dag_node_name' '=' 'preprocess_ID0000001 -a +DAGManJobId' '=' '46 -a DAGManJobId' '=' '46 -a submit_event_notes' '=' 'DAG' 'Node:' 'preprocess_ID0000001 -a dagman_log' '=' '/private/tmp/tutorial/submit/juve/pegasus/diamond/run0001/./diamond-0.dag.nodes.log -a +DAGManNodesMask' '=' '"0,1,2,4,5,7,9,10,11,12,13,16,17,24,27" -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"stage_in_remote_local_0_0" -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never preprocess_ID0000001.sub
      02/18/16 08:03:00 From submit: Submitting job(s)
      02/18/16 08:03:00 From submit: ERROR: No such directory: /tmp/tutorial/juve/pegasus/diamond/run0001
      02/18/16 08:03:00 failed while reading from pipe.
      02/18/16 08:03:00 Read so far: Submitting job(s)ERROR: No such directory: /tmp/tutorial/juve/pegasus/diamond/run0001
      02/18/16 08:03:00 ERROR: submit attempt failed
      02/18/16 08:03:00 submit command was: /usr/local/bin/condor_submit -a dag_node_name' '=' 'preprocess_ID0000001 -a +DAGManJobId' '=' '46 -a DAGManJobId' '=' '46 -a submit_event_notes' '=' 'DAG' 'Node:' 'preprocess_ID0000001 -a dagman_log' '=' '/private/tmp/tutorial/submit/juve/pegasus/diamond/run0001/./diamond-0.dag.nodes.log -a +DAGManNodesMask' '=' '"0,1,2,4,5,7,9,10,11,12,13,16,17,24,27" -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"stage_in_remote_local_0_0" -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never preprocess_ID0000001.sub
      02/18/16 08:03:00 Job submit try 1/6 failed, will try again in >= 1 second.
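
A minimal sketch of how such failures could be detected, assuming dagman.out lines are worded exactly as in the excerpt above. This is not the pegasus-analyzer implementation. The function name `find_submit_failures` and the shape of the returned dictionary are hypothetical, and other DAGMan versions may word these messages differently.

```python
import re

# Patterns derived only from the dagman.out excerpt in this ticket (assumption:
# timestamps are "MM/DD/YY HH:MM:SS" and the messages are worded as shown).
TIMESTAMP = r"\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} "
NODE_RE = re.compile(TIMESTAMP + r"Submitting Condor Node (\S+) job\(s\)\.\.\.")
SUBMIT_ERROR_RE = re.compile(TIMESTAMP + r"From submit: (ERROR: .*)")
RETRY_RE = re.compile(TIMESTAMP + r"Job submit try (\d+)/(\d+) failed")


def find_submit_failures(lines):
    """Map each node whose submission failed to its errors and retry counts.

    Hypothetical helper: returns
    {node_name: {"errors": [str, ...], "tries": int, "max_tries": int}}.
    """
    failures = {}
    current_node = None   # node named by the most recent "Submitting Condor Node" line
    pending_errors = []   # "From submit: ERROR" messages seen for that node

    for raw in lines:
        line = raw.strip()

        match = NODE_RE.match(line)
        if match:
            current_node = match.group(1)
            pending_errors = []
            continue

        match = SUBMIT_ERROR_RE.match(line)
        if match and current_node:
            pending_errors.append(match.group(1))
            continue

        match = RETRY_RE.match(line)
        if match and current_node:
            entry = failures.setdefault(
                current_node, {"errors": [], "tries": 0, "max_tries": 0}
            )
            # Record each distinct error once, even if every retry repeats it.
            for error in pending_errors:
                if error not in entry["errors"]:
                    entry["errors"].append(error)
            entry["tries"] = int(match.group(1))
            entry["max_tries"] = int(match.group(2))

    return failures
```

The function accepts any iterable of lines, so an open file handle on dagman.out can be passed directly. A node is reported only when a "Job submit try N/M failed" line follows its submission attempt, which keeps successful submissions out of the result.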

            People

            • Assignee: Karan Vahi (vahi)
            • Reporter: Gideon Juve (gideon, inactive)
            • Watchers: 2
