Pegasus / PM-1061

pegasus-analyzer should detect and report on failed job submissions


      When DAGMan is unable to submit a job to Condor, the workflow fails, but pegasus-analyzer reports the job as failed without giving the reason:

      $ pegasus-analyzer .

      ***********************************Summary************************************

      Submit Directory : .
      Total jobs : 13 (100.00%)

      1. jobs succeeded : 3 (23.08%)
      2. jobs failed : 1 (7.69%)
      3. jobs unsubmitted : 9 (69.23%)

      *****************************Failed jobs' details*****************************

      ==============================preprocess_ID0000001==============================

      last state: -
      site: -
      submit file: preprocess_ID0000001.sub
      output file: -
      error file: -

      ------------------------------Task #1 - Summary-------------------------------

      site : -
      hostname : -
      executable : /tmp/tutorial/submit/juve/pegasus/diamond/run0001/preprocess_ID0000001.sh
      arguments : -
      exitcode : -1
      working dir : -

      In this case the cause appears only in the dagman.out file, where condor_submit reports a missing directory:

      02/18/16 08:03:00 Submitting Condor Node preprocess_ID0000001 job(s)...
      02/18/16 08:03:00 Adding a DAGMan workflow log /private/tmp/tutorial/submit/juve/pegasus/diamond/run0001/./diamond-0.dag.nodes.log
      02/18/16 08:03:00 Masking the events recorded in the DAGMAN workflow log
      02/18/16 08:03:00 Mask for workflow log is 0,1,2,4,5,7,9,10,11,12,13,16,17,24,27
      02/18/16 08:03:00 submitting: /usr/local/bin/condor_submit -a dag_node_name' '=' 'preprocess_ID0000001 -a +DAGManJobId' '=' '46 -a DAGManJobId' '=' '46 -a submit_event_notes' '=' 'DAG' 'Node:' 'preprocess_ID0000001 -a dagman_log' '=' '/private/tmp/tutorial/submit/juve/pegasus/diamond/run0001/./diamond-0.dag.nodes.log -a +DAGManNodesMask' '=' '"0,1,2,4,5,7,9,10,11,12,13,16,17,24,27" -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"stage_in_remote_local_0_0" -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never preprocess_ID0000001.sub
      02/18/16 08:03:00 From submit: Submitting job(s)
      02/18/16 08:03:00 From submit: ERROR: No such directory: /tmp/tutorial/juve/pegasus/diamond/run0001
      02/18/16 08:03:00 failed while reading from pipe.
      02/18/16 08:03:00 Read so far: Submitting job(s)ERROR: No such directory: /tmp/tutorial/juve/pegasus/diamond/run0001
      02/18/16 08:03:00 ERROR: submit attempt failed
      02/18/16 08:03:00 submit command was: /usr/local/bin/condor_submit -a dag_node_name' '=' 'preprocess_ID0000001 -a +DAGManJobId' '=' '46 -a DAGManJobId' '=' '46 -a submit_event_notes' '=' 'DAG' 'Node:' 'preprocess_ID0000001 -a dagman_log' '=' '/private/tmp/tutorial/submit/juve/pegasus/diamond/run0001/./diamond-0.dag.nodes.log -a +DAGManNodesMask' '=' '"0,1,2,4,5,7,9,10,11,12,13,16,17,24,27" -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"stage_in_remote_local_0_0" -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never preprocess_ID0000001.sub
      02/18/16 08:03:00 Job submit try 1/6 failed, will try again in >= 1 second.
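
The kind of check requested here could be sketched roughly as follows. This is only an illustration, not pegasus-analyzer code: the function name `find_submit_failures` is hypothetical, and the patterns are taken directly from the dagman.out excerpt above, so other DAGMan versions may word these lines differently.

```python
import re

# Patterns copied from the dagman.out excerpt in this issue (assumption:
# DAGMan emits these exact phrases; other versions may differ).
SUBMITTING = re.compile(r"Submitting Condor Node (\S+) job\(s\)")
FROM_SUBMIT_ERROR = re.compile(r"From submit: (ERROR: .*)$")
SUBMIT_FAILED = re.compile(r"ERROR: submit attempt failed")


def find_submit_failures(lines):
    """Map each DAG node name to the condor_submit errors seen for it.

    Hypothetical helper: tracks the node named in the most recent
    'Submitting Condor Node' line and attaches any 'From submit: ERROR'
    messages once 'submit attempt failed' is logged.
    """
    failures = {}
    current_node = None
    pending = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = SUBMITTING.search(line)
        if match:
            current_node = match.group(1)
            pending = []
            continue
        match = FROM_SUBMIT_ERROR.search(line)
        if match:
            pending.append(match.group(1))
            continue
        if SUBMIT_FAILED.search(line) and current_node is not None:
            errors = failures.setdefault(current_node, [])
            for message in pending:
                # Retries repeat the same message; report it once.
                if message not in errors:
                    errors.append(message)
            pending = []
    return failures
```

Passing the lines of the dagman.out file to this function (for example via `open(path)`, where `path` is wherever the run's dagman.out lives) would yield the failing node and the `ERROR: No such directory: ...` message, which could then be printed in the failed job's details.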

            Assignee:
            vahi Karan Vahi
            Reporter:
            gideon Gideon Juve (Inactive)