  1. Pegasus
  2. PM-1281

pegasus-analyzer does not show task stdout/stderr for held jobs

      In CondorIO mode, a job failure results in the job being held.
      This happens because Condor cannot find the outputs the job was expected to generate in the remote directory, so transfer_output_files fails.

      When the job fails, the kickstart record is still streamed back to the job.out file on the submit host.

      For such workflows, however, pegasus-analyzer does not display the task stdout and stderr, even though they were streamed back in the kickstart output:

      pegasus-analyzer .
      2018-07-03 16:39:28,154:WARNING:Pegasus.tools.properties(239): cannot access properties file /local-scratch/vahi/work/container/submit/vahi/pegasus/population/run0004/pegasus.4005482386183897378.properties... continuing...

      ***********************************Summary************************************

      Submit Directory : .
      Total jobs : 24 (100.00%)

      1. jobs succeeded : 5 (20.83%)
      2. jobs failed : 3 (12.50%)
      3. jobs held : 3 (12.50%)
      4. jobs unsubmitted : 16 (66.67%)

      ******************************Held jobs' details******************************

      =======================county_population_raster_ID0000008=======================

      submit file : county_population_raster_ID0000008.sub
      last_job_instance_id : 9
      reason : Error from slot1@compute-6.isi.edu: STARTER at 128.9.35.234 failed to send file(s) to <128.9.44.53:9618>: error reading from /var/lib/condor/execute/dir_46553/county_level_pop_2019.tif: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <128.9.35.234:40949>

      =======================county_population_raster_ID0000005=======================

      submit file : county_population_raster_ID0000005.sub
      last_job_instance_id : 11
      reason : Error from slot1@compute-6.isi.edu: STARTER at 128.9.35.234 failed to send file(s) to <128.9.44.53:9618>: error reading from /var/lib/condor/execute/dir_46559/county_level_pop_2018.tif: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <128.9.35.234:45110>

      =======================county_population_raster_ID0000002=======================

      submit file : county_population_raster_ID0000002.sub
      last_job_instance_id : 10
      reason : Error from slot1@compute-6.isi.edu: STARTER at 128.9.35.234 failed to send file(s) to <128.9.44.53:9618>: error reading from /var/lib/condor/execute/dir_46554/county_level_pop_2017.tif: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <128.9.35.234:45986>

      *****************************Failed jobs' details*****************************

      =======================county_population_raster_ID0000008=======================

      last state: POST_SCRIPT_FAILED
      site: condorpool
      submit file: 00/00/county_population_raster_ID0000008.sub
      output file: 00/00/county_population_raster_ID0000008.out
      error file: 00/00/county_population_raster_ID0000008.err

      ------------------------------Task #1 - Summary-------------------------------

      site : condorpool
      hostname : compute-6.isi.edu
      executable : /var/lib/condor/execute/dir_46553/county_population_raster
      arguments : --config county_cohort_pop_config.ini --shapefile SouthSudan_CountyPopulation.shp --year 2019 --outfile county_level_pop_2019.tif
      exitcode : 1
      working dir : /local-scratch/vahi/work/container/submit/vahi/pegasus/population/run0004

      =======================county_population_raster_ID0000005=======================

      last state: POST_SCRIPT_FAILED
      site: condorpool
      submit file: 00/00/county_population_raster_ID0000005.sub
      output file: 00/00/county_population_raster_ID0000005.out
      error file: 00/00/county_population_raster_ID0000005.err

      ------------------------------Task #1 - Summary-------------------------------

      site : condorpool
      hostname : compute-6.isi.edu
      executable : /var/lib/condor/execute/dir_46559/county_population_raster
      arguments : --config county_cohort_pop_config.ini --shapefile SouthSudan_CountyPopulation.shp --year 2018 --outfile county_level_pop_2018.tif
      exitcode : 1
      working dir : /local-scratch/vahi/work/container/submit/vahi/pegasus/population/run0004

      =======================county_population_raster_ID0000002=======================

      last state: POST_SCRIPT_FAILED
      site: condorpool
      submit file: 00/00/county_population_raster_ID0000002.sub
      output file: 00/00/county_population_raster_ID0000002.out
      error file: 00/00/county_population_raster_ID0000002.err

      ------------------------------Task #1 - Summary-------------------------------

      site : condorpool
      hostname : compute-6.isi.edu
      executable : /var/lib/condor/execute/dir_46554/county_population_raster
      arguments : --config county_cohort_pop_config.ini --shapefile SouthSudan_CountyPopulation.shp --year 2017 --outfile county_level_pop_2017.tif
      exitcode : 1
      working dir : /local-scratch/vahi/work/container/submit/vahi/pegasus/population/run0004
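
Since the kickstart record does reach the job's .out file, the task stdout and stderr could in principle be recovered from it. Below is a minimal sketch of that idea, not the actual pegasus-analyzer code. It assumes the record is XML in which `statcall` elements with `id="stdout"` and `id="stderr"` each carry a `data` child; the sample record is hypothetical and heavily trimmed, and `task_streams` and `local_name` are illustrative names. A real .out file may hold more than one record or extra non-XML lines, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed kickstart record. The element and attribute
# names are an assumption about the XML layout; the stream contents are
# placeholders, not output from the jobs in this report.
SAMPLE_RECORD = """<invocation xmlns="http://pegasus.isi.edu/schema/invocation">
  <mainjob>
    <status raw="256"><regular exitcode="1"/></status>
  </mainjob>
  <statcall id="stdout"><data>placeholder stdout line</data></statcall>
  <statcall id="stderr"><data>placeholder stderr line</data></statcall>
</invocation>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def task_streams(record_text):
    """Return the captured stdout and stderr text from one XML record."""
    root = ET.fromstring(record_text)
    streams = {"stdout": "", "stderr": ""}
    for elem in root.iter():
        if local_name(elem.tag) != "statcall":
            continue
        stream_id = elem.get("id")
        if stream_id not in streams:
            continue  # other statcall entries are not task output streams
        for child in elem:
            if local_name(child.tag) == "data":
                streams[stream_id] = (child.text or "").strip()
    return streams


if __name__ == "__main__":
    for name, text in task_streams(SAMPLE_RECORD).items():
        print(f"{name}: {text}")
```

For a held job, the same extraction would be applied to the contents of the job's .out file (for example 00/00/county_population_raster_ID0000008.out above) before printing the task summary.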

            Assignee:
            vahi Karan Vahi
            Reporter:
            vahi Karan Vahi