Type: Bug
Resolution: Fixed
Priority: Major
Affects Version/s: master, 4.8.2
Component/s: Monitord, statistics visualization and debugging tools
None
In CondorIO mode, a job failure results in the job being held.
This happens because Condor cannot find the outputs the job was expected to generate in the remote directory, so transfer_output_files fails.
However, the kickstart record is still streamed back to the job.out file on the submit host when the job fails.
For such workflows, pegasus-analyzer does not display the task stdout and stderr, even though they were streamed back in the kickstart output:
pegasus-analyzer .
2018-07-03 16:39:28,154:WARNING:Pegasus.tools.properties(239): cannot access properties file /local-scratch/vahi/work/container/submit/vahi/pegasus/population/run0004/pegasus.4005482386183897378.properties... continuing...
***********************************Summary************************************
Submit Directory : .
Total jobs : 24 (100.00%)
- jobs succeeded : 5 (20.83%)
- jobs failed : 3 (12.50%)
- jobs held : 3 (12.50%)
- jobs unsubmitted : 16 (66.67%)
******************************Held jobs' details******************************
=======================county_population_raster_ID0000008=======================
submit file : county_population_raster_ID0000008.sub
last_job_instance_id : 9
reason : Error from slot1@compute-6.isi.edu: STARTER at 128.9.35.234 failed to send file(s) to <128.9.44.53:9618>: error reading from /var/lib/condor/execute/dir_46553/county_level_pop_2019.tif: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <128.9.35.234:40949>
=======================county_population_raster_ID0000005=======================
submit file : county_population_raster_ID0000005.sub
last_job_instance_id : 11
reason : Error from slot1@compute-6.isi.edu: STARTER at 128.9.35.234 failed to send file(s) to <128.9.44.53:9618>: error reading from /var/lib/condor/execute/dir_46559/county_level_pop_2018.tif: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <128.9.35.234:45110>
=======================county_population_raster_ID0000002=======================
submit file : county_population_raster_ID0000002.sub
last_job_instance_id : 10
reason : Error from slot1@compute-6.isi.edu: STARTER at 128.9.35.234 failed to send file(s) to <128.9.44.53:9618>: error reading from /var/lib/condor/execute/dir_46554/county_level_pop_2017.tif: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <128.9.35.234:45986>
*****************************Failed jobs' details*****************************
=======================county_population_raster_ID0000008=======================
last state: POST_SCRIPT_FAILED
site: condorpool
submit file: 00/00/county_population_raster_ID0000008.sub
output file: 00/00/county_population_raster_ID0000008.out
error file: 00/00/county_population_raster_ID0000008.err
------------------------------Task #1 - Summary-------------------------------
site : condorpool
hostname : compute-6.isi.edu
executable : /var/lib/condor/execute/dir_46553/county_population_raster
arguments : --config county_cohort_pop_config.ini --shapefile SouthSudan_CountyPopulation.shp --year 2019 --outfile county_level_pop_2019.tif
exitcode : 1
working dir : /local-scratch/vahi/work/container/submit/vahi/pegasus/population/run0004
=======================county_population_raster_ID0000005=======================
last state: POST_SCRIPT_FAILED
site: condorpool
submit file: 00/00/county_population_raster_ID0000005.sub
output file: 00/00/county_population_raster_ID0000005.out
error file: 00/00/county_population_raster_ID0000005.err
------------------------------Task #1 - Summary-------------------------------
site : condorpool
hostname : compute-6.isi.edu
executable : /var/lib/condor/execute/dir_46559/county_population_raster
arguments : --config county_cohort_pop_config.ini --shapefile SouthSudan_CountyPopulation.shp --year 2018 --outfile county_level_pop_2018.tif
exitcode : 1
working dir : /local-scratch/vahi/work/container/submit/vahi/pegasus/population/run0004
=======================county_population_raster_ID0000002=======================
last state: POST_SCRIPT_FAILED
site: condorpool
submit file: 00/00/county_population_raster_ID0000002.sub
output file: 00/00/county_population_raster_ID0000002.out
error file: 00/00/county_population_raster_ID0000002.err
------------------------------Task #1 - Summary-------------------------------
site : condorpool
hostname : compute-6.isi.edu
executable : /var/lib/condor/execute/dir_46554/county_population_raster
arguments : --config county_cohort_pop_config.ini --shapefile SouthSudan_CountyPopulation.shp --year 2017 --outfile county_level_pop_2017.tif
exitcode : 1
working dir : /local-scratch/vahi/work/container/submit/vahi/pegasus/population/run0004
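As a rough way to check that the task stdout and stderr really are present in the streamed-back kickstart record, the job's .out file can be inspected directly. The sketch below is not the pegasus-analyzer implementation. It assumes the kickstart record is an XML <invocation> document in which the task streams appear as <statcall id="stdout"> and <statcall id="stderr"> elements with a <data> child. The function name extract_task_output and the SAMPLE record are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET


def _local_name(tag):
    """Strip an XML namespace prefix of the form '{uri}name'."""
    return tag.rsplit("}", 1)[-1]


def extract_task_output(out_text):
    """Return one {'stdout': ..., 'stderr': ...} dict per kickstart record.

    Assumption: each record is an <invocation>...</invocation> XML document
    and the task streams are stored as <statcall id="stdout|stderr"><data>.
    """
    results = []
    # A job .out file may hold several records, so pull out each one separately.
    for match in re.finditer(r"<invocation\b.*?</invocation>", out_text, re.DOTALL):
        root = ET.fromstring(match.group(0))
        streams = {"stdout": "", "stderr": ""}
        for elem in root.iter():
            if _local_name(elem.tag) == "statcall" and elem.get("id") in streams:
                for child in elem:
                    if _local_name(child.tag) == "data":
                        streams[elem.get("id")] = (child.text or "").strip()
        results.append(streams)
    return results


# Made-up record for illustration only; real records carry many more elements.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<invocation xmlns="http://pegasus.isi.edu/schema/invocation">
  <statcall error="0" id="stdout"><data>reading shapefile</data></statcall>
  <statcall error="0" id="stderr"><data>could not open input</data></statcall>
</invocation>
"""

records = extract_task_output(SAMPLE)
```

To try it on one of the failed jobs above, read 00/00/county_population_raster_ID0000008.out from the submit directory and pass its contents to extract_task_output. A non-empty stdout or stderr in the result would suggest the data reached the submit host and is only missing from the analyzer's report.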