Details
Bug
Resolution: Fixed
Major
master, 4.8.2
None
Description
In CondorIO mode, a job failure results in the job being held.
This is because Condor cannot find the outputs the job was expected to generate in the remote directory, so transfer_output_files fails.
However, when the job fails, the kickstart record is still streamed back to the job.out file on the submit host.
For such workflows, pegasus-analyzer does not display the task stdout and stderr, even though they were streamed back in the kickstart output:
pegasus-analyzer .
2018-07-03 16:39:28,154:WARNING:Pegasus.tools.properties(239): cannot access properties file /local-scratch/vahi/work/container/submit/vahi/pegasus/population/run0004/pegasus.4005482386183897378.properties... continuing...
************************************Summary*************************************
Submit Directory : .
Total jobs : 24 (100.00%)
# jobs succeeded : 5 (20.83%)
# jobs failed : 3 (12.50%)
# jobs held : 3 (12.50%)
# jobs unsubmitted : 16 (66.67%)
*******************************Held jobs' details*******************************
=======================county_population_raster_ID0000008=======================
submit file : county_population_raster_ID0000008.sub
last_job_instance_id : 9
reason : Error from slot1@compute-6.isi.edu: STARTER at 128.9.35.234 failed to send file(s) to <128.9.44.53:9618>: error reading from /var/lib/condor/execute/dir_46553/county_level_pop_2019.tif: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <128.9.35.234:40949>
=======================county_population_raster_ID0000005=======================
submit file : county_population_raster_ID0000005.sub
last_job_instance_id : 11
reason : Error from slot1@compute-6.isi.edu: STARTER at 128.9.35.234 failed to send file(s) to <128.9.44.53:9618>: error reading from /var/lib/condor/execute/dir_46559/county_level_pop_2018.tif: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <128.9.35.234:45110>
=======================county_population_raster_ID0000002=======================
submit file : county_population_raster_ID0000002.sub
last_job_instance_id : 10
reason : Error from slot1@compute-6.isi.edu: STARTER at 128.9.35.234 failed to send file(s) to <128.9.44.53:9618>: error reading from /var/lib/condor/execute/dir_46554/county_level_pop_2017.tif: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <128.9.35.234:45986>
******************************Failed jobs' details******************************
=======================county_population_raster_ID0000008=======================
last state: POST_SCRIPT_FAILED
site: condorpool
submit file: 00/00/county_population_raster_ID0000008.sub
output file: 00/00/county_population_raster_ID0000008.out
error file: 00/00/county_population_raster_ID0000008.err
-------------------------------Task #1 - Summary--------------------------------
site : condorpool
hostname : compute-6.isi.edu
executable : /var/lib/condor/execute/dir_46553/county_population_raster
arguments : --config county_cohort_pop_config.ini --shapefile SouthSudan_CountyPopulation.shp --year 2019 --outfile county_level_pop_2019.tif
exitcode : 1
working dir : /local-scratch/vahi/work/container/submit/vahi/pegasus/population/run0004
=======================county_population_raster_ID0000005=======================
last state: POST_SCRIPT_FAILED
site: condorpool
submit file: 00/00/county_population_raster_ID0000005.sub
output file: 00/00/county_population_raster_ID0000005.out
error file: 00/00/county_population_raster_ID0000005.err
-------------------------------Task #1 - Summary--------------------------------
site : condorpool
hostname : compute-6.isi.edu
executable : /var/lib/condor/execute/dir_46559/county_population_raster
arguments : --config county_cohort_pop_config.ini --shapefile SouthSudan_CountyPopulation.shp --year 2018 --outfile county_level_pop_2018.tif
exitcode : 1
working dir : /local-scratch/vahi/work/container/submit/vahi/pegasus/population/run0004
=======================county_population_raster_ID0000002=======================
last state: POST_SCRIPT_FAILED
site: condorpool
submit file: 00/00/county_population_raster_ID0000002.sub
output file: 00/00/county_population_raster_ID0000002.out
error file: 00/00/county_population_raster_ID0000002.err
-------------------------------Task #1 - Summary--------------------------------
site : condorpool
hostname : compute-6.isi.edu
executable : /var/lib/condor/execute/dir_46554/county_population_raster
arguments : --config county_cohort_pop_config.ini --shapefile SouthSudan_CountyPopulation.shp --year 2017 --outfile county_level_pop_2017.tif
exitcode : 1
working dir : /local-scratch/vahi/work/container/submit/vahi/pegasus/population/run0004
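The hold reasons in the output above all follow the same pattern: the starter names the output file it could not read, followed by the errno. As a rough illustration only (this is not pegasus-analyzer code; the function name missing_output and the regular expression are hypothetical and derived solely from the reason strings in this report), the missing file can be pulled out of a hold reason like this:

```python
import re

# Hypothetical pattern based only on the hold reason text in this report:
# "error reading from <path>: (errno <n>) <message>; ..."
MISSING_FILE = re.compile(
    r"error reading from (?P<path>\S+): \(errno (?P<errno>\d+)\) (?P<message>[^;]+)"
)


def missing_output(reason):
    """Return (path, errno, message) for a hold reason, or None if the
    reason does not describe a failed output transfer."""
    match = MISSING_FILE.search(reason)
    if match is None:
        return None
    return (
        match.group("path"),
        int(match.group("errno")),
        match.group("message").strip(),
    )


# Hold reason copied from county_population_raster_ID0000008 above.
reason = (
    "Error from slot1@compute-6.isi.edu: STARTER at 128.9.35.234 failed to "
    "send file(s) to <128.9.44.53:9618>: error reading from "
    "/var/lib/condor/execute/dir_46553/county_level_pop_2019.tif: (errno 2) "
    "No such file or directory; SHADOW failed to receive file(s) from "
    "<128.9.35.234:40949>"
)

print(missing_output(reason))
```

For ID0000008 the extracted file name, county_level_pop_2019.tif, matches the --outfile argument of the task that exited with code 1. This is consistent with the hold being a side effect of the task failure, not a separate transfer problem.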