-
Type: Bug
-
Resolution: Fixed
-
Priority: Critical
-
Affects Version/s: master
-
Component/s: CLI: exitcode/exitpost
-
None
For my imputation workflows I see some of my clustered jobs failing because of post script failures, even though the kickstart output records all succeeded jobs
corbusier:test vahi$ tail merge_recode_test.out
Output file will be whi-pilot-prepped-chr-11-region-150Alt.dat
</data>
</statcall>
<statcall error="0" id="stderr">
<temporary name="/tmp/condor/lib/condor/execute/dir_2140/gs.err.4Ldhe4" descriptor="5"/>
<statinfo mode="0100600" size="0" inode="13664316" nlink="1" blksize="4096" blocks="0" mtime="2013-04-27T14:17:30-07:00" atime="2013-04-27T14:17:30-07:00" ctime="2013-04-27T14:17:30-07:00" uid="99" user="nobody" gid="99" group="nobody"/>
</statcall>
</invocation>
[cluster-task id=12, start="2013-04-27T14:17:30.507-07:00", duration=0.124, status=0, line=24, pid=2294, app="/usr/bin/recode_to_merlin.sh"]
[cluster-summary stat="ok", lines=24, tasks=12, succeeded=12, failed=0, extra=0, duration=1.599, start="2013-04-27T14:17:29.031-07:00", pid=2238, app="pegasus-cluster"]
When I run on the command line, I see
corbusier:test vahi$ /tmp/pegasus-4.3.0cvs/bin/pegasus-exitcode /lfs1/work/page/work/dags/vahi/pegasus/imputation-whi-pilot/run0001/merge_recode_test.out
Traceback (most recent call last):
File "/tmp/pegasus-4.3.0cvs/bin/pegasus-exitcode", line 275, in <module>
main()
File "/tmp/pegasus-4.3.0cvs/bin/pegasus-exitcode", line 267, in main
if pegasuslite_failures(errfile):
File "/tmp/pegasus-4.3.0cvs/bin/pegasus-exitcode", line 194, in pegasuslite_failures
if not os.path.isfile(errfile):
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/genericpath.py", line 29, in isfile
st = os.stat(path)
TypeError: coercing to Unicode: need string or buffer, NoneType found
The really funny thing, is that when i run exitcode again on the backed up .out file it succeeds!
corbusier:test vahi$ /tmp/pegasus-4.3.0cvs/bin/pegasus-exitcode merge_recode_test.out.000
corbusier:test vahi$ echo $?
0
The same jobs as part of a workflow submitted Friday night worked.
It could be that some of the recent kickstart changes might be the culprit.
I see the same issue with 4.2 exitcode with the 4.3.0cvs kickstart output