- Type: Bug
- Resolution: Fixed
- Priority: Major
- Affects Version/s: 4.5.0
- Component/s: CLI: exitcode/exitpost
Vickie ran into a problem with pegasus-exitcode today: many of her jobs should have been marked as failures, but pegasus-exitcode did not mark them as failed. Their output looked like this:
insufficient allocation - please contact your PI
+ --------------------------------------------------------------------------
+ Job name: STDIN
+ Job Id: 9230114.hopque01
+ System: hopper
+ Queued Time: Fri May 15 20:30:06 2015
+ Start Time: Sat May 16 11:12:19 2015
+ Completion Time: Sat May 16 11:12:19 2015
+ User: vlynch
+ MOM Host: nid05416
+ Queue: reg_long
+ Req. Resources: mppnodect=34,mppnppn=24,mppwidth=800,walltime=96:00:00
+ Used Resources:
+ Acct String: m1503
+ PBS_O_WORKDIR: /scratch/scratchdirs/vlynch/lynchve/pegasus/refinement/run0001
+ Submit Args:
+ --------------------------------------------------------------------------
My first thought was that this should have been a GRAM failure, but I will have to follow up with NERSC about that to see where that error message is coming from.
The other issue is that this job (and similar jobs) was configured with a success message of "End of program", and that message never appeared in the output, so the job should have failed on that check alone. The postscript command looked like this:
/usr/bin/pegasus-exitcode -s End+of+program -r $RETURN /ccg/home/lynchve/SNS-Nanodiamond/8ND300Kscan2/submit/lynchve/pegasus/refinement/run0001/namd_ID0000003.out
The problem is the "-r $RETURN". I assume that is added whenever Kickstart is not used. Is that correct?
The question is: What should we do if exitcode gets "-r 0"? Currently exitcode ignores the other tests if it gets "-r 0". Maybe we should modify it so that it only ignores the invocation record tests if it gets "-r"? Maybe it shouldn't ignore anything and we should have a different flag for "no invocation record(s) expected"?
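To make the first option concrete, here is a rough sketch of a check that still applies the success-message test when "-r 0" is given. This is not the actual pegasus-exitcode code: the function name job_succeeded and its parameters are made up for illustration, and it assumes that the "+" in "-s End+of+program" encodes a space.

```python
from urllib.parse import unquote_plus


def job_succeeded(output_text, success_messages, returned_status=None):
    """Hypothetical check, NOT the real pegasus-exitcode implementation.

    returned_status models the value passed with -r (None if -r was absent).
    success_messages models the values passed with -s, still '+'-encoded.
    """
    # A nonzero status passed with -r is always a failure.
    if returned_status is not None and returned_status != 0:
        return False

    # Proposed behavior: keep checking success messages even with -r 0.
    for encoded in success_messages:
        message = unquote_plus(encoded)  # assumption: '+' encodes a space
        if message not in output_text:
            return False

    return True


# Output similar to the failed jobs above: no "End of program" line.
failed_output = "insufficient allocation - please contact your PI\n"
print(job_succeeded(failed_output, ["End+of+program"], returned_status=0))
```

With this logic, the job above would be reported as failed even though "-r 0" was passed, because the expected success message is missing from the output. Invocation record checks are left out of the sketch, since those are the tests that "-r" would still skip under this option.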