pegasus-exitcode failing for clustered jobs output

XMLWordPrintable

      For my imputation workflows I see some of my clustered jobs failing because of post script failures, even though the kickstart output records all succeeded jobs

      corbusier:test vahi$ tail merge_recode_test.out
      Output file will be whi-pilot-prepped-chr-11-region-150Alt.dat
      </data>
      </statcall>
      <statcall error="0" id="stderr">
      <temporary name="/tmp/condor/lib/condor/execute/dir_2140/gs.err.4Ldhe4" descriptor="5"/>
      <statinfo mode="0100600" size="0" inode="13664316" nlink="1" blksize="4096" blocks="0" mtime="2013-04-27T14:17:30-07:00" atime="2013-04-27T14:17:30-07:00" ctime="2013-04-27T14:17:30-07:00" uid="99" user="nobody" gid="99" group="nobody"/>
      </statcall>
      </invocation>
      [cluster-task id=12, start="2013-04-27T14:17:30.507-07:00", duration=0.124, status=0, line=24, pid=2294, app="/usr/bin/recode_to_merlin.sh"]
      [cluster-summary stat="ok", lines=24, tasks=12, succeeded=12, failed=0, extra=0, duration=1.599, start="2013-04-27T14:17:29.031-07:00", pid=2238, app="pegasus-cluster"]

      When I run on the command line, I see
      corbusier:test vahi$ /tmp/pegasus-4.3.0cvs/bin/pegasus-exitcode /lfs1/work/page/work/dags/vahi/pegasus/imputation-whi-pilot/run0001/merge_recode_test.out
      Traceback (most recent call last):
      File "/tmp/pegasus-4.3.0cvs/bin/pegasus-exitcode", line 275, in <module>
      main()
      File "/tmp/pegasus-4.3.0cvs/bin/pegasus-exitcode", line 267, in main
      if pegasuslite_failures(errfile):
      File "/tmp/pegasus-4.3.0cvs/bin/pegasus-exitcode", line 194, in pegasuslite_failures
      if not os.path.isfile(errfile):
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/genericpath.py", line 29, in isfile
      st = os.stat(path)
      TypeError: coercing to Unicode: need string or buffer, NoneType found

      The really funny thing, is that when i run exitcode again on the backed up .out file it succeeds!
      corbusier:test vahi$ /tmp/pegasus-4.3.0cvs/bin/pegasus-exitcode merge_recode_test.out.000
      corbusier:test vahi$ echo $?
      0

      The same jobs as part of a workflow submitted Friday night worked.
      It could be that some of the recent kickstart changes might be the culprit.

      I see the same issue with 4.2 exitcode with the 4.3.0cvs kickstart output

        1. PM-707.tgz
          5 kB
          Karan Vahi
        2. PM-707-v2.tgz
          13 kB
          Karan Vahi

            Assignee:
            Gideon Juve (Inactive)
            Reporter:
            Karan Vahi
            Watchers:
            3 Start watching this issue

              Created:
              Updated:
              Resolved: