Pegasus / PM-737

pegasus-exitcode extensions to detect CondorG/GRAM exitcode propagation problem

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: 4.4.0
    • Affects Version/s: None
    • Component/s: None
    • None

      Per HTCondor admin ticket #26573, we have on record that CondorG does not propagate exit codes from GRAM correctly, so we need another way to detect failures in this case. Usually this is not a problem for Pegasus, as most jobs are wrapped by pegasus-kickstart. However, that does not apply to MPI jobs launched by Pegasus, i.e. the ones that have jobtype=mpi in their Globus RSL.

      For that particular case and some others, we are going to add profiles that users can associate with jobs to indicate failure. These profiles identify the string messages that pegasus-exitcode will search for in the stdout and stderr of the job.
      The two Pegasus profile keys are:

      • exitcode.failuremsg: the failure string
      • exitcode.successmsg: the success string

      Note: the exitcode.successmsg string is used only to detect failure, i.e. the absence of that string in stdout indicates failure.

      The rules for pegasus-exitcode failing a job will be:

      • failure string present in the job stdout/stderr
      • absence of the success string in the job stdout/stderr, if one is specified as a profile with the job
      • empty stdout file
      • nonzero exit codes from kickstart records (if provided)

      The absence-of-success-string rule is required to handle the case where, say, GRAM fails to set up and start the MPI/user application code correctly. In that case there are no failure messages, but it is still a failure that we want to detect.
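
      The four rules above can be sketched roughly as follows. This is only an illustration of the intended logic, not the actual pegasus-exitcode code: the function name job_failed, its parameters, and the choice to treat only a zero-length stdout as "empty" are all assumptions made for the example.

```python
from typing import Iterable, Optional


def job_failed(
    stdout: str,
    stderr: str,
    failure_msg: Optional[str] = None,   # value of exitcode.failuremsg, if set
    success_msg: Optional[str] = None,   # value of exitcode.successmsg, if set
    kickstart_exitcodes: Iterable[int] = (),
) -> bool:
    """Hypothetical helper: True if any of the four failure rules applies."""
    streams = (stdout, stderr)

    # Rule 1: the failure string appears in stdout or stderr.
    if failure_msg and any(failure_msg in s for s in streams):
        return True

    # Rule 2: a success string was specified but appears in neither stream.
    if success_msg and not any(success_msg in s for s in streams):
        return True

    # Rule 3: the stdout file is empty (assumption: zero-length only).
    if len(stdout) == 0:
        return True

    # Rule 4: any kickstart record, if provided, reports a nonzero exit code.
    return any(code != 0 for code in kickstart_exitcodes)


# GRAM never started the MPI application: there is some output and no
# failure message, but the success marker is missing, so rule 2 fires.
print(job_failed("GRAM setup output\n", "",
                 failure_msg="MPI_ABORT", success_msg="Run complete"))  # True
```

      The strings "MPI_ABORT" and "Run complete" are placeholder values; in practice they would be whatever messages the user's application is known to print.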

            Assignee:
            Karan Vahi
            Reporter:
            Karan Vahi