Pegasus / PM-2

Jobs hang with seqexec


    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version/s: 2.0
    • Component/s: CLI: pegasus-run
    • Environment:
      Operating System: All
      Platform: All

      Clustered jobs will occasionally hang when using seqexec.

      The problem appears to be the call to kill() near line 695 in seqexec.c. The arguments look like they have been reversed. The line reads:

      if ( kill( 0, child ) == 0 )

      But should read:

      if ( kill( child, 0 ) == 0 )

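      The former sends a signal whose number equals the child's process id to every process in the caller's process group (a pid argument of 0 addresses the whole group). The latter sends no signal at all: with a signal number of 0, kill() only performs its existence and permission checks, so it returns 0 if the child is still alive (the desired result).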
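
      For illustration, the intended liveness check can be sketched as a small helper. This is a minimal sketch, not code from seqexec.c: the name process_exists is hypothetical, and treating EPERM as "alive" is an assumption about what the caller wants.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical helper (not taken from seqexec.c): report whether the
 * process with the given pid currently exists. */
static bool process_exists(pid_t pid)
{
    /* pid 0 and negative pids address process groups, not a single
     * process, so they are rejected here. */
    if (pid <= 0)
        return false;

    /* Signal number 0 sends nothing; kill() only runs its existence
     * and permission checks. Note the argument order: pid first. */
    if (kill(pid, 0) == 0)
        return true;

    /* EPERM means the process exists but we may not signal it;
     * ESRCH means there is no such process. */
    return errno == EPERM;
}
```

      With a helper like this, the condition in the report would read if ( process_exists( child ) ), which leaves no room to swap the two kill() arguments at the call site.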

      The problem occurs when the process ids on the remote machine wrap around and fall within the range of valid signal numbers. It looks like some harmful signals have been blocked, but SIGSTOP cannot be blocked: if the child's process id equals the SIGSTOP signal number (platform-dependent, e.g. 19 on x86 Linux), SIGSTOP is sent to the whole process group and the job hangs.
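
      To show why blocking signals cannot fully mask the reversed call, here is a small sketch that classifies a signal number by whether it can be blocked at all. The function name signal_is_unblockable is hypothetical; the symbolic constants are used because the numeric values differ between platforms.

```c
#include <signal.h>
#include <stdbool.h>

/* Hypothetical helper: POSIX does not allow SIGKILL or SIGSTOP to be
 * caught, blocked, or ignored. Every other signal can at least be
 * blocked with sigprocmask(). */
static bool signal_is_unblockable(int signo)
{
    return signo == SIGKILL || signo == SIGSTOP;
}
```

      With the reversed arguments, a child whose process id satisfies this predicate makes kill() stop or kill the entire process group, whatever signal mask is in place.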

            gmehta Gaurang Mehta (Inactive)
            gideon Gideon Juve (Inactive)