Jobs hang with seqexec


    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version/s: 2.0
    • Component/s: CLI: pegasus-run
    • Environment:
      Operating System: All
      Platform: All

      Clustered jobs will occasionally hang when using seqexec.

      The problem appears to be the call to kill() near line 695 in seqexec.c. The arguments look like they have been reversed. The line reads:

      if ( kill( 0, child ) == 0 )

      But should read:

      if ( kill( child, 0 ) == 0 )

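      The former passes the child's process id as the signal number and, because the pid argument is 0, sends that signal to every process in the caller's process group. The latter sends no signal (signal 0 performs only the existence and permission checks) and returns 0 if the child process still exists, which is the desired result.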
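
For illustration, the liveness check described above might be wrapped as follows. This is a minimal sketch and not code from seqexec.c: the names `process_exists` and `demo_liveness_check` are hypothetical, and the treatment of `EPERM` as "exists" is an assumption about what a caller would want.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not from seqexec.c).
 * Returns 1 if a process with this pid exists, 0 if it does not,
 * and -1 on invalid input or an unexpected error. */
int process_exists(pid_t pid)
{
    if (pid <= 0)
        return -1;          /* pid 0 or negative would address a process group */
    if (kill(pid, 0) == 0)
        return 1;           /* signal 0: nothing is delivered, only checks run */
    if (errno == EPERM)
        return 1;           /* process exists but we may not signal it */
    if (errno == ESRCH)
        return 0;           /* no such process */
    return -1;
}

/* Hypothetical demo: fork a child that exits at once, then probe it
 * before and after it has been reaped. Returns 0 if both probes give
 * the expected answer. */
int demo_liveness_check(void)
{
    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0)
        _exit(0);

    /* An unreaped child (running or zombie) still counts as existing. */
    int before = process_exists(child);

    int status;
    if (waitpid(child, &status, 0) != child)
        return -1;

    /* Once reaped, the pid is gone (barring immediate pid reuse). */
    int after = process_exists(child);

    return (before == 1 && after == 0) ? 0 : 1;
}
```

Note that the pid is always the first argument to kill() and the signal the second; the guard against `pid <= 0` keeps the probe from ever addressing a whole process group.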

      The problem occurs when process ids on the remote machine wrap around and fall within the range of the SIG* signal numbers. It looks like some harmful signals have been blocked, but if the child process id equals SIGSTOP, which cannot be blocked, that signal is sent to the process group and the job hangs. (Signal numbers are platform-dependent: on x86 Linux, SIGSTOP is 19 and 20 is SIGTSTP.)
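
To make the collision concrete, a small diagnostic could report whether a given process id coincides with one of the two signals that POSIX does not allow a process to block, catch, or ignore. This is only a sketch: the function name `unblockable_signal_for_pid` is hypothetical and nothing like it is known to exist in seqexec.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical diagnostic (not part of seqexec).
 * If pid, misused as a signal number, would be SIGKILL or SIGSTOP,
 * return that signal's name; otherwise return NULL. The numeric
 * values come from <signal.h>, so the check follows the platform. */
const char *unblockable_signal_for_pid(pid_t pid)
{
    if (pid == (pid_t)SIGKILL)
        return "SIGKILL";
    if (pid == (pid_t)SIGSTOP)
        return "SIGSTOP";
    return NULL;
}
```

With the reversed arguments, a child whose pid makes this function return non-NULL would cause the whole process group to be killed or stopped regardless of any signal mask.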

            Assignee:
            Gaurang Mehta (Inactive)
            Reporter:
            Gideon Juve
            Archiver:
            Rajiv Mayani
