  1. Pegasus
  2. PM-1049

Jobs should not be retried immediately, but rather delayed for some time


    • Type: Improvement
    • Resolution: Won't Fix
    • Priority: Major
    • Fix Version/s: master, 5.0.0
    • Affects Version/s: master
    • Component/s: Pegasus Planner
    • Labels: None

      When a service such as GridFTP goes down, the workflow can burn through all of its retries within a few minutes.

      SCEC sometimes hits this problem on production runs.

      It would be better if retried jobs were delayed with some kind of exponential backoff.

      One way to implement this would be to use a requirements expression to impose the delay.

      We would first have to set a ClassAd attribute holding the retry number in the DAG file, like this:

      VARS myjob +DAGNodeRetry="$(RETRY)"

      That appears to work with Condor 8.2 and later.

      Then we could set up a requirements expression that computes a delay from the retry number (linear here, at 300 seconds per retry). This works for the vanilla and local universes:

      requirements = (time() >= (QDate + (DAGNodeRetry * 300)))
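
      To make the arithmetic concrete, here is a minimal Python sketch of the delay that expression implies, next to a hypothetical exponential variant. It only models the comparison; it is not Pegasus or HTCondor code. The function names and the doubling schedule are assumptions made for illustration, and it assumes QDate is the time at which the retried job was resubmitted.

```python
# Illustrative model of the delay implied by the requirements expression
# above. Not part of Pegasus or HTCondor; all names here are hypothetical.

BASE_DELAY_SECONDS = 300  # the 300 used in the expression above


def linear_delay(retry: int, base: int = BASE_DELAY_SECONDS) -> int:
    """Delay implied by (DAGNodeRetry * 300): grows by `base` per retry."""
    return retry * base


def exponential_delay(retry: int, base: int = BASE_DELAY_SECONDS) -> int:
    """Assumed exponential variant: base, 2*base, 4*base, ..."""
    if retry <= 0:
        return 0  # the first attempt is never delayed
    return base * 2 ** (retry - 1)


def is_eligible(now: int, qdate: int, retry: int, delay=linear_delay) -> bool:
    """Mirror of: time() >= (QDate + delay(DAGNodeRetry))."""
    return now >= qdate + delay(retry)


if __name__ == "__main__":
    # Print the two schedules side by side for the first few retries.
    for retry in range(5):
        print(retry, linear_delay(retry), exponential_delay(retry))
```

      With the linear form, the third retry waits 900 seconds after submission; under the assumed doubling schedule it would wait 1200 seconds, and the gap widens with each further retry.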

      For the grid universe we would probably have to do something different. I tried using hold and periodic_release, but that does not work because hold is not evaluated as an expression:

      hold = (DAGNodeRetry > 0)
      periodic_release = (time() >= (QDate + (DAGNodeRetry * 300)))

      This doesn't seem to work either:

      hold = $$([DAGNodeRetry > 0])

            rynge Mats Rynge
            gideon Gideon Juve (Inactive)