- Type: Improvement
- Resolution: Won't Fix
- Priority: Major
- Affects Version/s: master
- Component/s: Pegasus Planner
Sometimes when a service like GridFTP goes down, a workflow burns through all of its retries in a few minutes.
SCEC occasionally hits this problem on production runs.
It would be better if there were some exponential back-off delay between job retries.
One way to implement this would be to use a requirements expression to impose the delay.
We would have to set a ClassAd attribute with the retry number in the DAG file, like this:
VARS myjob +DAGNodeRetry="$(RETRY)"
That appears to work with Condor 8.2+.
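For context, a minimal DAG file sketch of how the RETRY and VARS lines would sit together (the node name myjob, the submit file myjob.sub, and the retry count of 5 are just placeholders):

# myjob.dag -- RETRY gives DAGMan the retry budget; the "+" prefix on the
# VARS name puts DAGNodeRetry directly into the job's ClassAd on each retry.
JOB myjob myjob.sub
RETRY myjob 5
VARS myjob +DAGNodeRetry="$(RETRY)"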
Then we could set up a requirements expression that computes a delay based on the retry number. This works for the vanilla and local universes:
requirements = (time() >= (QDate + (DAGNodeRetry * 300)))
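A rough, untested sketch of how that would look in the corresponding submit file (the executable name and the 300-second step are placeholders):

# myjob.sub -- hypothetical vanilla-universe submit file.
# DAGNodeRetry comes from the DAG VARS line above; the job cannot match a
# machine until (retry number * 300) seconds have passed since it was queued.
universe     = vanilla
executable   = myjob.sh
requirements = (time() >= (QDate + (DAGNodeRetry * 300)))
log          = myjob.log
queue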
For the grid universe we would probably have to do something different. I tried using hold with periodic_release, but that doesn't work because hold is not an expression:
hold = (DAGNodeRetry > 0)
periodic_release = (time() >= (QDate + (DAGNodeRetry * 300)))
This doesn't seem to work either:
hold = $$([DAGNodeRetry > 0])
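One untested alternative would be to move the whole thing into periodic_hold, which (unlike hold) is a ClassAd expression that is re-evaluated while the job is in the queue. Whether the gridmanager evaluates these expressions promptly enough for grid universe jobs is an assumption; this is only a sketch:

# Hypothetical, untested: hold the job while the retry delay has not elapsed,
# then let periodic_release put it back in the queue.
periodic_hold    = (DAGNodeRetry > 0) && (time() < (QDate + (DAGNodeRetry * 300)))
periodic_release = (time() >= (QDate + (DAGNodeRetry * 300)))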