- Type: Improvement
- Resolution: Fixed
- Priority: Major
- Affects Version/s: master
- Component/s: Pegasus Planner
Currently, Pegasus specifies values for periodic_release and periodic_remove that allow a job to cycle through hold/release 3 times, which is equivalent to retrying the job 3 times. Since we already have DAGMan retries, this functionality is not necessary. Furthermore, it is harmful because a) combining it with DAGMan retries can result in the job running 9 times before permanently failing, b) it can cause log files to be lost, and c) it makes it difficult to detect hold/release cycles in running workflows.
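The figure of 9 in point a) is the product of the two retry mechanisms. A minimal sketch of that arithmetic, assuming 3 DAGMan attempts and treating each hold/release cycle as one run (the function name and both parameter names are hypothetical, chosen only for illustration):

```python
def worst_case_runs(dagman_attempts: int, runs_per_attempt: int) -> int:
    """Upper bound on how many times a job may run before failing for good.

    Assumption: every DAGMan attempt can itself run the job
    `runs_per_attempt` times through hold/release cycling.
    """
    if dagman_attempts < 1 or runs_per_attempt < 1:
        raise ValueError("both counts must be at least 1")
    return dagman_attempts * runs_per_attempt


if __name__ == "__main__":
    # Hold/release cycling enabled: 3 attempts, each running up to 3 times.
    print(worst_case_runs(3, 3))
    # Hold/release cycling disabled: DAGMan retries alone bound the count.
    print(worst_case_runs(3, 1))
```

With the hold/release cycling removed, the bound falls back to the DAGMan retry count alone.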
One solution is to specify:
periodic_release = False
periodic_remove = (JobStatus==5)
These settings cause any job that enters the held state (JobStatus 5) to be removed immediately. With them, DAGMan treats held jobs as failures, invokes the exitcode script, and retries the job.
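For illustration, the two settings could be applied to an existing submit description along these lines. This is only a sketch, not the planner's actual code: the helper name with_hold_policy and the sample submit text are hypothetical, and it assumes the submit file uses simple "key = value" lines followed by a queue statement.

```python
# The two settings proposed in this ticket.
HOLD_POLICY = {
    "periodic_release": "False",
    "periodic_remove": "(JobStatus==5)",
}


def with_hold_policy(submit_text: str) -> str:
    """Return submit_text with the hold policy placed before the queue statement.

    Any earlier periodic_release / periodic_remove lines are dropped so
    that the new values are the only ones present.
    """
    policy_lines = [f"{key} = {value}" for key, value in HOLD_POLICY.items()]
    out = []
    inserted = False
    for line in submit_text.splitlines():
        key = line.split("=", 1)[0].strip().lower()
        if key in HOLD_POLICY:
            continue  # drop the previous setting
        # Submit commands must appear before the queue statement.
        if not inserted and line.strip().lower().split()[:1] == ["queue"]:
            out.extend(policy_lines)
            inserted = True
        out.append(line)
    if not inserted:
        # No queue statement found: append the policy at the end.
        out.extend(policy_lines)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = "executable = /bin/true\nperiodic_release = True\nqueue\n"
    print(with_hold_policy(sample), end="")
```

Dropping any earlier periodic_release or periodic_remove line before adding the new ones avoids leaving two conflicting values in the same file.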