Consider eliminating hold/release cycles caused by periodic_release/periodic_remove


    • Type: Improvement
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: 4.1
    • Affects Version/s: master
    • Component/s: Pegasus Planner
    • Labels: None

      Currently, Pegasus sets periodic_release and periodic_remove to values that let a job cycle through hold/release 3 times, which is equivalent to retrying the job 3 times. Since DAGMan retries already cover this, the functionality is unnecessary. It is also harmful because a) combined with DAGMan retries, it can result in the job running 9 times before permanently failing, b) it can cause log files to be lost, and c) it makes it difficult to detect hold/release cycles in running workflows.
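
      The figure of 9 runs comes from two retry layers multiplying. A minimal sketch of that arithmetic follows, assuming 3 DAGMan attempts and 3 runs per attempt from the hold/release cycle. The DAGMan attempt count is an assumption, since the issue states only the product, and worst_case_runs is a hypothetical helper, not part of Pegasus or HTCondor.

```python
def worst_case_runs(dagman_attempts: int, runs_per_attempt: int) -> int:
    """Upper bound on job executions when two retry layers are stacked.

    dagman_attempts:  how many times DAGMan submits the job (assumed value).
    runs_per_attempt: how many times one submission can run because of
                      hold/release cycling (1 means no cycling).
    """
    if dagman_attempts < 1 or runs_per_attempt < 1:
        raise ValueError("both counts must be at least 1")
    # Each DAGMan attempt can itself run runs_per_attempt times.
    return dagman_attempts * runs_per_attempt


# Current behavior: hold/release cycling stacked on DAGMan retries.
print(worst_case_runs(dagman_attempts=3, runs_per_attempt=3))  # 9

# Without hold/release cycling, only the DAGMan attempts remain.
print(worst_case_runs(dagman_attempts=3, runs_per_attempt=1))  # 3
```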

      One solution is to specify:
      periodic_release = False
      periodic_remove = (JobStatus==5)

      That causes any job that enters the hold state (JobStatus 5) to be removed immediately instead of being released. With these settings, DAGMan does treat held jobs as failures, does invoke the exitcode script, and does retry the job.
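
      The effect of the two proposed expressions can be illustrated with a toy model. This is only a sketch: it is not how HTCondor evaluates ClassAd expressions, the dictionary-based job and the next_action helper are hypothetical, and the order in which the two expressions are checked is an assumption that does not matter here because periodic_release is always False.

```python
HELD = 5  # HTCondor JobStatus code for a held job
RUNNING = 2  # HTCondor JobStatus code for a running job


def periodic_release(job: dict) -> bool:
    """Proposed setting: periodic_release = False."""
    return False


def periodic_remove(job: dict) -> bool:
    """Proposed setting: periodic_remove = (JobStatus==5)."""
    return job["JobStatus"] == HELD


def next_action(job: dict) -> str:
    """Return what the toy scheduler does with the job: remove, release, or none."""
    if periodic_remove(job):
        return "remove"
    if job["JobStatus"] == HELD and periodic_release(job):
        return "release"
    return "none"


print(next_action({"JobStatus": HELD}))     # remove
print(next_action({"JobStatus": RUNNING}))  # none
```

      Under this model a held job is never released, so every hold ends in removal and the retry decision is left entirely to DAGMan.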

            Assignee:
            Karan Vahi
            Reporter:
            Gideon Juve (Inactive)
            Watchers:
            2

              Created:
              Updated:
              Resolved: