  1. Pegasus
  2. PM-462

Consider eliminating hold/release cycles caused by periodic_release/periodic_remove

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: master
    • Fix Version/s: 4.1
    • Component/s: Pegasus Planner
    • Labels: None

    Description

      Currently, Pegasus sets periodic_release and periodic_remove to values that let a job cycle through hold/release 3 times, which is equivalent to retrying the job 3 times. Since DAGMan already provides retries, this functionality is unnecessary. It is also harmful because a) combined with DAGMan retries, it can result in the job running 9 times before permanently failing, b) it can cause log files to be lost, and c) it makes it difficult to detect hold/release cycles in running workflows.
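
The figure of 9 runs follows from the two retry layers multiplying. A minimal sketch of that arithmetic, assuming 3 DAGMan attempts (the ticket implies this number but does not state it); the function name `worst_case_runs` is hypothetical and used only for illustration:

```python
def worst_case_runs(runs_per_submission: int, dagman_attempts: int) -> int:
    """Upper bound on job executions when both retry layers are active.

    runs_per_submission: how many times one submitted job can run because of
        hold/release cycling (3 per the ticket).
    dagman_attempts: how many times DAGMan submits the job in total
        (assumed to be 3 here; not stated in the ticket).
    """
    return runs_per_submission * dagman_attempts


# Current behaviour: hold/release cycling stacked on top of DAGMan retries.
print(worst_case_runs(3, 3))  # 9

# With hold/release cycling disabled, only the DAGMan layer remains.
print(worst_case_runs(1, 3))  # 3
```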

      One solution is to specify:
      periodic_release = False
      periodic_remove = (JobStatus==5)

      That causes any job that enters the hold state to be removed immediately. With these settings, DAGMan does treat held jobs as failures, does invoke the exitcode script, and does retry the job.
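
The effect of the two proposed expressions can be modelled in a few lines. This is only a sketch of the intended policy, not how HTCondor evaluates the expressions internally: the function `policy_action` and the order in which it checks the two expressions are assumptions made for illustration. The numeric status codes are HTCondor's standard JobStatus values, where 5 means held.

```python
# Standard HTCondor JobStatus codes.
IDLE, RUNNING, REMOVED, COMPLETED, HELD = 1, 2, 3, 4, 5


def periodic_release(job_status: int) -> bool:
    """Model of `periodic_release = False`: a held job is never released."""
    return False


def periodic_remove(job_status: int) -> bool:
    """Model of `periodic_remove = (JobStatus==5)`: remove any held job."""
    return job_status == HELD


def policy_action(job_status: int) -> str:
    """Hypothetical helper: which action the two expressions select.

    The check order is an assumption; because periodic_release is always
    False here, the result does not depend on it.
    """
    if periodic_remove(job_status):
        return "remove"
    if job_status == HELD and periodic_release(job_status):
        return "release"
    return "none"


for status in (IDLE, RUNNING, HELD):
    print(status, policy_action(status))
```

Under this model a held job can only be removed, never released, so each submission runs at most once and any retry has to come from DAGMan.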

          People

            Assignee: Karan Vahi (vahi)
            Reporter: Gideon Juve (gideon, Inactive)