  1. Pegasus
  2. PM-1068

monitord fails when trying to open a job error file in a workflow with condor recovery

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: master, 4.6.0
    • Fix Version/s: master, 4.7.0, 4.6.1
    • Component/s: Monitord
    • Labels:
      None
    • Environment:
      Large LIGO run by amber at Syracuse

      Description

      LIGO has a large run in which the sub-workflow is evicted repeatedly. This produces out-of-order events in the DAGMan log for the sub-workflow, which trips up monitord: it fails when trying to open a job error file whose location it has not yet parsed from the submit file.
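
      The failure mode described above can be illustrated with a minimal sketch. This is not monitord's actual code or the fix that was applied; the `Job` class and the helper names `parse_error_path` and `read_job_stderr` are hypothetical. It only assumes that an HTCondor submit file records the stderr location in an `error = <path>` line, and shows one defensive approach: parse the submit file lazily if the path is still unknown, and return nothing rather than fail if the file cannot be opened.

```python
import os
import re
from typing import Optional

# Matches an HTCondor submit-file line such as: error = job.err
_ERROR_LINE = re.compile(r"^\s*error\s*=\s*(\S.*?)\s*$", re.IGNORECASE)


class Job:
    """Hypothetical stand-in for the per-job state a monitor keeps."""

    def __init__(self, name: str, submit_file: str):
        self.name = name
        self.submit_file = submit_file
        # Unknown until the submit file has been parsed.
        self.error_file: Optional[str] = None


def parse_error_path(submit_file: str) -> Optional[str]:
    """Return the value of the 'error' key in a submit file, or None."""
    try:
        with open(submit_file) as fh:
            for line in fh:
                match = _ERROR_LINE.match(line)
                if match:
                    return match.group(1)
    except OSError:
        pass
    return None


def read_job_stderr(job: Job) -> Optional[str]:
    """Read the job's stderr, tolerating events that arrive out of order."""
    if job.error_file is None:
        # A later event may be seen before the submit event was processed,
        # so the path is resolved on demand instead of being assumed.
        job.error_file = parse_error_path(job.submit_file)
    if job.error_file is None:
        return None
    path = job.error_file
    if not os.path.isabs(path):
        # Assumption: relative paths are resolved against the submit directory.
        path = os.path.join(os.path.dirname(job.submit_file), path)
    try:
        with open(path) as fh:
            return fh.read()
    except OSError:
        return None
```

      With this shape, a job whose submit event has not been processed yet yields `None` instead of raising, and the caller can decide whether to retry once more of the log has been read.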

            People

            • Assignee:
              vahi Karan Vahi
            • Reporter:
              dbrown Duncan Brown
