Pegasus / PM-1107

pegasuslite signal handler race condition



      Description


      We've been seeing workflows fail: jobs are marked as succeeded, but the transfer of their output data back to the storage site failed, leaving the files missing or corrupted. I've confirmed that this is not just due to users' home directories filling up. This happens both on OSG (where a GridFTP transfer seems to fail to create the file but gets marked as success) and in regular use on LDG clusters (particularly atlas, whose filesystem can be a bit flaky).

      Since the job is marked as success, subsequent jobs fail because the file does not exist. Manual intervention is needed to fix the workflow, as the rescue DAG is not correct. Here's an example of a job that failed on sugar-dev2.phy.syr.edu because it is trying to transfer a file that doesn't exist:

      /usr1/steven.reyes/pycbc-tmp.rAKEaJVW2t/work/main_ID0000001/stage_inter_local_hdf_trigger_merge-BNS_UNIFORM_LOWSPIN3_INJ-H1_ID193_ID0183120_0.out.003


      2016-06-16 20:39:26,583 INFO: --------------------------------------------------------------------------------
      2016-06-16 20:39:26,583 INFO: Starting transfers - attempt 3
      2016-06-16 20:39:28,586 WARNING: Symlink source (/home/steven.reyes/projects/cbc/O1-Analysis/BNS-Rates/Analyses789/UniformMassRuns3/output/osg-scratch/work/main_ID0000001/113413/H1-INSPIRAL_BNS_UNIFORM_LOWSPIN3_INJ_JOB13-1134132415-1879.hdf) does not exist
      2016-06-16 20:39:28,586 WARNING: Symlink source (/home/steven.reyes/projects/cbc/O1-Analysis/BNS-Rates/Analyses789/UniformMassRuns3/output/osg-scratch/work/main_ID0000001/113596/H1-INSPIRAL_BNS_UNIFORM_LOWSPIN3_INJ_JOB13-1135961639-1851.hdf) does not exist
      2016-06-16 20:39:28,586 INFO: --------------------------------------------------------------------------------

      Looking at the stderr file from this job, it looks like the PegasusLite script never tried to stage the file out after the user task completed. Here's an excerpt from:

      /usr1/steven.reyes/pycbc-tmp.rAKEaJVW2t/work/main_ID0000001/inspiral-BNS_UNIFORM_LOWSPIN3_INJ-H1_ID191_ID0141553.err.000

      ##################### setting the xbit for executables staged #####################

      ############################# executing the user tasks #############################
      2016-06-13 19:50:54: /var/condor/execute/dir_43486/glide_vpqRYu/execute/dir_56517/pegasus.JYirNp cleaned up
      PegasusLite: exitcode 0
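The issue title points at a signal-handler race as a likely cause of the `exitcode 0` above. As a purely hypothetical illustration (this is not the actual pegasus-lite source; `run_job` and `task_ec` are made-up names), a minimal bash sketch shows how a TERM delivered between the user task and stage-out can make a trap handler report the task's exit code of 0 while the stage-out step is silently skipped:

```shell
#!/bin/bash
# Hypothetical sketch (NOT the real pegasus-lite code) of the suspected
# race: a TERM arriving after the user task but before stage-out runs a
# handler that reports the task's recorded exit code, so the job looks
# successful even though its outputs were never transferred.

run_job() {
    task_ec=1
    # The handler reports whatever exit code was recorded so far.
    trap 'echo "PegasusLite: exitcode $task_ec"; exit "$task_ec"' TERM

    true                        # stand-in for the user task; it succeeds
    task_ec=0

    kill -TERM "$BASHPID"       # simulate the signal arriving right here

    echo "staging out outputs"  # never reached: stage-out is skipped
}

( run_job )                     # subshell keeps the TERM contained
```

When run, the handler prints `PegasusLite: exitcode 0` and the "staging out outputs" line never appears, which matches the excerpt above where the job cleans up and reports success without any stage-out.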

      In any case, there have been other edge cases where an output file has been corrupted or a transfer back to the submission site has failed. Would it be possible to add a check to catch these cases and flag the jobs as failed?
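As a sketch of what such a check might look like (this is a hypothetical helper, not existing Pegasus functionality; `verify_outputs` is a made-up name), a job wrapper could verify that each declared output exists and is non-empty before reporting success:

```shell
#!/bin/bash
# Hypothetical post-job check (not part of Pegasus): return nonzero if
# any expected output file is missing or empty after the transfer step,
# so the job can be flagged as failed instead of silently succeeding.
verify_outputs() {
    local rc=0 f
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "ERROR: output file $f is missing or empty" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

A wrapper could then end with something like `verify_outputs "$@" || exit 1`, so a missing or zero-length output turns the job red rather than letting downstream jobs discover the problem later.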


            People

            • Assignee: rynge Mats Rynge
            • Reporter: dbrown Duncan Brown
