pegasuslite signal handler race condition

      We've been seeing workflows fail because jobs are marked as succeeded even though the transfer of their output data back to the storage site failed, leaving the files missing or corrupted. I've confirmed that this is not just due to users' home directories filling up. It happens both on OSG (where a gridftp transfer seems to fail to create the file but is still marked as a success) and in regular use on LDG clusters (particularly atlas, whose filesystem can be a bit flaky).

      Since the job is marked as a success, subsequent jobs fail because the file does not exist. Manual intervention is needed to fix the workflow, as the rescue DAG is not correct. Here's an example of a job that failed on sugar-dev2.phy.syr.edu because it is trying to transfer a file that doesn't exist:

      /usr1/steven.reyes/pycbc-tmp.rAKEaJVW2t/work/main_ID0000001/stage_inter_local_hdf_trigger_merge-BNS_UNIFORM_LOWSPIN3_INJ-H1_ID193_ID0183120_0.out.003

      2016-06-16 20:39:26,583 INFO: --------------------------------------------------------------------------------
      2016-06-16 20:39:26,583 INFO: Starting transfers - attempt 3
      2016-06-16 20:39:28,586 WARNING: Symlink source (/home/steven.reyes/projects/cbc/O1-Analysis/BNS-Rates/Analyses789/UniformMassRuns3/output/osg-scratch/work/main_ID0000001/113413/H1-INSPIRAL_BNS_UNIFORM_LOWSPIN3_INJ_JOB13-1134132415-1879.hdf) does not exist
      2016-06-16 20:39:28,586 WARNING: Symlink source (/home/steven.reyes/projects/cbc/O1-Analysis/BNS-Rates/Analyses789/UniformMassRuns3/output/osg-scratch/work/main_ID0000001/113596/H1-INSPIRAL_BNS_UNIFORM_LOWSPIN3_INJ_JOB13-1135961639-1851.hdf) does not exist
      2016-06-16 20:39:28,586 INFO: --------------------------------------------------------------------------------

      Looking at the stderr file from this job, it appears that the pegasus lite script never tried to stage the file out after the user task completed. Here's an excerpt from:

      /usr1/steven.reyes/pycbc-tmp.rAKEaJVW2t/work/main_ID0000001/inspiral-BNS_UNIFORM_LOWSPIN3_INJ-H1_ID191_ID0141553.err.000

      ##################### setting the xbit for executables staged #####################
      ############################# executing the user tasks #############################
      2016-06-13 19:50:54: /var/condor/execute/dir_43486/glide_vpqRYu/execute/dir_56517/pegasus.JYirNp cleaned up
      PegasusLite: exitcode 0
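
      For illustration only (this is a sketch, not the actual PegasusLite wrapper, which is a shell script, and the function names are made up), the pattern below shows the kind of signal handler race the summary refers to: if a cleanup handler that reports success fires after the user task finishes but before stage-out runs, the job exits 0 even though no output was ever transferred.

        import signal
        import sys

        def run_user_task():
            pass  # stand-in for the user task, which completes normally

        def stage_out_outputs():
            pass  # stand-in for transferring outputs back to the storage site

        def cleanup_and_exit(signum, frame):
            # Hypothetical handler: cleans up scratch space and reports
            # success unconditionally, even if stage-out has not run yet.
            print("cleaned up")
            print("PegasusLite: exitcode 0")
            sys.exit(0)

        signal.signal(signal.SIGTERM, cleanup_and_exit)

        run_user_task()
        # If SIGTERM arrives here, the handler exits 0 and the next line never
        # runs: the outputs stay on the worker node but the job "succeeds".
        stage_out_outputs()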

      In any case, there have been other edge cases where an output file has become corrupted or its transfer back to the submission site has failed. Would it be possible to add a check to catch these cases and flag the jobs as failed?
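
      As a sketch of the kind of check I have in mind (the hook point and function names here are placeholders, not an existing Pegasus feature), something along these lines could run after the transfers finish and force a non-zero exit if any declared output is missing or zero-length; a size or checksum comparison against expected values would catch the corruption case as well.

        import os
        import sys

        def verify_outputs(paths):
            """Return a list of problems; an empty list means the outputs look sane."""
            problems = []
            for path in paths:
                if not os.path.isfile(path):
                    problems.append("missing: %s" % path)
                elif os.path.getsize(path) == 0:
                    problems.append("zero-length: %s" % path)
            return problems

        if __name__ == "__main__":
            issues = verify_outputs(sys.argv[1:])
            for issue in issues:
                sys.stderr.write(issue + "\n")
            # A non-zero exit code here lets the job be flagged as failed and
            # retried, instead of letting downstream jobs discover the problem.
            sys.exit(1 if issues else 0)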

            Assignee:
            Mats Rynge
            Reporter:
            Duncan Brown
            Watchers:
            2
