Pegasus / PM-1346

Pegasus job checkpointing is incompatible with condorio

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: master, 4.9.0
    • Fix Version/s: master, 5.0.0, 4.9.2
    • Component/s: pegasus-plan
    • Labels:
      None

      Description

      Pegasus job checkpointing is incompatible with setting

      pegasus.data.configuration=condorio

      With this setting, Condor always tries to transfer the checkpoint file on job startup, even when the file does not exist yet:

      transfer_input_files = /home/daniel.finstad/projects/bh_spin_priors/gw150914/gw150914_inference_tf2.ini,H1L1V1-CREATE_INJECTIONS_0-1126259454-16.hdf,H1L1V1-INFERENCE_0-1126259454-16.hdf.checkpoint,/usr/share/pegasus/sh/pegasus-lite-common.sh,/usr1/dbrown/daniel/./test_condorio-main_ID0000001.000/pegasus-worker-4.8.4-x86_64_rhel_7.tar.gz

      which results in the Condor error

      Error from slot1@CRUSH-SUGWG-OSG-10-5-229-88: SHADOW at 128.230.190.43 failed to send file(s) to <128.230.11.10:22390>: error reading from /usr1/dbrown/daniel/./test_condorio-main_ID0000001.000/H1L1V1-INFERENCE_0-1126259454-16.hdf.checkpoint: (errno 2) No such file or directory

      I'm not sure whether there is a way to tell Condor that it is OK for a transfer input/output file not to exist.
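
      The failure can be checked for before submission. Below is a minimal sketch, assuming that relative entries in transfer_input_files are resolved against the submit directory (as the path in the error message suggests). The helper name missing_transfer_inputs and the file names in the demo are hypothetical; this is not part of Pegasus or HTCondor, and it only diagnoses the problem rather than fixing it.

```python
import os
import tempfile


def missing_transfer_inputs(transfer_input_files, submit_dir):
    """Return the entries of a transfer_input_files value that are not on disk.

    Assumption: relative entries are resolved against submit_dir, matching
    the path reported in the SHADOW error.
    """
    missing = []
    for entry in transfer_input_files.split(","):
        entry = entry.strip()
        if not entry:
            continue  # skip empty fragments such as a trailing comma
        path = entry if os.path.isabs(entry) else os.path.join(submit_dir, entry)
        if not os.path.exists(path):
            missing.append(entry)
    return missing


# Demo in a throwaway directory: the input file exists, the checkpoint does not.
with tempfile.TemporaryDirectory() as submit_dir:
    open(os.path.join(submit_dir, "input.hdf"), "w").close()
    value = "input.hdf,output.hdf.checkpoint"
    result = missing_transfer_inputs(value, submit_dir)
    print(result)
```

      On a first run, before any checkpoint has been written, the checkpoint entry would be reported as missing, which is the same condition that makes the transfer fail.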

            People

            • Assignee: Karan Vahi (vahi)
            • Reporter: Duncan Brown (dbrown)