planner drops transfer_(in|out)put_files if NoGridStart is used

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: master, 5.0.0, 4.9.2
    • Affects Version/s: master, 4.9.1
    • Component/s: Pegasus Planner
    • Labels: None

      We need to run pycbc_inference jobs using Condor I/O and without kickstart so that we can use Condor's native vanilla universe checkpointing. Pegasus' checkpointing mechanism causes too much badput with these jobs.

      However, turning on NoGridStart causes the planner to drop transfer_input_files and transfer_output_files from the submit file. Here are the transfer settings for a job planned without NoGridStart:

      [dbrown@sugwg-condor condor-checkpoint-sig]$ grep transfer test_condorio-main_ID0000001/testjob_j1.sub
      should_transfer_files = YES
      transfer_executable = true
      transfer_input_files = /home/dbrown/projects/osg/condor-checkpoint-sig/my.input,/usr/share/pegasus/sh/pegasus-lite-common.sh,/home/dbrown/projects/osg/condor-checkpoint-sig/./test_condorio-main_ID0000001.000/pegasus-worker-4.9.1dev-x86_64_rhel_7.tar.gz
      transfer_output_files = my.output,my.checkpoint,wrapper.log,wrapper.checkpoint,
      when_to_transfer_output = ON_EXIT_OR_EVICT

      When I add

      profile pegasus "gridstart" "NoGridStart"

      to the transformation catalog, the planner correctly generates only a .sub file and not a .sh file, but transfer_input_files and transfer_output_files are missing from the .sub file:

      [dbrown@sugwg-condor condor-checkpoint-sig]$ grep transfer test_condorio-main_ID0000001/testjob_j1.sub
      should_transfer_files = YES
      transfer_executable = false
      when_to_transfer_output = ON_EXIT_OR_EVICT
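
      A small check along the following lines could confirm the problem on a generated submit file, or verify a fix. This is only a sketch: parse_submit_text and missing_transfer_keys are hypothetical helper names, not part of Pegasus or Condor, and the parser handles only simple "key = value" lines like the ones shown above.

```python
import re

# Submit-file keys expected for a Condor I/O job (taken from the working case above).
EXPECTED_KEYS = ("transfer_input_files", "transfer_output_files")


def parse_submit_text(text):
    """Return a dict of lowercased key -> value for simple 'key = value' lines."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        match = re.match(r"^([A-Za-z_+][\w.+]*)\s*=\s*(.*)$", line)
        if match:
            entries[match.group(1).lower()] = match.group(2).strip()
    return entries


def missing_transfer_keys(text):
    """List the expected transfer keys that are absent or have an empty value."""
    entries = parse_submit_text(text)
    return [key for key in EXPECTED_KEYS if not entries.get(key)]


# Transfer settings from the NoGridStart case reported above.
NO_GRIDSTART_SUB = """\
should_transfer_files = YES
transfer_executable = false
when_to_transfer_output = ON_EXIT_OR_EVICT
"""

print(missing_transfer_keys(NO_GRIDSTART_SUB))
# prints ['transfer_input_files', 'transfer_output_files']
```

      To run it against a real job, read the .sub file into a string and pass it to missing_transfer_keys; an empty list means both keys are present.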

      I couldn't see an obvious solution looking through the planner code, as this is split between the SLS and GridStart classes, but Karan may know an easy fix.

            Assignee:
            Karan Vahi
            Reporter:
            Duncan Brown
            Watchers:
            3
