  1. Pegasus
  2. PM-779

Support for job checkpoint files in planner




      While running LIGO ahope codes on Virgo clusters, we noticed that the nodes enforce a 12-hour wall clock limit. Jobs that run longer than 12 hours keep failing, and they start from the beginning every time they are rerun.

      The codes themselves can create checkpoint files periodically. We would like to introduce support for non-shared filesystem environments, where Pegasus tracks checkpoint files and transfers them automatically. This also involves changing kickstart to send a signal to the job before it is killed.
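
      To illustrate the job side of this, here is a minimal sketch of a code that resumes from a checkpoint file and writes one when it is signalled. It is not taken from ahope or kickstart. The class name, the JSON checkpoint layout, and the choice of SIGTERM are all assumptions made for the example, since the ticket does not say which signal kickstart would send.

```python
import json
import os
import signal


class CheckpointedJob:
    """Hypothetical long-running job that can stop early and resume later."""

    def __init__(self, path, total_steps):
        self.path = path                  # checkpoint file location (assumed)
        self.total_steps = total_steps
        self.stop_requested = False
        self.step = self._load()          # resume point, 0 if no checkpoint

    def _load(self):
        # Resume from an existing checkpoint file if one was staged in.
        if os.path.exists(self.path):
            with open(self.path) as fh:
                return json.load(fh)["step"]
        return 0

    def save(self):
        # Write to a temporary file first so a kill mid-write cannot
        # leave a truncated checkpoint behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as fh:
            json.dump({"step": self.step}, fh)
        os.replace(tmp, self.path)

    def on_signal(self, signum, frame):
        # Only set a flag here; the main loop does the actual checkpointing.
        self.stop_requested = True

    def run(self):
        while self.step < self.total_steps and not self.stop_requested:
            self.step += 1                # stand-in for one unit of real work
        self.save()
        return self.step


def install_handler(job, signum=signal.SIGTERM):
    # SIGTERM is an assumption; use whichever signal the wrapper sends.
    signal.signal(signum, job.on_signal)
```

      A job would call install_handler(job) once at startup and then job.run(). Keeping the handler down to setting a flag avoids doing file I/O inside the signal handler itself.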

      For this, the user should be able to mark a file as a checkpoint file in the DAX.

      The planner handles a checkpoint file as follows:
      1) If the checkpoint file exists in the replica catalog, it is staged in as part of the stage-in jobs.
      2) In the PegasusLite case, the planner sets the file up as both an input and an output file, so that it is transferred to and from the staging site.
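
      The two rules above can be sketched as a small function. This is only an illustration of the described behaviour, not planner code. TransferPlan, plan_checkpoint, and the dictionary used as a replica catalog are hypothetical names, and the sketch does not model what happens on the first run, when no checkpoint file exists yet.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TransferPlan:
    # Source URLs handled by the stage-in jobs (replica -> staging site).
    stage_in: List[str] = field(default_factory=list)
    # Logical file names pulled from the staging site before the job runs.
    lite_inputs: List[str] = field(default_factory=list)
    # Logical file names pushed back to the staging site after the job runs.
    lite_outputs: List[str] = field(default_factory=list)


def plan_checkpoint(lfn: str,
                    replica_catalog: Dict[str, str],
                    pegasus_lite: bool) -> TransferPlan:
    """Apply the two checkpoint rules to one logical file name (lfn)."""
    plan = TransferPlan()

    # Rule 1: a checkpoint already registered in the replica catalog
    # is staged in like any other input file.
    if lfn in replica_catalog:
        plan.stage_in.append(replica_catalog[lfn])

    # Rule 2: under PegasusLite the file is both an input and an output,
    # so it moves to and from the staging site around the job.
    if pegasus_lite:
        plan.lite_inputs.append(lfn)
        plan.lite_outputs.append(lfn)

    return plan
```

      For example, plan_checkpoint("job.ckpt", {}, True) yields no stage-in entry but lists job.ckpt as both a PegasusLite input and output.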




            • Assignee:
              vahi Karan Vahi
              dbrown Duncan Brown
            • Watchers:
              3


              • Created: