Type: New Feature
Resolution: Fixed
Priority: Major
Fix Version/s: master, 4.5.0
Affects Version/s: master
Component/s: Pegasus Planner, Planner: Transfer Module
Labels: None
Environment: Non-shared FS deployed on Virgo clusters

While running LIGO ahope codes on Virgo clusters, we noticed that the nodes enforce a 12-hour wall-clock limit. Jobs that need more than 12 hours are killed at the limit, so they keep failing and start from the beginning every time they are rerun.

The codes themselves can create checkpoint files periodically. We would like to add support for non-shared filesystem environments, in which Pegasus tracks checkpoint files and transfers them automatically. This also involves changing kickstart to send a signal to the job before it is killed.
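To make the job side of this request concrete, here is a minimal sketch of a job that restores from a checkpoint, stops cleanly when signalled, and saves its progress before exiting. It is not taken from the LIGO codes or from kickstart. The class `CheckpointingJob`, the helper `install_handler`, the JSON checkpoint format, and the choice of `SIGTERM` are all assumptions made for illustration; the issue does not say which signal kickstart sends.

```python
import json
import os
import signal


class CheckpointingJob:
    """Hypothetical long-running job that can resume from a checkpoint file."""

    def __init__(self, checkpoint_path, total_steps):
        self.checkpoint_path = checkpoint_path
        self.total_steps = total_steps
        self.step = 0
        self.stop_requested = False

    def restore(self):
        # Resume from the last saved step if a checkpoint was staged in.
        if os.path.exists(self.checkpoint_path):
            with open(self.checkpoint_path) as fh:
                self.step = json.load(fh)["step"]

    def save(self):
        # Write to a temporary file and rename it, so an interrupted write
        # never leaves a truncated checkpoint behind.
        tmp_path = self.checkpoint_path + ".tmp"
        with open(tmp_path, "w") as fh:
            json.dump({"step": self.step}, fh)
        os.replace(tmp_path, self.checkpoint_path)

    def handle_signal(self, signum, frame):
        # Only set a flag here; the main loop saves the checkpoint itself.
        self.stop_requested = True

    def run(self, work):
        """Run until finished or signalled. Returns True if the job completed."""
        self.restore()
        while self.step < self.total_steps and not self.stop_requested:
            self.step += 1
            work(self.step)
        self.save()
        return self.step >= self.total_steps


def install_handler(job, signum=signal.SIGTERM):
    # Assumption: the warning arrives as SIGTERM. Must be called from the
    # main thread, as required by signal.signal.
    signal.signal(signum, job.handle_signal)
```

A job would call `install_handler(job)` once at startup and then `job.run(...)`. Keeping the handler down to setting a flag avoids writing the file from inside the signal handler, at the cost of finishing the current step before the checkpoint is saved.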

For this, the user should be able to mark a file as a checkpoint file in the DAX.

The planner handles a checkpoint file as follows:
1) If the checkpoint file exists in the replica catalog, it is staged as part of the stage-in jobs.
2) In the PegasusLite case, the planner sets it up as both an input and an output file for the transfers to and from the staging site.
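The two planner rules above can be sketched as a small decision function. This is an illustration only, not the Pegasus planner code: `TransferPlan`, `plan_checkpoint_file`, and the dictionary used as a stand-in for the replica catalog are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class TransferPlan:
    """Hypothetical summary of the transfers planned for one checkpoint file."""
    stage_in: list = field(default_factory=list)        # replica location -> staging site
    worker_inputs: list = field(default_factory=list)   # staging site -> worker node
    worker_outputs: list = field(default_factory=list)  # worker node -> staging site


def plan_checkpoint_file(lfn, replica_catalog, pegasus_lite):
    """Apply the two rules to a checkpoint file with logical name `lfn`.

    `replica_catalog` is assumed to be a dict mapping logical file names
    to physical locations.
    """
    plan = TransferPlan()

    # Rule 1: an existing replica is staged as part of the stage-in jobs.
    if lfn in replica_catalog:
        plan.stage_in.append((replica_catalog[lfn], lfn))

    # Rule 2: in the PegasusLite case the file is both an input and an
    # output, so it moves to and from the staging site.
    if pegasus_lite:
        plan.worker_inputs.append(lfn)
        plan.worker_outputs.append(lfn)

    return plan
```

On a first run with no entry in the catalog, this sketch plans no stage-in but still lists the file as an output, which is what lets a later retry pick the checkpoint up.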

Assignee: Karan Vahi
Reporter: Duncan Brown

Created: 21/Aug/14 2:35 PM
Updated: 27/Oct/14 2:36 PM
Resolved: 27/Oct/14 2:36 PM
Archived: 14/Dec/24 10:43 PM
