Type: New Feature
Resolution: Won't Fix
Priority: Major
Affects Version/s: master, 4.7.2
Component/s: CLI: pegasus-transfer, Pegasus Lite
Environment: PegasusLite
We have a user request to be able to tar up the contents of the job directory in case of error and transfer it back.

Snippet of email from Greg:
We consider the contents of the working runtime directory on the compute node (so, the execution site) to be highly valuable for forensics if something goes wrong. Thus we are considering how we might 'snapshot' or 'freeze dry' the contents of that working directory upon errors/exceptions, and retrieve it for analysis. While a sharedfs configuration may be common for us going forward, we would like to be able to handle the 'worst case scenario' of, say, condorio or nonsharedfs configurations (e.g., FermiGrid - Open Science Grid), where there is a risk of losing the node, losing the contents of the working directory, and then never fully understanding the cause of failure (if the cause is sufficiently complex that stdout/err do not provide a sufficient explanation).
Some context on our work: DES has its own customized workflow system that we have used for FermiGrid processing. In that work we have encoded the ability to tar up the execution node working directory and transfer it out if a wrapper script detects errors/exceptions in a job. (The tar file is not a predictable output of every job, but rather is produced only by jobs that encounter an error/exception.)
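
To make the request concrete, below is a minimal sketch of the kind of error-triggered snapshot wrapper described above, written in Python. Everything in it is an illustrative assumption rather than an existing PegasusLite or pegasus-transfer interface: the wrapper name, the stage_out() placeholder, and the SNAPSHOT_DEST / JOB_ID environment variables are hypothetical, and a real implementation would hand the tarball to whatever data staging mechanism the job already uses.

{code:python}
#!/usr/bin/env python3
"""Hypothetical wrapper: run a job command and, on non-zero exit,
tar up the working directory and stage the tarball out for forensics."""
import os
import subprocess
import sys
import tarfile
import time

# Illustrative destination; in a real PegasusLite job this would be a
# staging-site URL handled by the configured transfer tool.
SNAPSHOT_DEST = os.environ.get("SNAPSHOT_DEST", "/tmp/job-snapshots")


def snapshot_workdir(workdir, job_id):
    """Create a gzipped tarball of the job working directory in /tmp."""
    tarball = os.path.join("/tmp", "%s-%d.tar.gz" % (job_id, int(time.time())))
    with tarfile.open(tarball, "w:gz") as tar:
        # Keep a single top-level directory named after the job.
        tar.add(workdir, arcname=job_id)
    return tarball


def stage_out(tarball):
    """Placeholder stage-out: copy the tarball to SNAPSHOT_DEST.

    A real implementation would instead invoke the existing data
    staging mechanism (for example, an extra pegasus-transfer call)."""
    os.makedirs(SNAPSHOT_DEST, exist_ok=True)
    subprocess.run(["cp", tarball, SNAPSHOT_DEST], check=True)


def main():
    if len(sys.argv) < 2:
        print("usage: snapshot_wrapper.py <job command> [args...]", file=sys.stderr)
        return 1

    job_id = os.environ.get("JOB_ID", "job")
    workdir = os.getcwd()

    # Run the wrapped job command, passing stdout/stderr straight through.
    result = subprocess.run(sys.argv[1:])

    if result.returncode != 0:
        # Only failing jobs produce a snapshot, mirroring the DES setup
        # where the tarball is not a predictable output of every job.
        tarball = snapshot_workdir(workdir, job_id)
        stage_out(tarball)

    return result.returncode


if __name__ == "__main__":
    sys.exit(main())
{code}

Invoked as "snapshot_wrapper.py <job command and args>", the wrapper is transparent for successful jobs and only adds the tar-and-transfer step on failure, which is the behaviour the user is asking PegasusLite to provide natively.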