Pegasus / PM-1156

PegasusLite to tar up the contents of the cwd in case of job failure


      We have a user request to be able to tar up the contents of the job directory when a job fails and transfer the resulting archive back.

      Snippet of email from Greg

      We consider the contents of the working runtime directory on the compute node (so, the execution site) to be highly valuable for forensics if something goes wrong. Thus we are considering how we might 'snapshot' or 'freeze dry' the contents of that working directory upon errors/exceptions, and retrieve it for analysis. While a sharedfs configuration may be common for us going forward, we would like to be able to handle the 'worst case scenario' of say condorio or nonsharedfs configurations (e.g., FermiGrid - Open Science Grid) where there is a risk of losing the node, losing the contents of the working directory, and then never fully understanding the cause of failure (if the cause is sufficiently complex, such that stdout/err do not provide sufficient explanation).

      Some context on our work is that DES has its own customized workflow system that we have used for FermiGrid processing. In that work we have encoded the ability to tar up the execution node working directory and transfer it out if a wrapper script detects errors/exceptions in a job. (The tar file is not a predictable output of every job, rather only those that encounter an error/exception.)
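
      To make the request concrete, here is a minimal sketch of the wrapper behaviour described in the email: run the job, and only if it exits non-zero, tar up its working directory. This is an illustration, not PegasusLite's implementation. The function name `run_and_snapshot` and its parameters are hypothetical, and the step that transfers the archive back to the submit host is left out.

```python
import subprocess
import tarfile
from pathlib import Path


def run_and_snapshot(cmd, workdir, archive_path):
    """Run cmd in workdir; on a non-zero exit, tar up workdir.

    Hypothetical helper for illustration only.
    Returns (exit_code, archive_path or None).
    """
    workdir = Path(workdir)
    result = subprocess.run(cmd, cwd=workdir)
    if result.returncode == 0:
        # Successful jobs produce no archive, so the tarball is not
        # a predictable output of every job.
        return 0, None

    archive_path = Path(archive_path)
    # The caller should place the archive outside workdir so that the
    # tarball does not try to include itself.
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(workdir, arcname=workdir.name)
    return result.returncode, archive_path
```

      Because the archive exists only for failed jobs, whatever transfers it back would have to treat it as an optional output rather than a required one.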

            rynge Mats Rynge
            vahi Karan Vahi