  1. Pegasus
  2. PM-734

symlink in PegasusLite is created incorrectly.


      This affects the case where the staging site and the execution site are the same. The workflow is run in non-shared filesystem mode, and the staging site has a remote server associated with it.

      In the PegasusLite wrapper script for the jobs, the planner creates this source/destination pair:

      srm://path/to/dir/on/staging-site/file
      symlink://$PWD/file

      This results in pegasus-transfer performing a two-step transfer for the symlink (a copy to a temporary file, followed by a symlink to that copy), which breaks the globbing code in the application being run.

      In this case, the planner should have generated the following instead:
      file://path/to/dir/on/staging-site/file
      symlink://$PWD/file
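
      A minimal sketch of the intended outcome, assuming that a file:// source lets the symlink point straight at the staged file. The temporary directories and the file name `file.gwf` are hypothetical stand-ins for the staging-site directory and the job's $PWD; this is not the actual pegasus-transfer code.

```shell
#!/bin/sh
# Hypothetical stand-ins: one temp dir plays the staging-site directory,
# the other plays the job's working directory ($PWD in the wrapper script).
set -eu
staging_dir=$(mktemp -d)
job_dir=$(mktemp -d)
echo "frame data" > "$staging_dir/file.gwf"

# With a local file:// source there is nothing to copy first:
# the link in the job directory can point directly at the staged file.
ln -f -s "$staging_dir/file.gwf" "$job_dir/file.gwf"

# The link target keeps the original file name, so a later
# resolution of the link still yields a name ending in .gwf.
link_target=$(readlink "$job_dir/file.gwf")
link_target_name=$(basename "$link_target")
echo "link points at: $link_target_name"

rm -rf "$staging_dir" "$job_dir"
```

      Because no intermediate copy is made, no temporary file name is ever introduced between the staged file and the link.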

      Background email from John Veitch:

      Here's the sequence of events:
      1. Pegasus copies the frame files onto the head node of the remote cluster,
      keeping their names intact.
      2. Since we are not relying on NFS, when the compute job (lalinference_nest or
      inspiral) needs the data, Pegasus wraps it in a little script that does the
      following crucial steps before running the exe:

      1. Copies the data onto the local node's /tmp space
        lcg-cp -b -D srmv2 \
        'srm://tbn18.nikhef.nl:8446/srm/managerv2?SFN=/dpm/nikhef.nl/home/virgo/lalinference_pegasus_97c62984-7936-11e3-b419-005056a54299/H-H1_LDAS_C02_L2-968654208-128.gwf' \
        'file:////tmp/pegasus-transfer-WCh7IN.data'
      2. Makes a soft link from the /tmp space to the working directory for the job
        ln -f -s '/tmp/pegasus-transfer-WCh7IN.data' \
        '/tmp/jobdir/39783697.stro.nikhef.nl/pegasus.gsOTls/H-H1_LDAS_C02_L2-968654208-128.gwf'

      Note that the first of those two steps copies it to a non-GWF filename
      "pegasus-transfer-WCh7IN.data"

      3. The analysis code correctly globs for the pattern (*.gwf), finding the soft
      links.

      4. When LALFrame loads the files into a FrCache structure, it somehow resolves
      the links and inserts the resolved filename (pegasus-transfer-WCh7IN.data) in
      the appropriate field in the struct. This is the field that CacheSieve looks
      at!

      5. CacheSieve cannot find any filenames that match the pattern, so it fails to
      find any data.

      The solution I have just now replaces step 3 with a glob for H-*.gwf or
      whatever, then skips the Sieve stage. This is clearly good enough for now, but
      if we do ever have H-H2_LDAS_STRAIN-1234567890.gwf or whatever files then it
      won't work any more. I'm willing to come back to that later however if you
      think it's OK this way for inspiral too?

      I'll have to patch the inspiral code to do the same thing otherwise it will
      fail in the same way as lalinference (the frame-reading code is very similar
      so that won't be a problem).
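
      The failure described in the email can be reproduced outside Pegasus with a few shell commands. This is only a sketch: the directory layout and the file names `pegasus-transfer-demo.data` and `H-H1_DEMO-0-128.gwf` are made up for illustration, and the `case` pattern match stands in for whatever CacheSieve does internally.

```shell
#!/bin/sh
# Hypothetical layout: "$work/tmp" plays the node's /tmp space,
# "$work/jobdir" plays the job's working directory.
set -eu
work=$(mktemp -d)
mkdir "$work/tmp" "$work/jobdir"

# Step 1 of the wrapper: the data lands under a non-GWF temporary name.
echo "frame data" > "$work/tmp/pegasus-transfer-demo.data"

# Step 2 of the wrapper: a soft link with the real name is created.
ln -f -s "$work/tmp/pegasus-transfer-demo.data" "$work/jobdir/H-H1_DEMO-0-128.gwf"

# The application's glob for *.gwf finds the soft link...
found_name=""
resolved_name=""
for f in "$work"/jobdir/*.gwf; do
    found_name=$(basename "$f")
    # ...but resolving the link yields the temporary file name.
    resolved_name=$(basename "$(readlink -f "$f")")
done

# A pattern match on the resolved name (standing in for the sieve) fails.
case "$resolved_name" in
    *.gwf) sieve_result="match" ;;
    *)     sieve_result="no match" ;;
esac

echo "glob found:    $found_name"
echo "resolved name: $resolved_name"
echo "sieve result:  $sieve_result"

rm -rf "$work"
```

      If the link pointed directly at a file that kept its .gwf name, the resolved name would still match the pattern.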

            Assignee:
            vahi Karan Vahi
            Reporter:
            vahi Karan Vahi