Type: Bug
Resolution: Fixed
Priority: Major
Affects Version/s: master, 4.3.1
Component/s: Pegasus Planner, Planner: Transfer Module
This affects the case where the staging site and the execution site are the same, the workflow is run in non-shared filesystem mode, and the staging site has a remote server associated with it.
In the PegasusLite wrapper script for the jobs, the planner creates this sequence:
srm://path/to/dir/on/staging-site/file
symlink://$PWD/file
This results in pegasus-transfer performing a two-step transfer for the symlink (a copy to a temporary file, then a link to that file), which breaks the globbing code in the application being run.
In this case, the planner should instead have had the following in the PegasusLite wrapper script:
file://path/to/dir/on/staging-site/file
symlink://$PWD/file
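The intended behaviour can be sketched as follows. This is only an illustration of the rule described above, not the planner's actual code: the function name, the site names, and the `LOCAL_MOUNTS` table (mapping a remote URL prefix to a local directory) are all hypothetical, and a real planner would take that mapping from its site catalog.

```python
from urllib.parse import urlparse

# Hypothetical table: remote URL prefix of a staging site -> directory where
# the same storage is visible locally on that site.
LOCAL_MOUNTS = {
    "srm://example.org/staging": "/data/staging",
}


def source_for_symlink(source_url, dest_url, staging_site, execution_site):
    """Pick the source URL to pair with dest_url in the transfer list."""
    # Only symlink destinations need a local (file://) source.
    if urlparse(dest_url).scheme != "symlink":
        return source_url
    # A symlink is only possible when the data already lives on the same site.
    if staging_site != execution_site:
        return source_url
    # Swap the remote prefix for the local directory it corresponds to.
    for prefix, local_dir in LOCAL_MOUNTS.items():
        if source_url.startswith(prefix + "/"):
            return "file://" + local_dir + source_url[len(prefix):]
    # No known local path: leave the remote URL untouched.
    return source_url
```

With a `file://` source, the symlink can point directly at the staged file, so the link target keeps the original file name.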
Background email from John Veitch:
Here's the sequence of events:
1. Pegasus copies the frame files onto the head node of the remote cluster,
keeping their names correct.
2. Since we are not relying on NFS, when the compute job (lalinference_nest or
inspiral) needs the data, Pegasus wraps it in a little script that does the
following crucial steps before running the exe:
- Copies the data onto the local node's /tmp space
lcg-cp -b -D srmv2 'srm://tbn18.nikhef.nl:8446/srm/managerv2?SFN=/dpm/nikhef.nl/home/virgo/lalinference_pegasus_97c62984-7936-11e3-b419-005056a54299/H-H1_LDAS_C02_L2-968654208-128.gwf' 'file:////tmp/pegasus-transfer-WCh7IN.data'
- Makes a soft link from the /tmp space to the working directory for the job
ln -f -s '/tmp/pegasus-transfer-WCh7IN.data' '/tmp/jobdir/39783697.stro.nikhef.nl/pegasus.gsOTls/H-H1_LDAS_C02_L2-968654208-128.gwf'
Note that the first of those two steps copies it to a non-GWF filename
"pegasus-transfer-WCh7IN.data"
3. The analysis code correctly globs for the pattern (*.gwf), finding the soft
links.
4. When LALFrame loads the files into a FrCache structure, it somehow resolves
the links and inserts the resolved filename (pegasus-transfer-WCh7IN.data) in
the appropriate field in the struct. This is the field that CacheSieve looks
at!
5. CacheSieve cannot find any filenames that match the pattern, so it fails to
find any data.
The solution I have just now replaces step 3 with a glob for H-*.gwf or
whatever, then skips the Sieve stage. This is clearly good enough for now, but
if we do ever have H-H2_LDAS_STRAIN-1234567890.gwf or whatever files then it
won't work any more. I'm willing to come back to that later however if you
think it's OK this way for inspiral too?
I'll have to patch the inspiral code to do the same thing otherwise it will
fail in the same way as lalinference (the frame-reading code is very similar
so that won't be a problem).
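The failure in steps 3 to 5 of the email can be reproduced without LALFrame. The sketch below is a stand-in under stated assumptions: `os.path.realpath` plays the role of whatever resolves the links inside the frame cache, and `fnmatch` plays the role of the CacheSieve pattern match. The function name `demo` is made up for illustration; the file names are taken from the email.

```python
import fnmatch
import glob
import os
import tempfile


def demo():
    """Return (files globbed, matches by link name, matches by resolved name)."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = os.path.realpath(tmp)

        # Step 2a: the data lands under a temporary, non-GWF file name.
        data = os.path.join(tmp, "pegasus-transfer-WCh7IN.data")
        open(data, "w").close()

        # Step 2b: a soft link with the real name is made in the job directory.
        jobdir = os.path.join(tmp, "jobdir")
        os.mkdir(jobdir)
        link = os.path.join(jobdir, "H-H1_LDAS_C02_L2-968654208-128.gwf")
        os.symlink(data, link)

        # Step 3: globbing for *.gwf finds the soft link.
        found = glob.glob(os.path.join(jobdir, "*.gwf"))

        # Step 4: resolving the link replaces the name with the .data name.
        resolved = [os.path.realpath(p) for p in found]

        # Step 5: a pattern match on the resolved name finds nothing, while a
        # match on the link name still succeeds.
        by_link = [p for p in found
                   if fnmatch.fnmatch(os.path.basename(p), "*.gwf")]
        by_target = [p for p in resolved
                     if fnmatch.fnmatch(os.path.basename(p), "*.gwf")]
        return len(found), len(by_link), len(by_target)
```

If the symlink pointed at a file that kept its `.gwf` name, as in the `file://` source proposed above, the match on the resolved name would also succeed.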