- Type: Improvement
- Resolution: Fixed
- Priority: Major
- Affects Version/s: master, 4.5.1
- Component/s: Catalog: Replica Catalog, Pegasus Planner
Hi Karan,
I’m following up on a problem we saw back when putting the XSEDE proposal together. I’m on sugar-dev3 in
/home/lppekows/projects/XRAC_Jul2015/pool_test/962582415-962625615
The gwf files have PFN mappings in the dax with local entries, such as:
<file name="96261/L-L1_LDAS_C02_L2-962613760-128.gwf">
<pfn url="/frames/S6/LDAS_C02_L2/L1/L-L1_LDAS_C02_L2-9626/L-L1_LDAS_C02_L2-962613760-128.gwf" site="local"/>
</file>
but we also have a cache file with entries for both sites:
$ grep 96261/L-L1_LDAS_C02_L2-962613760-128.gwf /home/lppekows/projects/XRAC_Jul2015/pool_test/962582415-962625615/_reuse.cache
96261/L-L1_LDAS_C02_L2-962613760-128.gwf /scratch/02750/stuart/frames/S6/LDAShoftC02/LLO/L-L1_LDAS_C02_L2-9626/L-L1_LDAS_C02_L2-962613760-128.gwf pool="stampede"
96261/L-L1_LDAS_C02_L2-962613760-128.gwf /frames/S6/LDAS_C02_L2/L1/L-L1_LDAS_C02_L2-9626/L-L1_LDAS_C02_L2-962613760-128.gwf pool="local"
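To confirm that this holds for every frame file, not only the one grepped above, a short script along these lines could group the cache entries by LFN and list any that lack an entry for a given site. This is only a sketch: it assumes every cache line has the form `LFN PFN pool="SITE"` as in the output above, and the function names are illustrative, not part of Pegasus.

```python
import re
import sys
from collections import defaultdict

# Assumed line format, taken from the grep output above:
#   <lfn> <pfn> pool="<site>"
CACHE_LINE = re.compile(r'^(\S+)\s+(\S+)\s+pool="([^"]+)"\s*$')


def sites_by_lfn(lines):
    """Map each LFN to a dict of {site: pfn} built from cache-file lines."""
    mapping = defaultdict(dict)
    for line in lines:
        match = CACHE_LINE.match(line.strip())
        if match is None:
            continue  # skip blank lines or lines in an unexpected format
        lfn, pfn, site = match.groups()
        mapping[lfn][site] = pfn
    return dict(mapping)


def lfns_missing_site(mapping, site):
    """Return the sorted LFNs that have no entry for the given site."""
    return sorted(lfn for lfn, sites in mapping.items() if site not in sites)


if __name__ == "__main__":
    # Hypothetical usage: python check_cache.py _reuse.cache stampede
    cache_path, wanted_site = sys.argv[1], sys.argv[2]
    with open(cache_path) as handle:
        entries = sites_by_lfn(handle)
    for lfn in lfns_missing_site(entries, wanted_site):
        print(lfn)
```

If this prints nothing for `stampede`, the cache itself is complete and the missing entries are being lost somewhere during planning.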
At the planning stage:
$ pwd
/usr1/lppekows/pycbc-tmp.vlVRykqsxb/work
$ /usr1/lppekows/pycbc-tmp.vlVRykqsxb/work/subdax_main_ID0000001_pre.sh -Dpegasus.log.*=/usr1/lppekows/pycbc-tmp.vlVRykqsxb/work/subdax_main_ID0000001.pre.log -Dpegasus.workflow.root.uuid=60190396-09c2-4b12-ba2d-6ef1725a0437 -Dpegasus.dir.storage.mapper.replica=File -Dpegasus.dir.storage.mapper.replica.file=/home/lppekows/projects/XRAC_Jul2015/pool_test/962582415-962625615/main.map --conf /usr1/lppekows/pycbc-tmp.vlVRykqsxb/work/pegasus.3320546546350005354.properties --dir /usr1/lppekows/pycbc-tmp.vlVRykqsxb --relative-dir work/main_ID0000001 --relative-submit-dir work/./main_ID0000001 --sites local,stampede --cache /home/lppekows/projects/XRAC_Jul2015/pool_test/962582415-962625615/_reuse.cache,/usr1/lppekows/pycbc-tmp.vlVRykqsxb/work/weekly_ahope-0.cache --inherited-rc-files /usr1/lppekows/pycbc-tmp.vlVRykqsxb/work/weekly_ahope-0.replica.store --cluster label,horizontal --output-site local --cleanup none --deferred --group pegasus --dax /home/lppekows/projects/XRAC_Jul2015/pool_test/962582415-962625615/main.dax -vvv
it seems that Pegasus doesn’t see the Stampede entries. The log has messages such as
2015.08.14 14:02:04.490 EDT: [DEBUG] Selecting a pfn for lfn 96261/L-L1_LDAS_C02_L2-962613760-128.gwf
amongst[(/frames/S6/LDAS_C02_L2/L1/L-L1_LDAS_C02_L2-9626/L-L1_LDAS_C02_L2-962613760-128.gwf,
)]
Consequently the frame files are unnecessarily transferred to Stampede. During the proposal we hacked around this by removing the entries from the dax, after which everything worked as expected.
I’ve tried reproducing this problem with a small hand-written dax (in /home/lppekows/projects/pegasus) that I think has all the essential features: a local entry in the dax and two entries in the cache file. So far I haven’t been able to reproduce it. Either my test is missing something, or the problem is only triggered when the dax or cache exceeds a certain size.
Would you mind taking a look?
Thanks,
- Larne