- Type: Bug
- Resolution: Fixed
- Priority: Major
- Affects Version/s: master, 5.0.1
- Component/s: Pegasus Planner
- None
- Environment: RPM install on a login node, with no Pegasus installs on the cluster worker nodes.
When PegasusLite jobs start on a node, they source pegasus-lite-common.sh. This file is transferred into the job sandbox using HTCondor file transfers. In the job submit file you see, for example:
transfer_input_files = /usr/share/pegasus/sh/pegasus-lite-common.sh,/nas/home/napiersk/github/isi-vista/aida-integration/pegasus/validation/napiersk/pegasus/demo/run0001/pegasus-worker-5.0.2dev-x86_64_rhel_7.tar.gz
The pegasus-lite-common.sh file is picked up from the Pegasus install that the planner is running from. In this case, it is an RPM install.
In all native HTCondor environments this does not cause an issue, as HTCondor transfers the files correctly. However, in the local cluster (e.g., SLURM) case, the Glite/BLAHP layer generates a SLURM submit file for the job, which is then submitted to SLURM. In that SLURM job submit file, the transfer_input_files directives get written out as cp commands.
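To make the translation concrete, here is a minimal Python sketch of how a transfer_input_files value could end up as cp commands in a generated job script. This is not the actual Glite/BLAHP code: the function name `render_cp_commands` and the `sandbox` parameter are hypothetical, and the exact form of the generated cp lines is an assumption.

```python
import shlex
from typing import List


def render_cp_commands(transfer_input_files: str, sandbox: str = ".") -> List[str]:
    """Illustrative only: turn a comma-separated transfer_input_files
    value into one cp command per input file.

    The real BLAHP-generated script may look different; this just shows
    that each source path is copied verbatim, so it must exist on the
    node where the script runs.
    """
    # Split on commas and drop empty entries and surrounding whitespace.
    paths = [p.strip() for p in transfer_input_files.split(",") if p.strip()]
    # Quote each path so the command stays valid for unusual file names.
    return [f"cp {shlex.quote(p)} {shlex.quote(sandbox)}" for p in paths]


if __name__ == "__main__":
    value = "/usr/share/pegasus/sh/pegasus-lite-common.sh,/tmp/worker.tar.gz"
    for line in render_cp_commands(value):
        print(line)
```

The point of the sketch is that the source paths are carried over unchanged from the submit host, with no check that they are visible from the worker node.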
Hence, when a job starts on a worker node, the SLURM job script attempts to cp from a source (/usr/share/pegasus/sh/pegasus-lite-common.sh) that does not exist on that node, and the job fails.
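One way to confirm this failure mode is to check, on a worker node, which of the listed input paths are actually visible there. The following is a rough diagnostic sketch, not part of Pegasus; the helper name `missing_inputs` is hypothetical, and it assumes you can run Python on the worker node (for example through an interactive SLURM session).

```python
import os
from typing import List


def missing_inputs(transfer_input_files: str) -> List[str]:
    """Return the input paths that are not regular files on this host.

    Run this on a worker node, not on the login node: a path such as
    /usr/share/pegasus/sh/pegasus-lite-common.sh exists where the RPM
    is installed but may be absent everywhere else.
    """
    # Split the comma-separated submit-file value into individual paths.
    paths = [p.strip() for p in transfer_input_files.split(",") if p.strip()]
    # Keep only the paths a plain cp would fail to read on this host.
    return [p for p in paths if not os.path.isfile(p)]


if __name__ == "__main__":
    value = "/usr/share/pegasus/sh/pegasus-lite-common.sh"
    for path in missing_inputs(value):
        print(f"not found on this node: {path}")
```

Any path reported here is one that a cp command in the job script would fail on.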