PegasusLite submissions to local cluster (Slurm/PBS/etc) unable to source pegasus-lite-common.sh

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: master, 5.1.0, 5.0.2
    • Affects Version/s: master, 5.0.1
    • Component/s: Pegasus Planner
    • Labels: None
    • Environment:
      RPM install on a login node, with no pegasus installs on the cluster worker nodes.

      When PegasusLite jobs start on a node, they source pegasus-lite-common.sh. This file is transferred into the job sandbox using HTCondor file transfers. In the job submit file you see, for example:

      transfer_input_files = /usr/share/pegasus/sh/pegasus-lite-common.sh,/nas/home/napiersk/github/isi-vista/aida-integration/pegasus/validation/napiersk/pegasus/demo/run0001/pegasus-worker-5.0.2dev-x86_64_rhel_7.tar.gz

      The pegasus-lite-common.sh file is picked up from the Pegasus install that the planner is running from. In this case, it is an RPM install.

      In all native HTCondor environments this does not cause an issue, as HTCondor transfers the files correctly. However, in the local cluster (e.g. SLURM) case, the Glite/BLAHP layer generates a SLURM submit file for the job, which is then submitted to SLURM. In that SLURM job submit file, the transfer_input_files directives are written out as cp commands.

      Hence, when a job starts on a worker node, the SLURM job script attempts to cp from a source (/usr/share/pegasus/sh/pegasus-lite-common.sh) that does not exist on that node, and the job fails.
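
The failure mode described above can be illustrated with a small Python sketch. It is not the Glite/BLAHP code: the function names cp_commands and missing_sources are hypothetical, and the sketch only assumes that transfer_input_files is a comma-separated list of paths and that each entry becomes one cp into the job's working directory.

```python
import os
import shlex
from typing import List


def _split_inputs(transfer_input_files: str) -> List[str]:
    """Split a comma-separated transfer_input_files value into paths."""
    return [p.strip() for p in transfer_input_files.split(",") if p.strip()]


def cp_commands(transfer_input_files: str, sandbox: str = ".") -> List[str]:
    """Hypothetical translation: one cp command per input file.

    This mimics the behaviour described in the issue; it is an
    illustration, not the real BLAHP submit-script generator.
    """
    return [
        f"cp {shlex.quote(src)} {shlex.quote(sandbox)}"
        for src in _split_inputs(transfer_input_files)
    ]


def missing_sources(transfer_input_files: str) -> List[str]:
    """Return the inputs that do not exist on the node running this code.

    Run on a worker node, this lists exactly the cp sources that
    would make the generated job script fail.
    """
    return [
        src
        for src in _split_inputs(transfer_input_files)
        if not os.path.exists(src)
    ]


if __name__ == "__main__":
    # The first path is the one quoted in the issue; the second is a
    # made-up placeholder for the worker package tarball.
    value = "/usr/share/pegasus/sh/pegasus-lite-common.sh,/shared/run0001/worker.tar.gz"
    for cmd in cp_commands(value):
        print(cmd)
    print("missing here:", missing_sources(value))
```

On a worker node without a Pegasus install, missing_sources would report /usr/share/pegasus/sh/pegasus-lite-common.sh, which is the cp source that breaks the job.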

            Assignee:
            Karan Vahi
            Reporter:
            Karan Vahi
            Watchers:
            2
