PM-1944: Ability to specify a wrapper/launcher for containerized jobs in PegasusLite

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: master, 5.1.0, 5.0.8
    • Affects Version/s: master, 5.0.7
    • Component/s: PegasusLite
    • Labels: None

      This is similar to PM-1461, but applies to containerized jobs.

       

      When running ML training jobs on Neocortex, the jobs run with multiple cores (Slurm parameter ntasks > 1). At least on Neocortex, the training job needs to be launched via srun from within the sbatch script. For example, we need an invocation like:

       

      srun --kill-on-bad-exit $singularity_exec exec --no-home --bind $PWD:/srv --bind /ocean/projects/cis240026p/vahi/workflows/neocortex/scratch:/ocean/projects/cis240026p/vahi/workflows/neocortex/scratch:rw cerebras.sif /srv/train_ID0000003-cont.sh 

       

      The gridstart.launcher keys don't work for a containerized job, because in that case it is the kickstart invocation in PegasusLite that gets wrapped, which puts srun inside the container. srun needs to be outside the container to work correctly. This also cannot be achieved by writing a wrapper and listing the wrapper as the executable path in the TC, because in that case too srun is invoked inside the container.
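
The two orderings described above can be contrasted with a small shell sketch. This is an illustration only, not the code PegasusLite generates: the function names `launch_outside` and `launch_inside` are hypothetical, the kickstart step is omitted for brevity, the bind mounts from the example invocation are left out, and the command lines are only printed, never executed.

```shell
#!/bin/sh
# Hypothetical sketch: compose (but do not run) the two possible command lines.
# Arguments: $1 = launcher and its options, $2 = container invocation, $3 = job script.

# Requested ordering: the launcher wraps the container invocation,
# so srun runs on the host and starts the container for each task.
launch_outside() {
    printf '%s %s %s\n' "$1" "$2" "$3"
}

# Ordering obtained when the launcher wraps the job inside the container:
# the container runtime starts first, and srun would have to run inside it.
launch_inside() {
    printf '%s %s %s\n' "$2" "$1" "$3"
}

launcher="srun --kill-on-bad-exit"
container="singularity exec --no-home cerebras.sif"
job="/srv/train_ID0000003-cont.sh"

launch_outside "$launcher" "$container" "$job"
# prints: srun --kill-on-bad-exit singularity exec --no-home cerebras.sif /srv/train_ID0000003-cont.sh

launch_inside "$launcher" "$container" "$job"
# prints: singularity exec --no-home cerebras.sif srun --kill-on-bad-exit /srv/train_ID0000003-cont.sh
```

Only the first form matches the invocation needed on Neocortex; the second is what both the gridstart.launcher keys and a wrapper listed in the TC end up producing for containerized jobs.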

       

            Assignee: Karan Vahi (vahi)
            Reporter: Karan Vahi (vahi)

              Created:
              Updated:
              Resolved: