- Type: Improvement
- Resolution: Fixed
- Priority: Major
- Affects Version/s: master, 5.0.7
- Component/s: Pegasus Lite
None
This is similar to PM-1461, but applies to containerized jobs.
When running ML training jobs on Neocortex, the jobs run with multiple cores (Slurm parameter ntasks > 1). At least on Neocortex, the training job needs to be launched via srun from within the sbatch script. For example, we need an invocation like:
srun --kill-on-bad-exit $singularity_exec exec --no-home --bind $PWD:/srv --bind /ocean/projects/cis240026p/vahi/workflows/neocortex/scratch:/ocean/projects/cis240026p/vahi/workflows/neocortex/scratch:rw cerebras.sif /srv/train_ID0000003-cont.sh
The gridstart.launcher keys don't work for a containerized job, because they wrap the kickstart invocation in PegasusLite, and for a containerized job that invocation runs inside the container. This puts srun inside the container, whereas srun needs to be outside the container to work correctly. This also cannot be achieved by specifying a wrapper and listing the wrapper as the executable path in the TC, because in that case srun is also invoked inside the container.
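
To make the desired ordering concrete, here is a minimal dry-run sketch of how a launcher could be prepended outside the container invocation. It is not PegasusLite code: the function build_container_cmd and its arguments are hypothetical names used only for illustration, and the bind options are trimmed to the $PWD:/srv mount from the example above.

```shell
#!/bin/bash
# Sketch only: compose a container invocation with the launcher OUTSIDE the container.
# build_container_cmd is a hypothetical helper, not part of PegasusLite.
build_container_cmd() {
    local launcher="$1"   # e.g. "srun --kill-on-bad-exit"; may be empty
    local runtime="$2"    # path to the singularity executable
    local image="$3"      # container image, e.g. cerebras.sif
    local script="$4"     # job script path as seen inside the container

    # The container command itself; the launcher is never placed after "exec".
    local cmd="$runtime exec --no-home --bind $PWD:/srv $image $script"

    # Prepend the launcher so it runs on the host and starts the container.
    if [ -n "$launcher" ]; then
        cmd="$launcher $cmd"
    fi
    printf '%s\n' "$cmd"
}

# Dry run: print the command line instead of executing it.
build_container_cmd "srun --kill-on-bad-exit" singularity cerebras.sif /srv/train_ID0000003-cont.sh
```

With an empty launcher the function prints the plain container command, so non-Slurm sites would be unaffected under this sketch.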