I tried to submit a small Pegasus workflow to OSG with the Singularity image I created earlier, but I encountered some errors.
I believe I am missing something, so I would appreciate it if you could give me some hints when you have time.
1.
The error I am getting is:
-------------Job stderr file - 00/00/ex_sra_run_ID0000001.err.001-------------
2019-08-01 01:17:52: PegasusLite: version 4.9.1
2019-08-01 01:17:52: Executing on host hcc-5139558.0-red-c7107.unl.edu OSG_SITE_NAME=Red GLIDEIN_Site=Nebraska GLIDEIN_ResourceName=Nebraska
########################[Pegasus Lite] Setting up workdir ########################
2019-08-01 01:17:52: Not creating a new work directory as it is already set to /var/lib/condor/execute/dir_307126/glide_AwvrJV/execute/dir_9774
##############[Pegasus Lite] Figuring out the worker package to use ##############
2019-08-01 01:17:52: The job contained a Pegasus worker package
2019-08-01 01:17:52: Warning: worker package pegasus-worker-4.9.1-x86_64_rhel_6.tar.gz does not seem to match the system x86_64_rhel_7
2019-08-01 01:17:52: Using /cvmfs/oasis.opensciencegrid.org/osg/projects/pegasus/worker/4.9.1/x86_64_rhel_7 as worker package
########[Pegasus Lite] Writing out script to launch user task in container ########
2019-08-01 01:17:52: Copied credential $X509_USER_PROXY to /var/lib/condor/execute/dir_307126/glide_AwvrJV/execute/dir_9774/myproxy
2019-08-01 01:17:52: Set $X509_USER_PROXY to /scratch/myproxy (for inside the container)
2019-08-01 01:17:52: container file is salmonella_ice
pegasus-lite-common.sh: line 342: docker: command not found
2019-08-01 01:17:52: Unable to load image from salmonella_ice
2019-08-01 01:17:52: Last command exited with 1
PegasusLite: exitcode 1
In tc.txt I have:
cont salmonella_ice
tr ex_sra_run {
site condor_pool
}
And the sites are:
<site handle="local" arch="x86_64" os="LINUX">
<directory type="shared-scratch" path="${PWD}/scratch">
<file-server operation="all" url="file://${PWD}/scratch"/>
</directory>
<directory type="local-storage" path="${PWD}/outputs">
<file-server operation="all" url="file://${PWD}/outputs"/>
</directory>
</site>
<site handle="local-hcc" arch="x86_64" os="LINUX">
<directory type="shared-scratch" path="${PWD}/out">
<file-server operation="all" url="file://${PWD}/out"/>
</directory>
<profile namespace="pegasus" key="style">glite</profile>
<profile namespace="condor" key="grid_resource">batch slurm</profile>
<profile namespace="pegasus" key="queue">batch,tmp_anvil,devel</profile>
<profile namespace="env" key="PEGASUS_HOME">/usr</profile>
<profile namespace="env" key="PATH">/usr/bin:/bin:/sbin/</profile>
<profile namespace="condor" key="request_memory"> ifthenelse(isundefined(DAGNodeRetry) || DAGNodeRetry == 2000, 4000, 6000) </profile>
</site>
<site handle="condor_pool" arch="x86_64" os="LINUX">
<profile namespace="condor" key="requirements">HasSingularity == True</profile>
<profile namespace="pegasus" key="style" >condor</profile>
<profile namespace="condor" key="universe" >vanilla</profile>
<profile namespace="condor" key="request_memory" >2 GB</profile>
<profile namespace="condor" key="request_disk" >5 GB</profile>
</site>
Since the error is "pegasus-lite-common.sh: line 342: docker: command not found", is there something additional I need to specify in the requirements section?
2.
In tc.txt I initially had:
cont salmonella_ice
However, I got the error "Unable to pull docker://npavlovikj/ffh-workflow:latest: While searching for mksquashfs: exec: "mksquashfs": executable file not found in $PATH" during the staging step. We do have "mksquashfs" on both the login and worker nodes in "/sbin/", and I tried adding that directory to PATH in both dax.py and sites.xml, but I could not override the PATH reported by the workflow, which contains only "/usr/bin:/bin".
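To make the PATH issue concrete, here is a minimal sketch of how the lookup could be checked on a node, assuming Python 3 is available there. It only uses the standard library's `shutil.which`; the helper name `find_tool` is hypothetical and not part of Pegasus or Singularity.

```python
import shutil


def find_tool(name, path):
    """Return the full path of `name` if it is found on the given PATH string, else None."""
    # shutil.which searches the directories listed in `path` instead of os.environ["PATH"].
    return shutil.which(name, path=path)


if __name__ == "__main__":
    # First entry: the PATH the workflow reports. Second entry: the same PATH with /sbin added.
    for path in ("/usr/bin:/bin", "/usr/bin:/bin:/sbin"):
        print(path, "->", find_tool("mksquashfs", path))
```

If the first lookup prints `None` and the second prints a path under `/sbin`, that would match the error above: the binary exists, but the directory holding it is not on the PATH used during staging.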
3.
Because the above didn't work, and I cannot push to "shub" in a straightforward manner, I decided to use the link you created for me last time, "https://workflow.isi.edu/scratch/rynge/ffh-workflow_latest.sif". When I use the URL, I cannot set the type to either "docker" or "singularity", because it complains about the .sif extension (and if I omit the extension, it downloads "https://workflow.isi.edu/scratch/rynge/ffh-workflow_latest", which doesn't exist).
I tried the "Population Modeling using Containers" tutorial example you provided, using Singularity and OSG, and that example worked fine. Therefore, I wonder whether the error I get is caused by the type of source I use for the container. If uploading the image to the CVMFS Singularity repository is easier and will fix this, then I can do that. The image I have now is not the final one, but as long as the OSG image is updated automatically when I modify mine, and I don't need to bother anyone to do that for me, I am OK with doing that.