Pegasus / PM-1384

.sif Singularity images (naming issue?)


    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • master, 4.9.2
    • Affects Version/s: master, 4.9.2
    • Component/s: None
    • None

      I tried to submit a small Pegasus workflow to OSG with the Singularity image I created earlier, but I encountered some errors.
      I believe I am missing something, so I would appreciate it if you could give me some hints when you have time.

      1.
      The error I am getting is:
      -------------Job stderr file - 00/00/ex_sra_run_ID0000001.err.001-------------

      2019-08-01 01:17:52: PegasusLite: version 4.9.1
      2019-08-01 01:17:52: Executing on host hcc-5139558.0-red-c7107.unl.edu OSG_SITE_NAME=Red GLIDEIN_Site=Nebraska GLIDEIN_ResourceName=Nebraska

      ########################[Pegasus Lite] Setting up workdir ########################
      2019-08-01 01:17:52: Not creating a new work directory as it is already set to /var/lib/condor/execute/dir_307126/glide_AwvrJV/execute/dir_9774

      ##############[Pegasus Lite] Figuring out the worker package to use ##############
      2019-08-01 01:17:52: The job contained a Pegasus worker package
      2019-08-01 01:17:52: Warning: worker package pegasus-worker-4.9.1-x86_64_rhel_6.tar.gz does not seem to match the system x86_64_rhel_7
      2019-08-01 01:17:52: Using /cvmfs/oasis.opensciencegrid.org/osg/projects/pegasus/worker/4.9.1/x86_64_rhel_7 as worker package

      ########[Pegasus Lite] Writing out script to launch user task in container ########
      2019-08-01 01:17:52: Copied credential $X509_USER_PROXY to /var/lib/condor/execute/dir_307126/glide_AwvrJV/execute/dir_9774/myproxy
      2019-08-01 01:17:52: Set $X509_USER_PROXY to /scratch/myproxy (for inside the container)
      2019-08-01 01:17:52: container file is salmonella_ice
      pegasus-lite-common.sh: line 342: docker: command not found
      2019-08-01 01:17:52: Unable to load image from salmonella_ice
      2019-08-01 01:17:52: Last command exited with 1
      PegasusLite: exitcode 1

      In tc.txt I have:
      cont salmonella_ice {
          # type "singularity"
          image "https://workflow.isi.edu/scratch/rynge/ffh-workflow_latest.sif"
      }

      tr ex_sra_run {
          site condor_pool {
              type "INSTALLED"
              container "salmonella_ice"
              pfn "file:///opt/anaconda/bin/fastq-dump"
          }
      }

      And the sites are:
      <site handle="local" arch="x86_64" os="LINUX">
          <directory type="shared-scratch" path="${PWD}/scratch">
              <file-server operation="all" url="file://${PWD}/scratch"/>
          </directory>
          <directory type="local-storage" path="${PWD}/outputs">
              <file-server operation="all" url="file://${PWD}/outputs"/>
          </directory>
      </site>

      <site handle="local-hcc" arch="x86_64" os="LINUX">
          <directory type="shared-scratch" path="${PWD}/out">
              <file-server operation="all" url="file://${PWD}/out"/>
          </directory>
          <profile namespace="pegasus" key="style">glite</profile>
          <profile namespace="condor" key="grid_resource">batch slurm</profile>
          <profile namespace="pegasus" key="queue">batch,tmp_anvil,devel</profile>
          <profile namespace="env" key="PEGASUS_HOME">/usr</profile>
          <profile namespace="env" key="PATH">/usr/bin:/bin:/sbin/</profile>
          <profile namespace="condor" key="request_memory"> ifthenelse(isundefined(DAGNodeRetry) || DAGNodeRetry == 2000, 4000, 6000) </profile>
      </site>

      <site handle="condor_pool" arch="x86_64" os="LINUX">
          <profile namespace="condor" key="requirements">HasSingularity == True</profile>
          <profile namespace="pegasus" key="style">condor</profile>
          <profile namespace="condor" key="universe">vanilla</profile>
          <profile namespace="condor" key="request_memory">2 GB</profile>
          <profile namespace="condor" key="request_disk">5 GB</profile>
      </site>
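
      As a quick sanity check on the condor_pool entry, the profiles can be listed with a short Python sketch. This is only an illustration: the `<sites>` wrapper element and the helper name `site_profiles` are assumptions added here so the fragment parses on its own, and a real Pegasus site catalog has its own root element and XML namespace.

```python
import xml.etree.ElementTree as ET

# The condor_pool entry from this report, wrapped in a hypothetical <sites>
# root element so that the fragment is a well-formed document.
SITES_XML = """
<sites>
  <site handle="condor_pool" arch="x86_64" os="LINUX">
    <profile namespace="condor" key="requirements">HasSingularity == True</profile>
    <profile namespace="pegasus" key="style">condor</profile>
    <profile namespace="condor" key="universe">vanilla</profile>
    <profile namespace="condor" key="request_memory">2 GB</profile>
    <profile namespace="condor" key="request_disk">5 GB</profile>
  </site>
</sites>
"""


def site_profiles(xml_text, handle):
    """Return {(namespace, key): value} for the site with the given handle."""
    root = ET.fromstring(xml_text)
    for site in root.iter("site"):
        if site.get("handle") == handle:
            return {
                (p.get("namespace"), p.get("key")): (p.text or "").strip()
                for p in site.iter("profile")
            }
    return {}


profiles = site_profiles(SITES_XML, "condor_pool")
for (namespace, key), value in profiles.items():
    print(f"{namespace}.{key} = {value}")
```

      The listing shows that the only container-related setting on this site is the HasSingularity requirement; nothing in the site entry itself asks for Docker.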

      Since the error is "pegasus-lite-common.sh: line 342: docker: command not found", is there something additional I need to specify in the requirements section?

      2.
      In tc.txt I initially had:
      cont salmonella_ice {
          type "singularity"
          image "docker://npavlovikj/ffh-workflow:latest"
      }

      However, I got the following error during the staging step: "Unable to pull docker://npavlovikj/ffh-workflow:latest: While searching for mksquashfs: exec: "mksquashfs": executable file not found in $PATH". We do have "mksquashfs" in "/sbin/" on both the login and worker nodes, and I tried adding that directory to PATH in dax.py and sites.xml, but I couldn't override the PATH variable shown by the workflow, which only contains "/usr/bin:/bin".
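
      The PATH observation can be reproduced outside the workflow with a small Python sketch. It uses only the two PATH values mentioned in this report ("/usr/bin:/bin" as seen by the workflow, and the same value with "/sbin" appended); the helper name `find_tool` is hypothetical, and the result depends on the machine it runs on (on systems where /sbin is a symlink into /usr, both lookups may succeed).

```python
import shutil


def find_tool(tool, path):
    """Return the full path of `tool` if it is found on `path`, else None."""
    return shutil.which(tool, path=path)


# PATH as reported by the workflow, and the same PATH with /sbin appended.
workflow_path = "/usr/bin:/bin"
extended_path = workflow_path + ":/sbin"

for label, path in (("workflow PATH", workflow_path),
                    ("with /sbin added", extended_path)):
    print(f"{label}: mksquashfs -> {find_tool('mksquashfs', path)}")
```

      If the first lookup prints None and the second prints a path, the tool is present but simply not visible with the PATH the job was given.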

      3.
      Because the above didn't work, and I cannot push to "shub" in a straightforward manner, I decided to use the link you created for me last time, "https://workflow.isi.edu/scratch/rynge/ffh-workflow_latest.sif". When I use the URL, I cannot set the type to either "docker" or "singularity", because it complains about the .sif extension (if I drop the extension, it downloads "https://workflow.isi.edu/scratch/rynge/ffh-workflow_latest", which doesn't exist).
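
      To make the naming question concrete, here is a small Python sketch of what extension is visible in each image reference from this report. It does not reproduce the check Pegasus performs; the helper name `image_extension` is hypothetical, and it only splits the URL path the way a filename-based check plausibly would.

```python
import posixpath
from urllib.parse import urlparse


def image_extension(url):
    """Return the filename extension in the URL's path ('' if there is none)."""
    return posixpath.splitext(urlparse(url).path)[1]


# Image references taken from this report.
images = [
    "https://workflow.isi.edu/scratch/rynge/ffh-workflow_latest.sif",
    "https://workflow.isi.edu/scratch/rynge/ffh-workflow_latest",
    "docker://npavlovikj/ffh-workflow:latest",
]

for url in images:
    print(f"{url} -> extension {image_extension(url)!r}")
```

      Only the first reference carries an extension (".sif"); the other two have none, so any logic keyed on the file suffix would treat them differently.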

      I tried the "Population Modeling using Containers" tutorial example you have provided using Singularity and OSG, and that example worked fine. Therefore, I wonder if the error I get is because of the type of source I use for the container? If uploading the image to the CVMFS Singularity repository is easier and will fix this, then I can do that. The image I have now is not the final one, but as long as the OSG image is automatically updated when I modify mine and I don't need to bother anyone to do that for me, I am OK with doing that.

            Assignee: Karan Vahi (vahi)
            Reporter: Mats Rynge (rynge)