Pegasus Lite jobs fail at CIT if there is a lost+found directory in the HTCondor scratch dir

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: master, 5.1.0, 5.0.8
    • Affects Version/s: master, 5.0.7
    • Component/s: Pegasus Lite

      I am seeing failures that I think are caused by a combination of an issue in Pegasus and configuration changes on CIT ... I think these failures would also happen with older Pegasus versions (so we may have problems if ldas-osg is set up the same way as the ldas-pcdev4 testing submit machine).

      When running the bash function `pegasus_lite_setup_work_dir` (which all PyCBC jobs do), every job fails. The failure is in this bash command:

                  # PM-968 if provided, copy lof files from the HTCondor iwd to the PegasusLite work dir
                  find $pegasus_lite_start_dir -name *.lof -exec cp {} $pegasus_lite_work_dir/ \; >/dev/null 2>&1

      which fails with "find: ‘/var/lib/condor/execute/dir_711539/lost+found’: Permission denied". Because find exits with a non-zero status when it cannot read a directory, the job stops and fails (even though I think this error could safely be ignored). Some thoughts:
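To see why the lost+found directory matters, the sketch below builds a throwaway directory standing in for the HTCondor scratch dir (the layout and file names are made up for illustration) and compares the original recursive `find` with a `-maxdepth 1` search. It assumes GNU find, which exits with a non-zero status whenever it cannot read a directory it tries to descend into. The permission error only reproduces for a non-root user, since root can read a mode-000 directory.

```shell
#!/usr/bin/env bash
# Hypothetical reproduction: a throwaway temp dir stands in for
# $pegasus_lite_start_dir (the HTCondor scratch dir on the execute node).
set -u

start_dir=$(mktemp -d)
mkdir "$start_dir/subdir" "$start_dir/lost+found"
touch "$start_dir/job.lof" "$start_dir/subdir/nested.lof"
# Mimic a root-owned lost+found by removing all permissions
# (this has no effect when the script itself runs as root).
chmod 000 "$start_dir/lost+found"

# Original shape: find recurses into every subdirectory.
recursive_list=$(find "$start_dir" -name '*.lof' 2>/dev/null)
recursive_status=$?
recursive_count=$(printf '%s\n' "$recursive_list" | grep -c '\.lof$')

# Proposed shape: -maxdepth 1 never opens subdirectories.
top_list=$(find "$start_dir" -maxdepth 1 -name '*.lof' 2>/dev/null)
top_status=$?
top_count=$(printf '%s\n' "$top_list" | grep -c '\.lof$')

echo "recursive:  $recursive_count file(s), exit status $recursive_status"
echo "maxdepth 1: $top_count file(s), exit status $top_status"

# Restore permissions so the temp tree can be removed.
chmod 755 "$start_dir/lost+found"
rm -rf "$start_dir"
```

Run as an ordinary user, the recursive search still lists both .lof files but exits with a non-zero status because of the unreadable directory. The -maxdepth 1 search lists only the top-level file and exits with status 0.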

      • On the Pegasus end, I think you should only search $pegasus_lite_start_dir itself for these files, not its subdirectories. Changing the command to:

              find "$pegasus_lite_start_dir" -maxdepth 1 -name '*.lof' -exec cp {} "$pegasus_lite_work_dir"/ \; >/dev/null 2>&1

      fixes the issue for me. Note that the same command also appears for *.meta files two lines later and needs the same change there. If this change makes sense, could we get it into the 5.0.8 release?
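Applied to both copy steps, the suggested change might look like the sketch below. The helper name `copy_top_level_inputs` and the temporary directories are hypothetical and only illustrate the idea; this is not the code that shipped in Pegasus. Compared with the one-liners above, the sketch also quotes the directory variables and the `-name` pattern, so the shell cannot expand `*.lof` itself before `find` sees it.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not the code that shipped in Pegasus: copy the
# top-level *.lof and *.meta files from the start dir into the work dir.
copy_top_level_inputs() {
    local start_dir=$1
    local work_dir=$2
    local pattern
    for pattern in '*.lof' '*.meta'; do
        # -maxdepth 1 limits the search to entries directly inside
        # start_dir, so subdirectories such as lost+found are never read.
        # The quoted pattern reaches find without shell glob expansion.
        # "|| return 1" stops only if find itself reports an error.
        find "$start_dir" -maxdepth 1 -name "$pattern" \
            -exec cp {} "$work_dir"/ \; >/dev/null 2>&1 || return 1
    done
}

# Usage with throwaway directories standing in for the real ones.
pegasus_lite_start_dir=$(mktemp -d)
pegasus_lite_work_dir=$(mktemp -d)
mkdir "$pegasus_lite_start_dir/subdir"
touch "$pegasus_lite_start_dir/inputs.lof" \
      "$pegasus_lite_start_dir/job.meta" \
      "$pegasus_lite_start_dir/subdir/ignored.lof"

copy_top_level_inputs "$pegasus_lite_start_dir" "$pegasus_lite_work_dir"
copy_status=$?
copied=$(ls "$pegasus_lite_work_dir")
echo "exit status $copy_status, copied:"
echo "$copied"

rm -rf "$pegasus_lite_start_dir" "$pegasus_lite_work_dir"
```

Only inputs.lof and job.meta end up in the work dir; subdir/ignored.lof is skipped because it sits below the top level of the start dir.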

            Assignee: Mats Rynge
            Reporter: Ian Harry
            Archiver: Rajiv Mayani
