  1. Pegasus
  2. PM-1164

worker package in submit directory gets deleted during workflow run

    Details

      Description

      Any idea what could be causing the following hold?

      5508942.0 simonisa.selmon 2/14 11:18 Error from slot1_12@CRUSH-SUGWG-OSG-10-5-149-25: SHADOW at 128.230.146.18 failed to send file(s) to <128.230.18.42:9130>: error reading from /usr1/simonisa.selmon/pycbc-tmp.THiqywp09A/work/././o1-analysis-1-mean-v1.5.8-main_ID0000001.000/pegasus-worker-4.7.3-x86_64_rhel_7.tar.gz: (errno 2) No such file or

      Indeed it's not there:

      [root@sugwg-osg ~]# ls -l /usr1/simonisa.selmon/pycbc-tmp.THiqywp09A/work/././o1-analysis-1-mean-v1.5.8-main_ID0000001.000/pegasus-worker-4.7.3-x86_64_rhel_7.tar.gz
      ls: cannot access /usr1/simonisa.selmon/pycbc-tmp.THiqywp09A/work/././o1-analysis-1-mean-v1.5.8-main_ID0000001.000/pegasus-worker-4.7.3-x86_64_rhel_7.tar.gz: No such file or directory

      If I copy the missing worker package back in manually and release the held jobs, they run. The strange thing is that the failure only starts partway through the workflow. Here, it's working fine:

      02/14/17 01:39:41 Of 20154 nodes total:
      02/14/17 01:39:41    Done     Pre   Queued    Post   Ready   Un-Ready   Failed
      02/14/17 01:39:41     ===     ===      ===     ===     ===        ===      ===
      02/14/17 01:39:41   19339       0      257       1       0        556        1
      02/14/17 01:39:41 0 job proc(s) currently held
      02/14/17 01:39:41 Note: 5086742 total job deferrals because of -MaxIdle limit (5000)
      02/14/17 01:39:41 Note: 109 total POST script deferrals because of -MaxPost limit (20) or DEFER
      02/14/17 01:39:41 Initializing user log writer for /usr1/simonisa.selmon/pycbc-tmp.THiqywp09A/work/o1-analysis-1-mean-v1.5.8-main_ID0000001.000/./o1-analysis-1-mean-v1.5.8-main-0.dag.nodes.log, (5502693.0.0)
      02/14/17 01:39:46 Currently monitoring 1 HTCondor log file(s)
      02/14/17 01:39:46 Event: ULOG_POST_SCRIPT_TERMINATED for HTCondor Node inspiral-FULL_DATA-L1_ID17_ID0005579 (5502693.0.0) {02/14/17 01:39:41}
      02/14/17 01:39:46 POST Script of node inspiral-FULL_DATA-L1_ID17_ID0005579 completed successfully.

      But then suddenly it starts to fail:

      02/14/17 05:20:10 Submitting HTCondor Node page_ifar-OPEN_BOX-H1L1_ID36_ID0006568 job(s)...
      02/14/17 05:20:10 Adding a DAGMan workflow log /usr1/simonisa.selmon/pycbc-tmp.THiqywp09A/work/o1-analysis-1-mean-v1.5.8-main_ID0000001.000/./o1-analysis-1-mean-v1.5.8-main-0.dag.nodes.log
      02/14/17 05:20:10 Masking the events recorded in the DAGMAN workflow log
      02/14/17 05:20:10 Mask for workflow log is 0,1,2,4,5,7,9,10,11,12,13,16,17,24,27
      02/14/17 05:20:10 submitting: /usr/bin/condor_submit -a dag_node_name' '=' 'page_ifar-OPEN_BOX-H1L1_ID36_ID0006568 -a +DAGManJobId' '=' '5485353 -a DAGManJobId' '=' '5485353 -batch-name o1-analysis-1-mean-v1.5.8-0.dag+5485128 -a submit_event_notes' '=' 'DAG' 'Node:' 'page_ifar-OPEN_BOX-H1L1_ID36_ID0006568 -a dagman_log' '=' '/usr1/simonisa.selmon/pycbc-tmp.THiqywp09A/work/o1-analysis-1-mean-v1.5.8-main_ID0000001.000/./o1-analysis-1-mean-v1.5.8-main-0.dag.nodes.log -a +DAGManNodesMask' '=' '"0,1,2,4,5,7,9,10,11,12,13,16,17,24,27" -a priority=1000 -a DAG_STATUS' '=' '2 -a FAILED_COUNT' '=' '1 -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never -a +DAGParentNodeNames' '=' '"combine_statmap-FULL_DATA_FULL_CUMULATIVE_CAT_12H-H1L1_ID28_ID0006560" ./page_ifar-OPEN_BOX-H1L1_ID36_ID0006568.sub

      02/14/17 05:20:10 Reassigning the id of job page_ifar-OPEN_BOX-H1L1_ID36_ID0006568 from (5523174.0.0) to (5523174.0.0)
      02/14/17 05:20:10 Event: ULOG_SUBMIT for HTCondor Node page_ifar-OPEN_BOX-H1L1_ID36_ID0006568 (5523174.0.0) {02/14/17 05:20:10}

      02/14/17 05:21:05 Event: ULOG_SHADOW_EXCEPTION for HTCondor Node page_ifar-OPEN_BOX-H1L1_ID36_ID0006568 (5523174.0.0) {02/14/17 05:21:01}
      02/14/17 05:21:05 Number of idle job procs: 6
      02/14/17 05:21:05 Event: ULOG_JOB_HELD for HTCondor Node page_ifar-OPEN_BOX-H1L1_ID36_ID0006568 (5523174.0.0) {02/14/17 05:21:01}
      02/14/17 05:21:05 Hold reason: Error from slot1_10@CRUSH-SUGWG-OSG-10-5-148-243: SHADOW at 128.230.146.18 failed to send file(s) to <128.230.18.42:20622>: error reading from /usr1/simonisa.selmon/pycbc-tmp.THiqywp09A/work/././o1-analysis-1-mean-v1.5.8-main_ID0000001.000/pegasus-worker-4.7.3-x86_64_rhel_7.tar.gz: (errno 2) No such file or directory; STARTER failed to receive file(s) from <128.230.146.18:9615>
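
      To bracket when the file went missing, one option is to scan the DAGMan output for the first hold reason that reports errno 2. The following is only a rough sketch: the function name and the regular expressions are my own, written against the log lines quoted above, and the timestamp it returns is the time DAGMan logged the hold, not the time the file was removed.

```python
import re
from datetime import datetime

# Matches the "error reading from <path>: (errno 2)" fragment of the hold reason.
MISSING_FILE_RE = re.compile(r"error reading from (?P<path>\S+): \(errno 2\)")
# DAGMan output lines start with an "MM/DD/YY HH:MM:SS" timestamp.
TIMESTAMP_RE = re.compile(r"^\s*(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ")


def first_missing_file_hold(lines):
    """Return (timestamp, path) for the first hold caused by a missing file.

    Returns None if no "Hold reason" line reports errno 2.
    """
    for line in lines:
        if "Hold reason:" not in line:
            continue
        path_match = MISSING_FILE_RE.search(line)
        time_match = TIMESTAMP_RE.match(line)
        if path_match and time_match:
            when = datetime.strptime(time_match.group(1), "%m/%d/%y %H:%M:%S")
            return when, path_match.group("path")
    return None
```

      Passing the lines of the workflow's DAGMan output file to `first_missing_file_hold` gives the earliest logged failure, which can then be compared against the last node that is known to have transferred the worker package successfully.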

      I've seen the same thing with one of my DAGs, but I just wrote it off as me doing something stupid.

      Any idea what is removing the worker package? There's plenty of disk space:

      [root@sugwg-osg o1-analysis-1-mean-v1.5.8-main_ID0000001]# df -h /usr1/
      Filesystem      Size  Used Avail Use% Mounted on
      /dev/sdb1       2.0T  656G  1.4T  33% /usr1
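
      Until the cause is found, the manual workaround mentioned earlier (copy the worker package back, then release the held jobs) could be scripted roughly as below. This is a sketch under assumptions: both function names are hypothetical, a spare copy of the tarball is assumed to exist somewhere outside the submit directory, and the constraint releases every held job of the given owner, including any held for unrelated reasons.

```python
import shutil
from pathlib import Path
from typing import List


def restore_worker_package(backup: Path, submit_dir: Path) -> bool:
    """Copy the worker package into submit_dir if it is missing there.

    Returns True if a copy was made, False if the file already existed.
    """
    target = submit_dir / backup.name
    if target.exists():
        return False
    shutil.copy2(backup, target)
    return True


def release_command(owner: str) -> List[str]:
    """Build a condor_release call for all held jobs (JobStatus == 5) of one owner."""
    constraint = f'Owner == "{owner}" && JobStatus == 5'
    return ["condor_release", "-constraint", constraint]
```

      The list returned by `release_command` can be handed to `subprocess.run(..., check=True)` on the submit host once `restore_worker_package` has put the tarball back.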

            People

            • Assignee: Karan Vahi (vahi)
            • Reporter: Duncan Brown (dbrown)