Type: Bug
Resolution: Fixed
Priority: Blocker
Affects Version/s: master, 4.7.3
Component/s: Pegasus Lite, Pegasus Planner
Any idea what could be causing this:
5508942.0 simonisa.selmon 2/14 11:18 Error from slot1_12@CRUSH-SUGWG-OSG-10-5-149-25: SHADOW at 128.230.146.18 failed to send file(s) to <128.230.18.42:9130>: error reading from /usr1/simonisa.selmon/pycbc-tmp.THiqywp09A/work/././o1-analysis-1-mean-v1.5.8-main_ID0000001.000/pegasus-worker-4.7.3-x86_64_rhel_7.tar.gz: (errno 2) No such file or directory
Indeed it's not there:
[root@sugwg-osg ~]# ls -l /usr1/simonisa.selmon/pycbc-tmp.THiqywp09A/work/././o1-analysis-1-mean-v1.5.8-main_ID0000001.000/pegasus-worker-4.7.3-x86_64_rhel_7.tar.gz
ls: cannot access /usr1/simonisa.selmon/pycbc-tmp.THiqywp09A/work/././o1-analysis-1-mean-v1.5.8-main_ID0000001.000/pegasus-worker-4.7.3-x86_64_rhel_7.tar.gz: No such file or directory
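In case it helps with restoring it by hand: assuming the planner staged the same worker package into the other sub-workflow directories (a guess, not something I've verified), a search under the scratch root from the log above should turn up another copy:
find /usr1/simonisa.selmon/pycbc-tmp.THiqywp09A -name 'pegasus-worker-4.7.3-*.tar.gz' 2>/dev/null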
If I copy the missing worker package in manually and release the jobs, then it works. The strange thing is that this breaks partway through the workflow. Here it's working fine:
02/14/17 01:39:41 Of 20154 nodes total:
02/14/17 01:39:41  Done   Pre  Queued   Post  Ready  Un-Ready  Failed
02/14/17 01:39:41   ===   ===     ===    ===    ===       ===     ===
02/14/17 01:39:41 19339     0     257      1      0       556       1
02/14/17 01:39:41 0 job proc(s) currently held
02/14/17 01:39:41 Note: 5086742 total job deferrals because of -MaxIdle limit (5000)
02/14/17 01:39:41 Note: 109 total POST script deferrals because of -MaxPost limit (20) or DEFER
02/14/17 01:39:41 Initializing user log writer for /usr1/simonisa.selmon/pycbc-tmp.THiqywp09A/work/o1-analysis-1-mean-v1.5.8-main_ID0000001.000/./o1-analysis-1-mean-v1.5.8-main-0.dag.nodes.log, (5502693.0.0)
02/14/17 01:39:46 Currently monitoring 1 HTCondor log file(s)
02/14/17 01:39:46 Event: ULOG_POST_SCRIPT_TERMINATED for HTCondor Node inspiral-FULL_DATA-L1_ID17_ID0005579 (5502693.0.0)
02/14/17 01:39:46 POST Script of node inspiral-FULL_DATA-L1_ID17_ID0005579 completed successfully.
But then suddenly it starts to fail:
02/14/17 05:20:10 Submitting HTCondor Node page_ifar-OPEN_BOX-H1L1_ID36_ID0006568 job(s)...
02/14/17 05:20:10 Adding a DAGMan workflow log /usr1/simonisa.selmon/pycbc-tmp.THiqywp09A/work/o1-analysis-1-mean-v1.5.8-main_ID0000001.000/./o1-analysis-1-mean-v1.5.8-main-0.dag.nodes.log
02/14/17 05:20:10 Masking the events recorded in the DAGMAN workflow log
02/14/17 05:20:10 Mask for workflow log is 0,1,2,4,5,7,9,10,11,12,13,16,17,24,27
02/14/17 05:20:10 submitting: /usr/bin/condor_submit -a dag_node_name' '=' 'page_ifar-OPEN_BOX-H1L1_ID36_ID0006568 -a +DAGManJobId' '=' '5485353 -a DAGManJobId' '=' '5485353 -batch-name o1-analysis-1-mean-v1.5.8-0.dag+5485128 -a submit_event_notes' '=' 'DAG' 'Node:' 'page_ifar-OPEN_BOX-H1L1_ID36_ID0006568 -a dagman_log' '=' '/usr1/simonisa.selmon/pycbc-tmp.THiqywp09A/work/o1-analysis-1-mean-v1.5.8-main_ID0000001.000/./o1-analysis-1-mean-v1.5.8-main-0.dag.nodes.log -a +DAGManNodesMask' '=' '"0,1,2,4,5,7,9,10,11,12,13,16,17,24,27" -a priority=1000 -a DAG_STATUS' '=' '2 -a FAILED_COUNT' '=' '1 -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never -a +DAGParentNodeNames' '=' '"combine_statmap-FULL_DATA_FULL_CUMULATIVE_CAT_12H-H1L1_ID28_ID0006560" ./page_ifar-OPEN_BOX-H1L1_ID36_ID0006568.sub
02/14/17 05:20:10 Reassigning the id of job page_ifar-OPEN_BOX-H1L1_ID36_ID0006568 from (5523174.0.0) to (5523174.0.0)
02/14/17 05:20:10 Event: ULOG_SUBMIT for HTCondor Node page_ifar-OPEN_BOX-H1L1_ID36_ID0006568 (5523174.0.0)
02/14/17 05:21:05 Event: ULOG_SHADOW_EXCEPTION for HTCondor Node page_ifar-OPEN_BOX-H1L1_ID36_ID0006568 (5523174.0.0)
02/14/17 05:21:05 Number of idle job procs: 6
02/14/17 05:21:05 Event: ULOG_JOB_HELD for HTCondor Node page_ifar-OPEN_BOX-H1L1_ID36_ID0006568 (5523174.0.0)
02/14/17 05:21:05 Hold reason: Error from slot1_10@CRUSH-SUGWG-OSG-10-5-148-243: SHADOW at 128.230.146.18 failed to send file(s) to <128.230.18.42:20622>: error reading from /usr1/simonisa.selmon/pycbc-tmp.THiqywp09A/work/././o1-analysis-1-mean-v1.5.8-main_ID0000001.000/pegasus-worker-4.7.3-x86_64_rhel_7.tar.gz: (errno 2) No such file or directory; STARTER failed to receive file(s) from <128.230.146.18:9615>
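A quick way to confirm that every held job is stuck on the same missing tarball (plain condor_q, run on the submit host) would be something like:
condor_q simonisa.selmon -constraint 'JobStatus == 5' -af HoldReason | sort | uniq -c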
I've seen the same thing with one of my DAGs, but I just wrote it off as me doing something stupid.
Any idea what is removing the worker package? There's plenty of disk:
[root@sugwg-osg o1-analysis-1-mean-v1.5.8-main_ID0000001]# df -h /usr1/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       2.0T  656G  1.4T  33% /usr1
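For reference, the manual workaround mentioned above amounts to copying a good tarball back in and releasing the held jobs, and something like inotifywait can then sit on the directory to catch whatever deletes the file again (inotify-tools being installed is an assumption; the destination and watch paths are the ones from the logs above, the cp source is a placeholder for a known-good copy, e.g. one turned up by the find earlier):
# copy a known-good worker package back into the sub-workflow directory
cp /path/to/pegasus-worker-4.7.3-x86_64_rhel_7.tar.gz /usr1/simonisa.selmon/pycbc-tmp.THiqywp09A/work/o1-analysis-1-mean-v1.5.8-main_ID0000001.000/
# release this user's held jobs
condor_release -constraint 'JobStatus == 5 && Owner == "simonisa.selmon"'
# watch the directory for the next deletion (requires inotify-tools)
inotifywait -m -e delete,moved_from /usr1/simonisa.selmon/pycbc-tmp.THiqywp09A/work/o1-analysis-1-mean-v1.5.8-main_ID0000001.000/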