Type: Bug
Resolution: Fixed
Priority: Blocker
Affects Version/s: master, 4.6.0
Component/s: Pegasus Planner
Environment: None

pegasus-remove on a hierarchical workflow: jobs in the sub workflow keep running in the queue.
I am able to replicate it on 8.4.4: the job in the sub workflow is not removed, even though the dagman job is removed.
However, we need input from Kent on what we are missing.
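For reference, a hedged reproduction sketch (the exact command used is not recorded in this report; the run directory is inferred from the dagman logs below):

# Presumed invocation: remove the top-level workflow by pointing pegasus-remove at its submit directory.
pegasus-remove /local-scratch/vahi/work/local-hierarchy/work/dags/vahi/pegasus/local-hierarchy/run0006
# This condor_rm's the top-level pegasus-dagman job (857944 below); the expectation is that
# the sub workflow's dagman and its jobs get removed as well.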
-- Schedd: workflow.isi.edu : <128.9.132.79:38687?...
 ID       OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
 857944.0 vahi    4/6  12:02   0+00:00:00 X  0   0.0  pegasus-dagman -p
 857949.0 vahi    4/6  12:03   0+00:00:27 R  10  0.1  pegasus-kickstart
I am attaching the submit directory for my test workflow.
tail -n 25 local-hierarchy-0.dag.dagman.out
04/06/16 12:02:57 2 0 1 0 0 3 0
04/06/16 12:02:57 0 job proc(s) currently held
04/06/16 12:03:02 Currently monitoring 1 Condor log file(s)
04/06/16 12:03:02 Event: ULOG_EXECUTE for Condor Node subdax_inner_ID0000002 (857947.0.0)
04/06/16 12:03:02 Number of idle job procs: 0
04/06/16 12:03:35 Received SIGUSR1
04/06/16 12:03:35 Aborting DAG...
04/06/16 12:03:35 Writing Rescue DAG to local-hierarchy-0.dag.rescue001...
04/06/16 12:03:35 Removing submitted jobs...
04/06/16 12:03:35 Note: 0 total job deferrals because of -MaxJobs limit (0)
04/06/16 12:03:35 Note: 0 total job deferrals because of -MaxIdle limit (1000)
04/06/16 12:03:35 Note: 0 total job deferrals because of node category throttles
04/06/16 12:03:35 Note: 0 total PRE script deferrals because of -MaxPre limit (20) or DEFER
04/06/16 12:03:35 Note: 0 total POST script deferrals because of -MaxPost limit (20) or DEFER
04/06/16 12:03:35 DAG status: 4 (DAG_STATUS_RM)
04/06/16 12:03:35 Of 6 nodes total:
04/06/16 12:03:35 Done Pre Queued Post Ready Un-Ready Failed
04/06/16 12:03:35 === === === === === === ===
04/06/16 12:03:35 2 0 1 0 0 3 0
04/06/16 12:03:35 0 job proc(s) currently held
04/06/16 12:03:35 Wrote metrics file local-hierarchy-0.dag.metrics.
04/06/16 12:03:35 Reporting metrics to Pegasus metrics server(s); output is in local-hierarchy-0.dag.metrics.out.
04/06/16 12:03:35 Running command </usr/libexec/condor/condor_dagman_metrics_reporter -f local-hierarchy-0.dag.metrics -s -t 100>
04/06/16 12:03:35 Warning: mysin has length 0 (ignore if produced by DAGMan; see gittrac #4987, #5031)
04/06/16 12:03:35 **** condor_scheduniv_exec.857944.0 (condor_DAGMAN) pid 2634796 EXITING WITH STATUS 2
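The SIGUSR1 / rescue DAG / "Removing submitted jobs" sequence above is the normal DAGMan removal path: condor_rm does not kill DAGMan outright, it delivers the configured remove signal so DAGMan can clean up. A quick check on the attached submit directory (a sketch, assuming the conventional <dag>.condor.sub file name produced by condor_submit_dag; not verified against this run):

# Confirm which signal a condor_rm of the top-level DAGMan delivers.
grep -i kill_sig local-hierarchy-0.dag.condor.sub
# Expected for a standard DAGMan submit file: remove_kill_sig = SIGUSR1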
-bash-4.1$ tail -n 50 ./inner_ID0000002.000/inner-0.dag.dagman.out
04/06/16 12:03:12 Currently monitoring 1 Condor log file(s)
04/06/16 12:03:12 Event: ULOG_POST_SCRIPT_TERMINATED for Condor Node create_dir_inner_0_local (857948.0.0)
04/06/16 12:03:12 POST Script of Node create_dir_inner_0_local completed successfully.
04/06/16 12:03:12 DAG status: 0 (DAG_STATUS_OK)
04/06/16 12:03:12 Of 3 nodes total:
04/06/16 12:03:12 Done Pre Queued Post Ready Un-Ready Failed
04/06/16 12:03:12 === === === === === === ===
04/06/16 12:03:12 1 0 0 0 1 1 0
04/06/16 12:03:12 0 job proc(s) currently held
04/06/16 12:03:17 Submitting Condor Node sleep_ID0000001 job(s)...
04/06/16 12:03:17 Adding a DAGMan workflow log /local-scratch/vahi/work/local-hierarchy/work/dags/vahi/pegasus/local-hierarchy/run0006/inner_ID0000002.000/./inner-0.dag.nodes.log
04/06/16 12:03:17 Masking the events recorded in the DAGMAN workflow log
04/06/16 12:03:17 Mask for workflow log is 0,1,2,4,5,7,9,10,11,12,13,16,17,24,27
04/06/16 12:03:17 submitting: /usr/bin/condor_submit -a dag_node_name' '=' 'sleep_ID0000001 -a +DAGManJobId' '=' '857947 -a DAGManJobId' '=' '857947 -a submit_event_notes' '=' 'DAG' 'Node:' 'sleep_ID0000001 -a dagman_log' '=' '/local-scratch/vahi/work/local-hierarchy/work/dags/vahi/pegasus/local-hierarchy/run0006/inner_ID0000002.000/./inner-0.dag.nodes.log -a +DAGManNodesMask' '=' '"0,1,2,4,5,7,9,10,11,12,13,16,17,24,27" -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"create_dir_inner_0_local" -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never sleep_ID0000001.sub
04/06/16 12:03:17 From submit: Submitting job(s).
04/06/16 12:03:17 From submit: 1 job(s) submitted to cluster 857949.
04/06/16 12:03:17 assigned Condor ID (857949.0.0)
04/06/16 12:03:17 Just submitted 1 job this cycle...
04/06/16 12:03:17 Currently monitoring 1 Condor log file(s)
04/06/16 12:03:17 Reassigning the id of job sleep_ID0000001 from (857949.0.0) to (857949.0.0)
04/06/16 12:03:17 Event: ULOG_SUBMIT for Condor Node sleep_ID0000001 (857949.0.0)
04/06/16 12:03:17 Number of idle job procs: 1
04/06/16 12:03:17 DAG status: 0 (DAG_STATUS_OK)
04/06/16 12:03:17 Of 3 nodes total:
04/06/16 12:03:17 Done Pre Queued Post Ready Un-Ready Failed
04/06/16 12:03:17 === === === === === === ===
04/06/16 12:03:17 1 0 1 0 0 1 0
04/06/16 12:03:17 0 job proc(s) currently held
04/06/16 12:03:22 Currently monitoring 1 Condor log file(s)
04/06/16 12:03:22 Event: ULOG_EXECUTE for Condor Node sleep_ID0000001 (857949.0.0)
04/06/16 12:03:22 Number of idle job procs: 0
04/06/16 12:03:35 Received SIGUSR1
04/06/16 12:03:35 Aborting DAG...
04/06/16 12:03:35 Writing Rescue DAG to inner-0.dag.rescue001...
04/06/16 12:03:35 Removing submitted jobs...
04/06/16 12:03:35 Note: 0 total job deferrals because of -MaxJobs limit (0)
04/06/16 12:03:35 Note: 0 total job deferrals because of -MaxIdle limit (1000)
04/06/16 12:03:35 Note: 0 total job deferrals because of node category throttles
04/06/16 12:03:35 Note: 0 total PRE script deferrals because of -MaxPre limit (20) or DEFER
04/06/16 12:03:35 Note: 0 total POST script deferrals because of -MaxPost limit (20) or DEFER
04/06/16 12:03:35 DAG status: 4 (DAG_STATUS_RM)
04/06/16 12:03:35 Of 3 nodes total:
04/06/16 12:03:35 Done Pre Queued Post Ready Un-Ready Failed
04/06/16 12:03:35 === === === === === === ===
04/06/16 12:03:35 1 0 1 0 0 1 0
04/06/16 12:03:35 0 job proc(s) currently held
04/06/16 12:03:35 Wrote metrics file inner-0.dag.metrics.
04/06/16 12:03:35 Reporting metrics to Pegasus metrics server(s); output is in inner-0.dag.metrics.out.
04/06/16 12:03:35 Running command </usr/libexec/condor/condor_dagman_metrics_reporter -f inner-0.dag.metrics -s -t 100>
04/06/16 12:03:35 **** condor_scheduniv_exec.857947.0 (condor_DAGMAN) pid 2661589 EXITING WITH STATUS 2
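So the inner DAGMan (857947) also received SIGUSR1 and logged "Removing submitted jobs...", yet the sleep job (857949) is the one still shown running in the condor_q output above. One way to locate, and if necessary manually clear, such leftover sub-workflow jobs is to key off the DAGManJobId attribute that DAGMan passes to condor_submit (visible in the submit line above); this is a workaround sketch, not the fix that was committed for this ticket:

# List any jobs still queued under the inner DAGMan (cluster 857947).
condor_q -constraint 'DAGManJobId == 857947'
# Remove them by hand if the hierarchy was not cleaned up (workaround only).
condor_rm -constraint 'DAGManJobId == 857947'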
blocks: PM-1085 "-p 0 options for condor_dagman sub dax jobs result in dagman (8.2.8) dying" (Resolved)