pegasus-remove on hierarchical workflows results in jobs from the sub workflows still in the condor queue


    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: master, 4.7.0, 4.6.1
    • Affects Version/s: master, 4.6.0
    • Component/s: Pegasus Planner
    • Labels: None
    • Environment:
      After pegasus-remove on a hierarchical workflow, jobs from the sub workflow are still running in the queue.

      I am able to replicate it on HTCondor 8.4.4: the job from the sub workflow is not removed, even though the dagman job is removed.

      However, we need input from Kent on what we are missing

      -- Schedd: workflow.isi.edu : <128.9.132.79:38687?...
      ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
      857944.0 vahi 4/6 12:02 0+00:00:00 X 0 0.0 pegasus-dagman -p
      857949.0 vahi 4/6 12:03 0+00:00:27 R 10 0.1 pegasus-kickstart

      I am attaching the submit directory for my test workflow
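
      As a stop-gap, the leftover jobs can be listed and removed by hand. This is only a sketch: it assumes the node jobs carry the DAGManJobId attribute that DAGMan passes at submit time (visible in the condor_submit line in the inner dagman.out below), and 857947 is the cluster id of the inner dagman in this particular run:

      # list node jobs that were submitted by the (now removed) inner dagman
      condor_q -constraint 'DAGManJobId == 857947'

      # remove them explicitly; quote the value instead if DAGManJobId is stored as a string in your pool
      condor_rm -constraint 'DAGManJobId == 857947'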

      tail -n 25 local-hierarchy-0.dag.dagman.out
      04/06/16 12:02:57 2 0 1 0 0 3 0
      04/06/16 12:02:57 0 job proc(s) currently held
      04/06/16 12:03:02 Currently monitoring 1 Condor log file(s)
      04/06/16 12:03:02 Event: ULOG_EXECUTE for Condor Node subdax_inner_ID0000002 (857947.0.0)
      04/06/16 12:03:02 Number of idle job procs: 0
      04/06/16 12:03:35 Received SIGUSR1
      04/06/16 12:03:35 Aborting DAG...
      04/06/16 12:03:35 Writing Rescue DAG to local-hierarchy-0.dag.rescue001...
      04/06/16 12:03:35 Removing submitted jobs...
      04/06/16 12:03:35 Note: 0 total job deferrals because of -MaxJobs limit (0)
      04/06/16 12:03:35 Note: 0 total job deferrals because of -MaxIdle limit (1000)
      04/06/16 12:03:35 Note: 0 total job deferrals because of node category throttles
      04/06/16 12:03:35 Note: 0 total PRE script deferrals because of -MaxPre limit (20) or DEFER
      04/06/16 12:03:35 Note: 0 total POST script deferrals because of -MaxPost limit (20) or DEFER
      04/06/16 12:03:35 DAG status: 4 (DAG_STATUS_RM)
      04/06/16 12:03:35 Of 6 nodes total:
      04/06/16 12:03:35 Done Pre Queued Post Ready Un-Ready Failed
      04/06/16 12:03:35 === === === === === === ===
      04/06/16 12:03:35 2 0 1 0 0 3 0
      04/06/16 12:03:35 0 job proc(s) currently held
      04/06/16 12:03:35 Wrote metrics file local-hierarchy-0.dag.metrics.
      04/06/16 12:03:35 Reporting metrics to Pegasus metrics server(s); output is in local-hierarchy-0.dag.metrics.out.
      04/06/16 12:03:35 Running command </usr/libexec/condor/condor_dagman_metrics_reporter -f local-hierarchy-0.dag.metrics -s -t 100>
      04/06/16 12:03:35 Warning: mysin has length 0 (ignore if produced by DAGMan; see gittrac #4987, #5031)
      04/06/16 12:03:35 **** condor_scheduniv_exec.857944.0 (condor_DAGMAN) pid 2634796 EXITING WITH STATUS 2
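
      The "Received SIGUSR1" lines in both dagman.out files are consistent with condor_rm delivering the remove signal configured for the dagman jobs. A quick way to confirm (just a sketch; it assumes the generated dagman submit file follows the usual <dagfile>.condor.sub naming in the submit directory):

      grep -i remove_kill_sig local-hierarchy-0.dag.condor.sub
      # expected output: remove_kill_sig = SIGUSR1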

      -bash-4.1$ tail -n 50 ./inner_ID0000002.000/inner-0.dag.dagman.out
      04/06/16 12:03:12 Currently monitoring 1 Condor log file(s)
      04/06/16 12:03:12 Event: ULOG_POST_SCRIPT_TERMINATED for Condor Node create_dir_inner_0_local (857948.0.0)
      04/06/16 12:03:12 POST Script of Node create_dir_inner_0_local completed successfully.
      04/06/16 12:03:12 DAG status: 0 (DAG_STATUS_OK)
      04/06/16 12:03:12 Of 3 nodes total:
      04/06/16 12:03:12 Done Pre Queued Post Ready Un-Ready Failed
      04/06/16 12:03:12 === === === === === === ===
      04/06/16 12:03:12 1 0 0 0 1 1 0
      04/06/16 12:03:12 0 job proc(s) currently held
      04/06/16 12:03:17 Submitting Condor Node sleep_ID0000001 job(s)...
      04/06/16 12:03:17 Adding a DAGMan workflow log /local-scratch/vahi/work/local-hierarchy/work/dags/vahi/pegasus/local-hierarchy/run0006/inner_ID0000002.000/./inner-0.dag.nodes.log
      04/06/16 12:03:17 Masking the events recorded in the DAGMAN workflow log
      04/06/16 12:03:17 Mask for workflow log is 0,1,2,4,5,7,9,10,11,12,13,16,17,24,27
      04/06/16 12:03:17 submitting: /usr/bin/condor_submit -a dag_node_name' '=' 'sleep_ID0000001 -a +DAGManJobId' '=' '857947 -a DAGManJobId' '=' '857947 -a submit_event_notes' '=' 'DAG' 'Node:' 'sleep_ID0000001 -a dagman_log' '=' '/local-scratch/vahi/work/local-hierarchy/work/dags/vahi/pegasus/local-hierarchy/run0006/inner_ID0000002.000/./inner-0.dag.nodes.log -a +DAGManNodesMask' '=' '"0,1,2,4,5,7,9,10,11,12,13,16,17,24,27" -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"create_dir_inner_0_local" -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never sleep_ID0000001.sub
      04/06/16 12:03:17 From submit: Submitting job(s).
      04/06/16 12:03:17 From submit: 1 job(s) submitted to cluster 857949.
      04/06/16 12:03:17 assigned Condor ID (857949.0.0)
      04/06/16 12:03:17 Just submitted 1 job this cycle...
      04/06/16 12:03:17 Currently monitoring 1 Condor log file(s)
      04/06/16 12:03:17 Reassigning the id of job sleep_ID0000001 from (857949.0.0) to (857949.0.0)
      04/06/16 12:03:17 Event: ULOG_SUBMIT for Condor Node sleep_ID0000001 (857949.0.0)
      04/06/16 12:03:17 Number of idle job procs: 1
      04/06/16 12:03:17 DAG status: 0 (DAG_STATUS_OK)
      04/06/16 12:03:17 Of 3 nodes total:
      04/06/16 12:03:17 Done Pre Queued Post Ready Un-Ready Failed
      04/06/16 12:03:17 === === === === === === ===
      04/06/16 12:03:17 1 0 1 0 0 1 0
      04/06/16 12:03:17 0 job proc(s) currently held
      04/06/16 12:03:22 Currently monitoring 1 Condor log file(s)
      04/06/16 12:03:22 Event: ULOG_EXECUTE for Condor Node sleep_ID0000001 (857949.0.0)
      04/06/16 12:03:22 Number of idle job procs: 0
      04/06/16 12:03:35 Received SIGUSR1
      04/06/16 12:03:35 Aborting DAG...
      04/06/16 12:03:35 Writing Rescue DAG to inner-0.dag.rescue001...
      04/06/16 12:03:35 Removing submitted jobs...
      04/06/16 12:03:35 Note: 0 total job deferrals because of -MaxJobs limit (0)
      04/06/16 12:03:35 Note: 0 total job deferrals because of -MaxIdle limit (1000)
      04/06/16 12:03:35 Note: 0 total job deferrals because of node category throttles
      04/06/16 12:03:35 Note: 0 total PRE script deferrals because of -MaxPre limit (20) or DEFER
      04/06/16 12:03:35 Note: 0 total POST script deferrals because of -MaxPost limit (20) or DEFER
      04/06/16 12:03:35 DAG status: 4 (DAG_STATUS_RM)
      04/06/16 12:03:35 Of 3 nodes total:
      04/06/16 12:03:35 Done Pre Queued Post Ready Un-Ready Failed
      04/06/16 12:03:35 === === === === === === ===
      04/06/16 12:03:35 1 0 1 0 0 1 0
      04/06/16 12:03:35 0 job proc(s) currently held
      04/06/16 12:03:35 Wrote metrics file inner-0.dag.metrics.
      04/06/16 12:03:35 Reporting metrics to Pegasus metrics server(s); output is in inner-0.dag.metrics.out.
      04/06/16 12:03:35 Running command </usr/libexec/condor/condor_dagman_metrics_reporter -f inner-0.dag.metrics -s -t 100>
      04/06/16 12:03:35 **** condor_scheduniv_exec.857947.0 (condor_DAGMAN) pid 2661589 EXITING WITH STATUS 2
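
      Note that the inner dagman logs "Removing submitted jobs..." at 12:03:35, yet the condor_q output above still shows 857949.0 (sleep_ID0000001, the pegasus-kickstart job) running after the outer dagman is gone. One way to check whether an abort event was ever recorded for that node job is to grep the shared nodes log for a ULOG_JOB_ABORTED (event code 009) entry; this is only a sketch and assumes the classic user-log text format:

      # a 009 event for cluster 857949 would mean the job was actually aborted/removed
      grep -A 1 '^009 (857949' inner_ID0000002.000/inner-0.dag.nodes.log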

            Assignee: Karan Vahi
            Reporter: Duncan Brown
            Watchers: 1
