pegasus-monitord fails when a job fails because condor_submit fails

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: master, 4.5.0, 4.4.3
    • Affects Version/s: master
    • Component/s: Monitord
    • Labels: None
    • Environment:

      04/06/15 20:05:04 Currently monitoring 1 Condor log file(s)
      04/06/15 20:05:04 Event: ULOG_POST_SCRIPT_TERMINATED for Condor Node create_dir_blackdiamond_0_condorpool (2083632.0.0)
      04/06/15 20:05:04 POST Script of Node create_dir_blackdiamond_0_condorpool completed successfully.
      04/06/15 20:05:04 Of 26 nodes total:
      04/06/15 20:05:04  Done   Pre   Queued   Post   Ready   Un-Ready   Failed
      04/06/15 20:05:04   ===   ===      ===    ===     ===        ===      ===
      04/06/15 20:05:04     1     0        0      0       5         20        0
      04/06/15 20:05:04 0 job proc(s) currently held
      04/06/15 20:05:09 Submitting Condor Node stage_in_remote_condorpool_0_0 job(s)...
      04/06/15 20:05:09 Adding a DAGMan auxiliary log /lfs1/software/bamboo/data/xml-data/build-dir/PEGASUS-WT-T19A/test/core/019-black-label/work/bamboo/pegasus/blackdiamond/20150406T200413-0700/./blackdiamond-0.dag.nodes.log
      04/06/15 20:05:09 Masking the events recorded in the DAGMAN auxiliary log
      04/06/15 20:05:09 Mask for auxiliary log is 0,1,2,4,5,7,9,10,11,12,13,16,17,24,27
      04/06/15 20:05:09 submitting: condor_submit -a dag_node_name' '=' 'stage_in_remote_condorpool_0_0 -a +DAGManJobId' '=' '2083629 -a DAGManJobId' '=' '2083629 -a submit_event_notes' '=' 'DAG' 'Node:' 'stage_in_remote_condorpool_0_0 -a dagman_log' '=' '/lfs1/software/bamboo/data/xml-data/build-dir/PEGASUS-WT-T19A/test/core/019-black-label/work/bamboo/pegasus/blackdiamond/20150406T200413-0700/./blackdiamond-0.dag.nodes.log -a +DAGManNodesMask' '=' '"0,1,2,4,5,7,9,10,11,12,13,16,17,24,27" -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"create_dir_blackdiamond_0_condorpool" -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never stage_in_remote_condorpool_0_0.sub
      04/06/15 20:05:09 From submit: Submitting job(s)
      04/06/15 20:05:09 From submit: ERROR: Can't open "/tmp/x509up_u550" with flags 00 (No such file or directory)
      04/06/15 20:05:09 failed while reading from pipe.
      04/06/15 20:05:09 Read so far: Submitting job(s)ERROR: Can't open "/tmp/x509up_u550" with flags 00 (No such file or directory)
      04/06/15 20:05:09 ERROR: submit attempt failed
      04/06/15 20:05:09 submit command was: condor_submit -a dag_node_name' '=' 'stage_in_remote_condorpool_0_0 -a +DAGManJobId' '=' '2083629 -a DAGManJobId' '=' '2083629 -a submit_event_notes' '=' 'DAG' 'Node:' 'stage_in_remote_condorpool_0_0 -a dagman_log' '=' '/lfs1/software/bamboo/data/xml-data/build-dir/PEGASUS-WT-T19A/test/core/019-black-label/work/bamboo/pegasus/blackdiamond/20150406T200413-0700/./blackdiamond-0.dag.nodes.log -a +DAGManNodesMask' '=' '"0,1,2,4,5,7,9,10,11,12,13,16,17,24,27" -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"create_dir_blackdiamond_0_condorpool" -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never stage_in_remote_condorpool_0_0.sub
      04/06/15 20:05:09 Job submit try 1/6 failed, will try again in >= 1 second.
      04/06/15 20:05:14 Submitting Condor Node stage_in_remote_condorpool_0_0 job(s)...
      04/06/15 20:05:14 Adding a DAGMan auxiliary log /lfs1/software/bamboo/data/xml-data/build-dir/PEGASUS-WT-T19A/test/core/019-black-label/work/bamboo/pegasus/blackdiamond/20150406T200413-0700/./blackdiamond-0.dag.nodes.log
      04/06/15 20:05:14 Masking the events recorded in the DAGMAN auxiliary log
      04/06/15 20:05:14 Mask for auxiliary log is 0,1,2,4,5,7,9,10,11,12,13,16,17,24,27
      04/06/15 20:05:14 submitting: condor_submit -a dag_node_name' '=' 'stage_in_remote_condorpool_0_0 -a +DAGManJobId' '=' '2083629 -a DAGManJobId' '=' '2083629 -a submit_event_notes' '=' 'DAG' 'Node:' 'stage_in_remote_condorpool_0_0 -a dagman_log' '=' '/lfs1/software/bamboo/data/xml-data/build-dir/PEGASUS-WT-T19A/test/core/019-black-label/work/bamboo/pegasus/blackdiamond/20150406T200413-0700/./blackdiamond-0.dag.nodes.log -a +DAGManNodesMask' '=' '"0,1,2,4,5,7,9,10,11,12,13,16,17,24,27" -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"create_dir_blackdiamond_0_condorpool" -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never stage_in_remote_condorpool_0_0.sub
      04/06/15 20:05:14 From submit: Submitting job(s)
      04/06/15 20:05:14 From submit: ERROR: Can't open "/tmp/x509up_u550" with flags 00 (No such file or directory)
      04/06/15 20:05:14 failed while reading from pipe.
      04/06/15 20:05:14 Read so far: Submitting job(s)ERROR: Can't open "/tmp/x509up_u550" with flags 00 (No such file or directory)
      04/06/15 20:05:14 ERROR: submit attempt failed
      04/06/15 20:05:14 submit command was: condor_submit -a dag_node_name' '=' 'stage_in_remote_condorpool_0_0 -a +DAGManJobId' '=' '2083629 -a DAGManJobId' '=' '2083629 -a submit_event_notes' '=' 'DAG' 'Node:' 'stage_in_remote_condorpool_0_0 -a dagman_log' '=' '/lfs1/software/bamboo/data/xml-data/build-dir/PEGASUS-WT-T19A/test/core/019-black-label/work/bamboo/pegasus/blackdiamond/20150406T200413-0700/./blackdiamond-0.dag.nodes.log -a +DAGManNodesMask' '=' '"0,1,2,4,5,7,9,10,11,12,13,16,17,24,27" -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"create_dir_blackdiamond_0_condorpool" -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never stage_in_remote_condorpool_0_0.sub
      04/06/15 20:05:14 Job submit try 2/6 failed, will try again in >= 2 seconds.
      04/06/15 20:05:19 Submitting Condor Node stage_in_remote_condorpool_0_0 job(s)...
      04/06/15 20:05:19 Adding a DAGMan auxiliary log /lfs1/software/bamboo/data/xml-data/build-dir/PEGASUS-WT-T19A/test/core/019-black-label/work/bamboo/pegasus/blackdiamond/20150406T200413-0700/./blackdiamond-0.dag.nodes.log
      04/06/15 20:05:19 Masking the events recorded in the DAGMAN auxiliary log
      04/06/15 20:05:19 Mask for auxiliary log is 0,1,2,4,5,7,9,10,11,12,13,16,17,24,27
      04/06/15 20:05:19 submitting: condor_submit -a dag_node_name' '=' 'stage_in_remote_condorpool_0_0 -a +DAGManJobId' '=' '2083629 -a DAGManJobId' '=' '2083629 -a submit_event_notes' '=' 'DAG' 'Node:' 'stage_in_remote_condorpool_0_0 -a dagman_log' '=' '/lfs1/software/bamboo/data/xml-data/build-dir/PEGASUS-WT-T19A/test/core/019-black-label/work/bamboo/pegasus/blackdiamond/20150406T200413-0700/./blackdiamond-0.dag.nodes.log -a +DAGManNodesMask' '=' '"0,1,2,4,5,7,9,10,11,12,13,16,17,24,27" -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"create_dir_blackdiamond_0_condorpool" -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never stage_in_remote_condorpool_0_0.sub
      04/06/15 20:05:19 From submit: Submitting job(s)
      04/06/15 20:05:19 From submit: ERROR: Can't open "/tmp/x509up_u550" with flags 00 (No such file or directory)
      04/06/15 20:05:19 failed while reading from pipe.
      04/06/15 20:05:19 Read so far: Submitting job(s)ERROR: Can't open "/tmp/x509up_u550" with flags 00 (No such file or directory)
      04/06/15 20:05:19 ERROR: submit attempt failed
      04/06/15 20:05:19 submit command was: condor_submit -a dag_node_name' '=' 'stage_in_remote_condorpool_0_0 -a +DAGManJobId' '=' '2083629 -a DAGManJobId' '=' '2083629 -a submit_event_notes' '=' 'DAG' 'Node:' 'stage_in_remote_condorpool_0_0 -a dagman_log' '=' '/lfs1/software/bamboo/data/xml-data/build-dir/PEGASUS-WT-T19A/test/core/019-black-label/work/bamboo/pegasus/blackdiamond/20150406T200413-0700/./blackdiamond-0.dag.nodes.log -a +DAGManNodesMask' '=' '"0,1,2,4,5,7,9,10,11,12,13,16,17,24,27" -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"create_dir_blackdiamond_0_condorpool" -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never stage_in_remote_condorpool_0_0.sub
      04/06/15 20:05:19 Job submit try 3/6 failed, will try again in >= 4 seconds.
      04/06/15 20:05:24 Submitting Condor Node stage_in_remote_condorpool_0_0 job(s)...
      04/06/15 20:05:24 Adding a DAGMan auxiliary log /lfs1/software/bamboo/data/xml-data/build-dir/PEGASUS-WT-T19A/test/core/019-black-label/work/bamboo/pegasus/blackdiamond/20150406T200413-0700/./blackdiamond-0.dag.nodes.log
      04/06/15 20:05:24 Masking the events recorded in the DAGMAN auxiliary log
      04/06/15 20:05:24 Mask for auxiliary log is 0,1,2,4,5,7,9,10,11,12,13,16,17,24,27
      04/06/15 20:05:24 submitting: condor_submit -a dag_node_name' '=' 'stage_in_remote_condorpool_0_0 -a +DAGManJobId' '=' '2083629 -a DAGManJobId' '=' '2083629 -a submit_event_notes' '=' 'DAG' 'Node:' 'stage_in_remote_condorpool_0_0 -a dagman_log' '=' '/lfs1/software/bamboo/data/xml-data/build-dir/PEGASUS-WT-T19A/test/core/019-black-label/work/bamboo/pegasus/blackdiamond/20150406T200413-0700/./blackdiamond-0.dag.nodes.log -a +DAGManNodesMask' '=' '"0,1,2,4,5,7,9,10,11,12,13,16,17,24,27" -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"create_dir_blackdiamond_0_condorpool" -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never stage_in_remote_condorpool_0_0.sub
      04/06/15 20:05:24 From submit: Submitting job(s)
      04/06/15 20:05:24 From submit: ERROR: Can't open "/tmp/x509up_u550" with flags 00 (No such file or directory)
      04/06/15 20:05:24 failed while reading from pipe.
      04/06/15 20:05:24 Read so far: Submitting job(s)ERROR: Can't open "/tmp/x509up_u550" with flags 00 (No such file or directory)
      04/06/15 20:05:24 ERROR: submit attempt failed
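The log above shows DAGMan retrying condor_submit for the same node, with the wait doubling between attempts (>= 1, 2, then 4 seconds) and the same "From submit: ERROR" text each time. As a rough aid for triaging logs like this one, the following Python sketch pulls out the retry notices and the distinct submit errors. It is only an illustration: the function name `summarize_submit_failures` and both regular expressions are assumptions based on the lines in this excerpt, not code from pegasus-monitord or HTCondor.

```python
import re

# Matches DAGMan's retry notice as it appears in this excerpt, e.g.
# "Job submit try 1/6 failed, will try again in >= 1 second."
RETRY_RE = re.compile(
    r"Job submit try (\d+)/(\d+) failed, will try again in >= (\d+) seconds?\."
)

# Matches error text relayed from condor_submit ("From submit: ERROR: ...").
SUBMIT_ERROR_RE = re.compile(r"From submit: (ERROR: .*)$")


def summarize_submit_failures(lines):
    """Hypothetical helper: collect retry notices and distinct submit errors.

    Returns (retries, errors), where retries is a list of
    (attempt, max_attempts, min_wait_seconds) tuples and errors is a list of
    unique error strings in order of first appearance.
    """
    retries = []
    errors = []
    for line in lines:
        line = line.rstrip("\n")
        match = RETRY_RE.search(line)
        if match:
            retries.append(tuple(int(group) for group in match.groups()))
            continue
        match = SUBMIT_ERROR_RE.search(line)
        if match and match.group(1) not in errors:
            errors.append(match.group(1))
    return retries, errors


# Sample lines copied from the log excerpt above.
sample = [
    '04/06/15 20:05:09 From submit: ERROR: Can\'t open "/tmp/x509up_u550" '
    "with flags 00 (No such file or directory)",
    "04/06/15 20:05:09 Job submit try 1/6 failed, will try again in >= 1 second.",
    '04/06/15 20:05:14 From submit: ERROR: Can\'t open "/tmp/x509up_u550" '
    "with flags 00 (No such file or directory)",
    "04/06/15 20:05:14 Job submit try 2/6 failed, will try again in >= 2 seconds.",
]

retries, errors = summarize_submit_failures(sample)
print(retries)
print(errors)
```

Run against the full log, a summary like this makes it easier to see that every attempt fails on the same missing proxy file (/tmp/x509up_u550) rather than on varying errors.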

            Assignee:
            Karan Vahi
            Reporter:
            Karan Vahi
