Pegasus / PM-1374

Make monitord resilient to DAGMan logging the debug level in dagman.out


    • Type: Improvement
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: master, 5.0.0, 4.9.2
    • Affects Version/s: master, 4.9.1
    • Component/s: Monitord
    • Labels: None

      monitord's parsing of the dagman.out file breaks if DAGMan debug logging is enabled and the log level is recorded on each line of dagman.out.

      For example, see the snippet below:
      05/10/19 16:12:03 (D_ALWAYS) Adding a DAGMan workflow log /home/nu_vahi/tutorial/pop/submit/nu_vahi/pegasus/population/run0003/./population-0.dag.nodes.log
      05/10/19 16:12:03 (D_ALWAYS) Masking the events recorded in the DAGMAN workflow log
      05/10/19 16:12:03 (D_ALWAYS) Mask for workflow log is 0,1,2,4,5,7,9,10,11,12,13,16,17,24,27,35,36
      05/10/19 16:12:03 (D_ALWAYS) submitting: /usr/bin/condor_submit -a dag_node_name' '=' 'stage_in_remote_local_1_0 -a +DAGManJobId' '=' '73970 -a DAGManJobId' '=' '73970 -batch-name population-0.dag+73970 -a submit_event_notes' '=' 'DAG' 'Node:' 'stage_in_remote_local_1_0 -a dagman_log' '=' '/home/nu_vahi/tutorial/pop/submit/nu_vahi/pegasus/population/run0003/./population-0.dag.nodes.log -a +DAGManNodesMask' '=' '"0,1,2,4,5,7,9,10,11,12,13,16,17,24,27,35,36" -a priority=700 -a +DAGNodeRetry' '=' '0 -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never -a +DAGParentNodeNames' '=' '"create_dir_population_0_local" 00/00/stage_in_remote_local_1_0.sub
      05/10/19 16:12:03 (D_ALWAYS) From submit: Submitting job(s).
      05/10/19 16:12:03 (D_ALWAYS) From submit: 1 job(s) submitted to cluster 73973.
      05/10/19 16:12:03 (D_ALWAYS) From submit: WARNING: the line 'copy_to_spool = false' was unused by condor_submit. Is it a typo?
      05/10/19 16:12:03 (D_ALWAYS) assigned HTCondor ID (73973.0.0)
      05/10/19 16:12:03 (D_ALWAYS) Submitting HTCondor Node stage_in_remote_local_0_0 job(s)...
      05/10/19 16:12:03 (D_ALWAYS) Adding a DAGMan workflow log /home/nu_vahi/tutorial/pop/submit/nu_vahi/pegasus/population/run0003/./population-0.dag.nodes.log
      05/10/19 16:12:03 (D_ALWAYS) Masking the events recorded in the DAGMAN workflow log
      05/10/19 16:12:03 (D_ALWAYS) Mask for workflow log is 0,1,2,4,5,7,9,10,11,12,13,16,17,24,27,35,36
      05/10/19 16:12:03 (D_ALWAYS) submitting: /usr/bin/condor_submit -a dag_node_name' '=' 'stage_in_remote_local_0_0 -a +DAGManJobId' '=' '73970 -a DAGManJobId' '=' '73970 -batch-name population-0.dag+73970 -a submit_event_notes' '=' 'DAG' 'Node:' 'stage_in_remote_local_0_0 -a dagman_log' '=' '/home/nu_vahi/tutorial/pop/submit/nu_vahi/pegasus/population/run0003/./population-0.dag.nodes.log -a +DAGManNodesMask' '=' '"0,1,2,4,5,7,9,10,11,12,13,16,17,24,27,35,36" -a priority=700 -a +DAGNodeRetry' '=' '0 -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never -a +DAGParentNodeNames' '=' '"create_dir_population_0_local" 00/00/stage_in_remote_local_0_0.sub
      05/10/19 16:12:03 (D_ALWAYS) From submit: Submitting job(s).
      05/10/19 16:12:03 (D_ALWAYS) From submit: 1 job(s) submitted to cluster 73974.
      05/10/19 16:12:03 (D_ALWAYS) From submit: WARNING: the line 'copy_to_spool = false' was unused by condor_submit. Is it a typo?
      05/10/19 16:12:03 (D_ALWAYS) assigned HTCondor ID (73974.0.0)
      05/10/19 16:12:03 (D_ALWAYS) Submitting HTCondor Node stage_in_local_local_0_0 job(s)...
      05/10/19 16:12:03 (D_ALWAYS) Adding a DAGMan workflow log /home/nu_vahi/tutorial/pop/submit/nu_vahi/pegasus/pop

      This causes the invocation and job instance tables not to be populated.
      The parsing regexes should be updated to ignore the log level when it is present in a line.
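
      A minimal sketch of the kind of change described above, not monitord's actual code. The pattern and the helper name split_dagman_line are hypothetical, and the sketch assumes the level always appears as a single parenthesized token starting with "D_" right after the timestamp, as in the snippet. The level group is optional, so lines written without it still parse.

```python
import re

# Hypothetical pattern (not taken from monitord): a timestamp, an optional
# "(D_...)" debug-level token, then the rest of the line as the message.
DAGMAN_LINE = re.compile(
    r"^(?P<timestamp>\d{1,2}/\d{1,2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?:\((?P<level>D_[^)\s]+)\) )?"
    r"(?P<message>.*)$"
)


def split_dagman_line(line):
    """Return (timestamp, level, message), or None if the line does not match.

    level is None when the line carries no debug-level token.
    """
    match = DAGMAN_LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    return match.group("timestamp"), match.group("level"), match.group("message")


if __name__ == "__main__":
    with_level = "05/10/19 16:12:03 (D_ALWAYS) assigned HTCondor ID (73973.0.0)"
    without_level = "05/10/19 16:12:03 assigned HTCondor ID (73973.0.0)"
    print(split_dagman_line(with_level))
    print(split_dagman_line(without_level))
```

      Both example lines yield the same message text, so any downstream matching on the message would not need to know whether the level was logged.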

            Assignee:
            vahi Karan Vahi
            Reporter:
            vahi Karan Vahi
