make monitord resilient to dagman logging the debug level in dagman.out


    • Type: Improvement
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: master, 5.0.0, 4.9.2
    • Affects Version/s: master, 4.9.1
    • Component/s: Monitord
    • Labels: None

      monitord parsing of the dagman.out file breaks if dagman logging is enabled and the log level gets recorded in the dagman.out file.

      For example, see the snippet below:
      5/10/19 16:12:03 (D_ALWAYS) Adding a DAGMan workflow log /home/nu_vahi/tutorial/pop/submit/nu_vahi/pegasus/population/run0003/./population-0.dag.nodes.log
      05/10/19 16:12:03 (D_ALWAYS) Masking the events recorded in the DAGMAN workflow log
      05/10/19 16:12:03 (D_ALWAYS) Mask for workflow log is 0,1,2,4,5,7,9,10,11,12,13,16,17,24,27,35,36
      05/10/19 16:12:03 (D_ALWAYS) submitting: /usr/bin/condor_submit -a dag_node_name' '=' 'stage_in_remote_local_1_0 -a +DAGManJobId' '=' '73970 -a DAGManJobId' '=' '73970 -batch-name population-0.dag+73970 -a submit_event_notes' '=' 'DAG' 'Node:' 'stage_in_remote_local_1_0 -a dagman_log' '=' '/home/nu_vahi/tutorial/pop/submit/nu_vahi/pegasus/population/run0003/./population-0.dag.nodes.log -a +DAGManNodesMask' '=' '"0,1,2,4,5,7,9,10,11,12,13,16,17,24,27,35,36" -a priority=700 -a +DAGNodeRetry' '=' '0 -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never -a +DAGParentNodeNames' '=' '"create_dir_population_0_local" 00/00/stage_in_remote_local_1_0.sub
      05/10/19 16:12:03 (D_ALWAYS) From submit: Submitting job(s).
      05/10/19 16:12:03 (D_ALWAYS) From submit: 1 job(s) submitted to cluster 73973.
      05/10/19 16:12:03 (D_ALWAYS) From submit: WARNING: the line 'copy_to_spool = false' was unused by condor_submit. Is it a typo?
      05/10/19 16:12:03 (D_ALWAYS) assigned HTCondor ID (73973.0.0)
      05/10/19 16:12:03 (D_ALWAYS) Submitting HTCondor Node stage_in_remote_local_0_0 job(s)...
      05/10/19 16:12:03 (D_ALWAYS) Adding a DAGMan workflow log /home/nu_vahi/tutorial/pop/submit/nu_vahi/pegasus/population/run0003/./population-0.dag.nodes.log
      05/10/19 16:12:03 (D_ALWAYS) Masking the events recorded in the DAGMAN workflow log
      05/10/19 16:12:03 (D_ALWAYS) Mask for workflow log is 0,1,2,4,5,7,9,10,11,12,13,16,17,24,27,35,36
      05/10/19 16:12:03 (D_ALWAYS) submitting: /usr/bin/condor_submit -a dag_node_name' '=' 'stage_in_remote_local_0_0 -a +DAGManJobId' '=' '73970 -a DAGManJobId' '=' '73970 -batch-name population-0.dag+73970 -a submit_event_notes' '=' 'DAG' 'Node:' 'stage_in_remote_local_0_0 -a dagman_log' '=' '/home/nu_vahi/tutorial/pop/submit/nu_vahi/pegasus/population/run0003/./population-0.dag.nodes.log -a +DAGManNodesMask' '=' '"0,1,2,4,5,7,9,10,11,12,13,16,17,24,27,35,36" -a priority=700 -a +DAGNodeRetry' '=' '0 -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never -a +DAGParentNodeNames' '=' '"create_dir_population_0_local" 00/00/stage_in_remote_local_0_0.sub
      05/10/19 16:12:03 (D_ALWAYS) From submit: Submitting job(s).
      05/10/19 16:12:03 (D_ALWAYS) From submit: 1 job(s) submitted to cluster 73974.
      05/10/19 16:12:03 (D_ALWAYS) From submit: WARNING: the line 'copy_to_spool = false' was unused by condor_submit. Is it a typo?
      05/10/19 16:12:03 (D_ALWAYS) assigned HTCondor ID (73974.0.0)
      05/10/19 16:12:03 (D_ALWAYS) Submitting HTCondor Node stage_in_local_local_0_0 job(s)...
      05/10/19 16:12:03 (D_ALWAYS) Adding a DAGMan workflow log /home/nu_vahi/tutorial/pop/submit/nu_vahi/pegasus/pop

      This causes the invocation and job instance tables to not be populated.
      The parsing regexes should be updated to ignore the log level when it is present, as sketched below.
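
      A minimal sketch of the idea, assuming a Python parser for dagman.out (the pattern and function names here are illustrative, not the actual monitord code): the timestamp regex gains an optional group that consumes the "(D_ALWAYS)"-style debug category if present, so the same pattern matches lines with and without the log level.

      import re

      # Hypothetical prefix pattern for dagman.out lines; the real monitord
      # regexes may differ. The optional "(D_...)" group swallows the HTCondor
      # debug category when DAGMan is run with extra logging enabled.
      RE_DAGMAN_PREFIX = re.compile(
          r"^\s*"
          r"(?P<timestamp>\d{1,2}/\d{1,2}/\d{2,4}\s+\d{1,2}:\d{2}:\d{2})"  # e.g. 05/10/19 16:12:03
          r"(?:\s+\((?P<level>D_\w+(?:\|D_\w+)*)\))?"                      # optional (D_ALWAYS), (D_ALWAYS|D_VERBOSE), ...
          r"\s+(?P<message>.*)$"
      )

      def split_dagman_line(line):
          """Return (timestamp, message) from a dagman.out line, discarding
          the debug level if one was recorded."""
          m = RE_DAGMAN_PREFIX.match(line)
          if m is None:
              return None
          return m.group("timestamp"), m.group("message")

      if __name__ == "__main__":
          with_level = "05/10/19 16:12:03 (D_ALWAYS) assigned HTCondor ID (73973.0.0)"
          without_level = "05/10/19 16:12:03 assigned HTCondor ID (73973.0.0)"
          print(split_dagman_line(with_level))     # ('05/10/19 16:12:03', 'assigned HTCondor ID (73973.0.0)')
          print(split_dagman_line(without_level))  # same message, no level to strip

      Both forms of the line yield the same (timestamp, message) pair, so the downstream event matching that populates the invocation and job instance tables is unaffected by the extra debug token.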

        1. pop.tgz (9.80 MB, attached by Karan Vahi)

            Assignee:
            Karan Vahi
            Reporter:
            Karan Vahi
            Archiver:
            Rajiv Mayani
