Pegasus / PM-1374

make monitord resilient to dagman logging the debug level in dagman.out


    • Type: Improvement
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: master, 5.0.0, 4.9.2
    • Affects Version/s: master, 4.9.1
    • Component/s: Monitord
    • Labels: None

      monitord's parsing of the dagman.out file breaks if dagman logging is enabled and the log level gets recorded in each line of the dagman.out file.

      For example, see the snippet below:
      05/10/19 16:12:03 (D_ALWAYS) Adding a DAGMan workflow log /home/nu_vahi/tutorial/pop/submit/nu_vahi/pegasus/population/run0003/./population-0.dag.nodes.log
      05/10/19 16:12:03 (D_ALWAYS) Masking the events recorded in the DAGMAN workflow log
      05/10/19 16:12:03 (D_ALWAYS) Mask for workflow log is 0,1,2,4,5,7,9,10,11,12,13,16,17,24,27,35,36
      05/10/19 16:12:03 (D_ALWAYS) submitting: /usr/bin/condor_submit -a dag_node_name' '=' 'stage_in_remote_local_1_0 -a +DAGManJobId' '=' '73970 -a DAGManJobId' '=' '73970 -batch-name population-0.dag+73970 -a submit_event_notes' '=' 'DAG' 'Node:' 'stage_in_remote_local_1_0 -a dagman_log' '=' '/home/nu_vahi/tutorial/pop/submit/nu_vahi/pegasus/population/run0003/./population-0.dag.nodes.log -a +DAGManNodesMask' '=' '"0,1,2,4,5,7,9,10,11,12,13,16,17,24,27,35,36" -a priority=700 -a +DAGNodeRetry' '=' '0 -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never -a +DAGParentNodeNames' '=' '"create_dir_population_0_local" 00/00/stage_in_remote_local_1_0.sub
      05/10/19 16:12:03 (D_ALWAYS) From submit: Submitting job(s).
      05/10/19 16:12:03 (D_ALWAYS) From submit: 1 job(s) submitted to cluster 73973.
      05/10/19 16:12:03 (D_ALWAYS) From submit: WARNING: the line 'copy_to_spool = false' was unused by condor_submit. Is it a typo?
      05/10/19 16:12:03 (D_ALWAYS) assigned HTCondor ID (73973.0.0)
      05/10/19 16:12:03 (D_ALWAYS) Submitting HTCondor Node stage_in_remote_local_0_0 job(s)...
      05/10/19 16:12:03 (D_ALWAYS) Adding a DAGMan workflow log /home/nu_vahi/tutorial/pop/submit/nu_vahi/pegasus/population/run0003/./population-0.dag.nodes.log
      05/10/19 16:12:03 (D_ALWAYS) Masking the events recorded in the DAGMAN workflow log
      05/10/19 16:12:03 (D_ALWAYS) Mask for workflow log is 0,1,2,4,5,7,9,10,11,12,13,16,17,24,27,35,36
      05/10/19 16:12:03 (D_ALWAYS) submitting: /usr/bin/condor_submit -a dag_node_name' '=' 'stage_in_remote_local_0_0 -a +DAGManJobId' '=' '73970 -a DAGManJobId' '=' '73970 -batch-name population-0.dag+73970 -a submit_event_notes' '=' 'DAG' 'Node:' 'stage_in_remote_local_0_0 -a dagman_log' '=' '/home/nu_vahi/tutorial/pop/submit/nu_vahi/pegasus/population/run0003/./population-0.dag.nodes.log -a +DAGManNodesMask' '=' '"0,1,2,4,5,7,9,10,11,12,13,16,17,24,27,35,36" -a priority=700 -a +DAGNodeRetry' '=' '0 -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never -a +DAGParentNodeNames' '=' '"create_dir_population_0_local" 00/00/stage_in_remote_local_0_0.sub
      05/10/19 16:12:03 (D_ALWAYS) From submit: Submitting job(s).
      05/10/19 16:12:03 (D_ALWAYS) From submit: 1 job(s) submitted to cluster 73974.
      05/10/19 16:12:03 (D_ALWAYS) From submit: WARNING: the line 'copy_to_spool = false' was unused by condor_submit. Is it a typo?
      05/10/19 16:12:03 (D_ALWAYS) assigned HTCondor ID (73974.0.0)
      05/10/19 16:12:03 (D_ALWAYS) Submitting HTCondor Node stage_in_local_local_0_0 job(s)...
      05/10/19 16:12:03 (D_ALWAYS) Adding a DAGMan workflow log /home/nu_vahi/tutorial/pop/submit/nu_vahi/pegasus/pop

      This causes the invocation and job instance tables to not be populated.
      The parsing regexes should be updated to skip over the log level token when it is present, as sketched below.
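      A minimal sketch of such a tolerant pattern, in Python (the regex, constant, and helper name here are illustrative assumptions, not monitord's actual code):

      import re

      # Illustrative pattern (an assumption, not monitord's real regex):
      # a dagman.out line starts with an MM/DD/YY HH:MM:SS timestamp,
      # optionally followed by a debug-level token such as (D_ALWAYS)
      # when DAGMan records the log level, then the message itself.
      RE_DAGMAN_LINE = re.compile(
          r"^(?P<ts>\d{1,2}/\d{1,2}/\d{2}\s+\d{1,2}:\d{2}:\d{2})"
          r"(?:\s+\((?P<level>D_\w+)\))?"  # optional log level token
          r"\s+(?P<msg>.*)$"
      )

      def split_dagman_line(line):
          """Return (timestamp, message) for a dagman.out line, tolerating
          an optional (D_xxx) log level token; None if the line does not
          match the expected prefix."""
          m = RE_DAGMAN_LINE.match(line)
          if m is None:
              return None
          return m.group("ts"), m.group("msg")

      With this, both of the following calls yield the same message text, so downstream parsing is unaffected by whether the log level is recorded:

      split_dagman_line("05/10/19 16:12:03 (D_ALWAYS) From submit: Submitting job(s).")
      split_dagman_line("05/10/19 16:12:03 From submit: Submitting job(s).")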

        Attachments: pop.tgz (9.80 MB), attached by Karan Vahi

            Assignee:
            Karan Vahi
            Reporter:
            Karan Vahi
            Watchers:
            1

              Created:
              Updated:
              Resolved: