monitord hangs around for too long after DAGMan finished


    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • 3.0
    • Affects Version/s: master
    • Component/s: Monitord
    • None

      I ran a FG periodogram workflow, which consists entirely of "vanilla" jobs, but most resources are remote (Condor-I/O). Long after the workflow has finished, this process is still running:

      26025 ? S 0:44 python /home/voeckler/src/svn/pegasus/trunk/bin/pegasus-monitord periodogram-0.dag.dagman.out

      which, according to "strace -p 26025", is doing nothing but sleeping for 100 ms at a time:

      select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout)
      select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout)

      Here are some files:

      $ cat monitord.log
      Exception in thread Thread-1:
      Traceback (most recent call last):
        File "/usr/lib64/python2.4/threading.py", line 442, in __bootstrap
          self.run()
        File "/home/voeckler/src/svn/pegasus/trunk/lib/python/netlogger/analysis/modules/_base.py", line 282, in run
          self.queue.task_done()
      AttributeError: Queue instance has no attribute 'task_done'

      The "monitord.done" file was written, but monitord is still there! Maybe something is wrong with your thread handling? Or maybe your final condition didn't match properly:

      $ tail periodogram-0.dag.dagman.out
      11/04/10 19:32:37 1599 0 0 0 0 0 0
      11/04/10 19:32:37 0 job proc(s) currently held
      11/04/10 19:32:37 Note: 176726 total job deferrals because of -MaxIdle limit (100)
      11/04/10 19:32:37 All jobs Completed!
      11/04/10 19:32:37 Note: 0 total job deferrals because of -MaxJobs limit (0)
      11/04/10 19:32:37 Note: 176726 total job deferrals because of -MaxIdle limit (100)
      11/04/10 19:32:37 Note: 0 total job deferrals because of node category throttles
      11/04/10 19:32:37 Note: 0 total PRE script deferrals because of -MaxPre limit (20)
      11/04/10 19:32:37 Note: 0 total POST script deferrals because of -MaxPost limit (100)
      11/04/10 19:32:37 **** condor_scheduniv_exec.12.0 (condor_DAGMAN) pid 26022 EXITING WITH STATUS 0
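
To test the second hypothesis in isolation, here is a small sketch that checks whether a termination line like the last one in the tail above is recognized. The regular expression and the function name `dagman_exit_status` are assumptions modelled on that single line. They are not the pattern pegasus-monitord actually uses.

```python
import re

# Assumed pattern, derived only from the final line of the dagman.out tail.
EXIT_RE = re.compile(
    r"\*\*\*\* \S+ \(condor_DAGMAN\) pid (?P<pid>\d+) "
    r"EXITING WITH STATUS (?P<status>-?\d+)"
)


def dagman_exit_status(line):
    """Return DAGMan's exit status as an int, or None if the line is not an exit line."""
    match = EXIT_RE.search(line)
    if match is None:
        return None
    return int(match.group("status"))


if __name__ == "__main__":
    last_line = ("11/04/10 19:32:37 **** condor_scheduniv_exec.12.0 "
                 "(condor_DAGMAN) pid 26022 EXITING WITH STATUS 0")
    print(dagman_exit_status(last_line))
```

If a check like this returns a status for the real log line, the end-of-workflow condition is probably being detected and the thread handling is the more likely culprit.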

            Assignee: Unassigned
            Reporter: Jens Voeckler