I ran an FG periodogram workflow, consisting entirely of "vanilla" universe jobs, but with most resources remote (Condor-I/O). After the workflow is long gone, there's still:
26025 ? S 0:44 python /home/voeckler/src/svn/pegasus/trunk/bin/pegasus-monitord periodogram-0.dag.dagman.out
which, according to "strace -p 26025", is doing nothing but sleep in 100 ms intervals:
select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout)
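For what it's worth, that trace is exactly what a plain polling loop looks like from the outside: CPython 2.x implements time.sleep() on Unix via select() with a timeout, so each sleep(0.1) shows up as select(0, NULL, NULL, NULL, {0, 100000}). A minimal sketch of such a loop (my guess at the shape of monitord's wait loop, not the actual code):

import time

finished = False        # hypothetical flag that is never set once the worker dies
while not finished:
    time.sleep(0.1)     # appears in strace as select(0, NULL, NULL, NULL, {0, 100000})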
Here are some files:
$ cat monitord.log
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib64/python2.4/threading.py", line 442, in __bootstrap
self.run()
File "/home/voeckler/src/svn/pegasus/trunk/lib/python/netlogger/analysis/modules/_base.py", line 282, in run
self.queue.task_done()
AttributeError: Queue instance has no attribute 'task_done'
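Note that Queue.Queue.task_done() (and its counterpart Queue.join()) only exist from Python 2.5 onward, and the traceback shows the daemon running under /usr/lib64/python2.4, so this call can never succeed there. A sketch of a backwards-compatible guard (my suggestion, not what _base.py currently does):

import Queue

def mark_task_done(q):
    # Queue.task_done() was added in Python 2.5; skip it on older
    # interpreters instead of letting the consumer thread die.
    if hasattr(q, 'task_done'):
        q.task_done()

q = Queue.Queue()
q.put("event")
q.get()
mark_task_done(q)       # safe on both 2.4 and 2.5+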
The "monitord.done" file was written, but it is still there! Maybe something wrong with your thread handling? Or maybe you final condition didn't match properly:
$ tail periodogram-0.dag.dagman.out
11/04/10 19:32:37 1599 0 0 0 0 0 0
11/04/10 19:32:37 0 job proc(s) currently held
11/04/10 19:32:37 Note: 176726 total job deferrals because of -MaxIdle limit (100)
11/04/10 19:32:37 All jobs Completed!
11/04/10 19:32:37 Note: 0 total job deferrals because of -MaxJobs limit (0)
11/04/10 19:32:37 Note: 176726 total job deferrals because of -MaxIdle limit (100)
11/04/10 19:32:37 Note: 0 total job deferrals because of node category throttles
11/04/10 19:32:37 Note: 0 total PRE script deferrals because of -MaxPre limit (20)
11/04/10 19:32:37 Note: 0 total POST script deferrals because of -MaxPost limit (100)
11/04/10 19:32:37 **** condor_scheduniv_exec.12.0 (condor_DAGMAN) pid 26022 EXITING WITH STATUS 0
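So DAGMan itself exited cleanly with status 0; the leftover process is purely monitord failing to shut down. A hypothetical reconstruction of the race (illustrative names only, not the real monitord internals): the consumer thread dies on its first task_done() call under 2.4, the queue never drains, and the main thread keeps polling after having written the done file.

import Queue
import threading
import time

q = Queue.Queue()
for event in ["event-1", "event-2", "event-3"]:
    q.put(event)

def consumer():
    while True:
        item = q.get()
        # ... handle item ...
        q.task_done()            # AttributeError under Python 2.4; Thread-1 dies here

t = threading.Thread(target=consumer)
t.setDaemon(True)                # daemon, so the sketch exits where the bug is absent
t.start()

open("monitord.done", "w").close()   # the done file gets written...
while not q.empty():                 # ...but the queue never drains,
    time.sleep(0.1)                  # so this loop spins forever (the strace above)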