Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: master, 5.0.0, 4.9.1
Affects Version/s: 4.9.1
Component/s: Monitord
Labels: None

I'm seeing a problem with monitord not exiting cleanly after the condor_dagman process has exited. This has happened a few times, so it's not a one-off. A current example is job 16715729.0 on sugwg-osg.phy.syr.edu. The dagman process has exited:

11/28/18 12:24:25 **** condor_scheduniv_exec.16715729.0 (condor_DAGMAN) pid 2122725 EXITING WITH STATUS 1

However, monitord has not exited:

[dbrown@sugwg-osg ~]$ ps wwwwaux | grep o1-analysis-3-v1_13_0-LOSC_16_V1-0.dag.dagman.out
dbrown 2251055 10.4 0.2 494072 245656 ? S 11:50 17:39 /usr/bin/python2.7 /usr/bin/pegasus-monitord -N o1-analysis-3-v1_13_0-LOSC_16_V1-0.dag.dagman.out

and so pegasus-dagman has not exited:

dbrown 2122719 0.0 0.0 199784 11668 ? Ss 09:57 0:00 /usr/bin/python2.7 /usr/bin/pegasus-dagman -p 0 -f -l . -Lockfile o1-analysis-3-v1_13_0-LOSC_16_V1-0.dag.lock -AutoRescue 1 -DoRescueFrom 0 -Dag o1-analysis-3-v1_13_0-LOSC_16_V1-0.dag -MaxPre 1 -MaxPost 20 -Suppress_notification -CsdVersion $CondorVersion: 8.6.7 Oct 29 2017 BuildID: 422776 $ -Dagman /bin/condor_dagman

I'll leave this job in the queue so Karan can investigate.
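
For anyone trying to spot the same condition on another workflow, here is a minimal sketch that scans dagman.out lines for the exit record quoted above. The regular expression is modeled only on that single line, so other HTCondor versions may format it differently, and the names `EXIT_LINE` and `find_dagman_exit` are hypothetical helpers, not part of Pegasus or HTCondor. If it returns a result while `ps` still shows pegasus-monitord running against the same dagman.out file, the workflow is likely in the state described here.

```python
import re

# Pattern modeled on the one dagman.out line quoted in this report (assumption:
# other HTCondor versions may word or space this line differently).
EXIT_LINE = re.compile(
    r"\(condor_DAGMAN\) pid (?P<pid>\d+) EXITING WITH STATUS (?P<status>-?\d+)"
)


def find_dagman_exit(lines):
    """Return (pid, status) from the last matching exit line, or None."""
    result = None
    for line in lines:
        match = EXIT_LINE.search(line)
        if match:
            result = (int(match.group("pid")), int(match.group("status")))
    return result


# The exit line from this report, used as sample input.
sample = [
    "11/28/18 12:24:25 **** condor_scheduniv_exec.16715729.0 "
    "(condor_DAGMAN) pid 2122725 EXITING WITH STATUS 1",
]
print(find_dagman_exit(sample))
```

In practice the lines would come from reading the workflow's dagman.out file rather than from an inline sample.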

Assignee: Karan Vahi
Reporter: Duncan Brown
Watchers: 2

Created: 28/Nov/18 11:41 AM
Updated: 05/Dec/18 11:31 AM
Resolved: 05/Dec/18 11:31 AM
