Pegasus / PM-1334

pegasus dagman is not exiting cleanly


    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: master, 5.0.0, 4.9.1
    • Affects Version/s: 4.9.1
    • Component/s: Monitord
    • Labels: None

      I'm seeing a problem with monitord not exiting cleanly after the condor_dagman process has exited. This has happened a few times, so it's not a one-off. An example is currently job 16715729.0 on sugwg-osg.phy.syr.edu. The dagman process has exited:

      11/28/18 12:24:25 **** condor_scheduniv_exec.16715729.0 (condor_DAGMAN) pid 2122725 EXITING WITH STATUS 1

      However, monitord has not exited:

      [dbrown@sugwg-osg ~]$ ps wwwwaux | grep o1-analysis-3-v1_13_0-LOSC_16_V1-0.dag.dagman.out
      dbrown 2251055 10.4 0.2 494072 245656 ? S 11:50 17:39 /usr/bin/python2.7 /usr/bin/pegasus-monitord -N o1-analysis-3-v1_13_0-LOSC_16_V1-0.dag.dagman.out

      and so pegasus-dagman has not exited:

      dbrown 2122719 0.0 0.0 199784 11668 ? Ss 09:57 0:00 /usr/bin/python2.7 /usr/bin/pegasus-dagman -p 0 -f -l . -Lockfile o1-analysis-3-v1_13_0-LOSC_16_V1-0.dag.lock -AutoRescue 1 -DoRescueFrom 0 -Dag o1-analysis-3-v1_13_0-LOSC_16_V1-0.dag -MaxPre 1 -MaxPost 20 -Suppress_notification -CsdVersion $CondorVersion: 8.6.7 Oct 29 2017 BuildID: 422776 $ -Dagman /bin/condor_dagman

      I'll leave this job in the queue so Karan can investigate.
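
The manual check above (DAGMan has logged its exit, yet a pegasus-monitord process for the same dagman.out file is still running) could be scripted roughly as follows. This is only a sketch: the function names `dagman_exit_status` and `find_lingering_monitord` are hypothetical and not part of Pegasus, the exit-line pattern is inferred from the single log line quoted in this report, and the process listing is assumed to be in `ps aux` column layout.

```python
import re

# Pattern inferred from the one dagman.out line quoted in this report (an assumption,
# not a documented format guarantee).
EXIT_RE = re.compile(r"\(condor_DAGMAN\) pid (\d+) EXITING WITH STATUS (-?\d+)")


def dagman_exit_status(dagman_out_text):
    """Return the status from the last EXITING line, or None if none was logged."""
    matches = EXIT_RE.findall(dagman_out_text)
    return int(matches[-1][1]) if matches else None


def find_lingering_monitord(ps_text, dagman_out_name):
    """Return PIDs of pegasus-monitord processes whose arguments name dagman_out_name."""
    pids = []
    for line in ps_text.splitlines():
        fields = line.split()
        # `ps aux` columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND...
        if len(fields) < 11:
            continue
        command = fields[10:]
        is_monitord = any(arg.endswith("pegasus-monitord") for arg in command)
        if is_monitord and dagman_out_name in command:
            pids.append(int(fields[1]))
    return pids


if __name__ == "__main__":
    # Sample input copied from the lines quoted in this report.
    name = "o1-analysis-3-v1_13_0-LOSC_16_V1-0.dag.dagman.out"
    log = ("11/28/18 12:24:25 **** condor_scheduniv_exec.16715729.0 "
           "(condor_DAGMAN) pid 2122725 EXITING WITH STATUS 1")
    ps = ("dbrown 2251055 10.4 0.2 494072 245656 ? S 11:50 17:39 "
          "/usr/bin/python2.7 /usr/bin/pegasus-monitord -N " + name)
    if dagman_exit_status(log) is not None:
        print("monitord still running:", find_lingering_monitord(ps, name))
```

On a live system the two text inputs would come from reading the dagman.out file and from the output of `ps wwwwaux`; the sketch only reports the stuck process and does not attempt to stop it.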

            Assignee:
            vahi Karan Vahi
            Reporter:
            dbrown Duncan Brown
            Watchers:
            2
