  1. Pegasus
  2. PM-1145

monitord can get stuck on large workflows with sub-daxes


    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Affects Version/s: 4.8.0
    • Component/s: Monitord

      pegasus-monitord is getting stuck on a large DAX with sub-workflows. The monitord process just sits in an endless loop, sleeping in select() and then calling stat() on the dagman.out files:

      select(0, NULL, NULL, NULL, {9, 999984}) = 0 (Timeout)
      stat("/usr1/simonisa.selmon/pycbc-tmp.hcEy8AuGsb/work/o1-analysis-2-v1.5.8-main_ID0000001.000/H1L1-INJECTION_MINIFOLLOWUP_BBH02_INJ-1127271617-1027800.000/H1L1-INJECTION_MINIFOLLOWUP_BBH02_INJ-1127271617-1027800.dag.dagman.out", {st_mode=S_IFREG|0644, st_size=1548814, ...}) = 0
      stat("/usr1/simonisa.selmon/pycbc-tmp.hcEy8AuGsb/work/o1-analysis-2-v1.5.8-main_ID0000001.000/H1L1-INJECTION_MINIFOLLOWUP_NSBH01_INJ-1127271617-1027800.000/H1L1-INJECTION_MINIFOLLOWUP_NSBH01_INJ-1127271617-1027800.dag.dagman.out", {st_mode=S_IFREG|0644, st_size=1577922, ...}) = 0
      stat("/usr1/simonisa.selmon/pycbc-tmp.hcEy8AuGsb/work/o1-analysis-2-v1.5.8-main_ID0000001.000/H1L1-INJECTION_MINIFOLLOWUP_NSBH02_INJ-1127271617-1027800.000/H1L1-INJECTION_MINIFOLLOWUP_NSBH02_INJ-1127271617-1027800.dag.dagman.out", {st_mode=S_IFREG|0644, st_size=1574259, ...}) = 0
      stat("/home/simonisa.selmon/.pegasus/workflow.db", {st_mode=S_IFREG|0644, st_size=95232, ...}) = 0
      open("/home/simonisa.selmon/.pegasus/workflow.db", O_RDWR|O_CREAT|O_CLOEXEC, 0644) = 3
      fstat(3, {st_mode=S_IFREG|0644, st_size=95232, ...}) = 0
      fstat(3, {st_mode=S_IFREG|0644, st_size=95232, ...}) = 0
      stat("/home/simonisa.selmon/.pegasus/workflow.db", {st_mode=S_IFREG|0644, st_size=95232, ...}) = 0
      lseek(3, 0, SEEK_SET) = 0
      read(3, "SQLite format 3\0\4\0\1\1\0@ \0\0\1\3\0\0\0]"..., 100) = 100
      fstat(3, {st_mode=S_IFREG|0644, st_size=95232, ...}) = 0
      stat("/home/simonisa.selmon/.pegasus/workflow.db", {st_mode=S_IFREG|0644, st_size=95232, ...}) = 0
      close(3) = 0
      select(0, NULL, NULL, NULL, {9, 997828}

      However, dagman is done and there will never be an update to these files:

      11/22/16 17:09:06 Of 255 nodes total:
      11/22/16 17:09:06 Done Pre Queued Post Ready Un-Ready Failed
      11/22/16 17:09:06 === === === === === === ===
      11/22/16 17:09:06 255 0 0 0 0 0 0
      11/22/16 17:09:06 0 job proc(s) currently held
      11/22/16 17:09:06 Wrote metrics file H1L1-INJECTION_MINIFOLLOWUP_NSBH01_INJ-1127271617-1027800.dag.metrics.
      11/22/16 17:09:06 Reporting metrics to Pegasus metrics server(s); output is in H1L1-INJECTION_MINIFOLLOWUP_NSBH01_INJ-1127271617-1027800.dag.metrics.out.
      11/22/16 17:09:06 Running command </usr/libexec/condor/condor_dagman_metrics_reporter -f H1L1-INJECTION_MINIFOLLOWUP_NSBH01_INJ-1127271617-1027800.dag.metrics -t 100>
      11/22/16 17:09:06 Warning: mysin has length 0 (ignore if produced by DAGMan; see gittrac #4987, #5031)
      11/22/16 17:09:06 **** condor_scheduniv_exec.2990255.0 (condor_DAGMAN) pid 947281 EXITING WITH STATUS 0

      The only workaround seems to be to kill monitord and rerun it with --replay to rebuild the database.
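
      The observation above (DAGMan has already exited, so the dagman.out files will never change again) can be checked mechanically. The following is a minimal sketch, not part of Pegasus or monitord: the function names `dagman_exit_status` and `finished_dagman_logs` are hypothetical, and the regular expression is only assumed from the "EXITING WITH STATUS" line shown in the log excerpt above.

```python
import re
from pathlib import Path

# Assumed format of DAGMan's final log line, taken from the excerpt above, e.g.
# "**** condor_scheduniv_exec.2990255.0 (condor_DAGMAN) pid 947281 EXITING WITH STATUS 0"
EXIT_RE = re.compile(r"\(condor_DAGMAN\) pid \d+ EXITING WITH STATUS (-?\d+)")


def dagman_exit_status(path, tail_bytes=65536):
    """Return the exit status found near the end of a dagman.out file.

    Returns None when no exit line is present in the last `tail_bytes`
    bytes, i.e. DAGMan may still be running.
    """
    with open(path, "rb") as fh:
        fh.seek(0, 2)                      # jump to end of file
        size = fh.tell()
        fh.seek(max(0, size - tail_bytes)) # read only the tail
        tail = fh.read().decode("utf-8", errors="replace")
    matches = EXIT_RE.findall(tail)
    # Use the last match in case the file holds more than one DAGMan run.
    return int(matches[-1]) if matches else None


def finished_dagman_logs(workdir):
    """Map every *.dag.dagman.out under workdir to its exit status (or None)."""
    logs = sorted(Path(workdir).rglob("*.dag.dagman.out"))
    return {str(p): dagman_exit_status(p) for p in logs}
```

      Running `finished_dagman_logs()` on the workflow's work directory while monitord is still polling would show whether every sub-workflow log already carries an exit line, which is the situation described in this report.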

            Assignee: Karan Vahi (vahi)
            Reporter: Duncan Brown (dbrown)