Pegasus / PM-1145

monitord can get stuck on large workflows with sub-daxes


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.8.0
    • Fix Version/s: None
    • Component/s: Monitord
    • Labels: None

      Description

      pegasus-monitord is getting stuck on a large dax with sub-workflows. The monitord process just sits in an endless loop doing stat() on the dagman.out files:

      select(0, NULL, NULL, NULL, {9, 999984}) = 0 (Timeout)
      stat("/usr1/simonisa.selmon/pycbc-tmp.hcEy8AuGsb/work/o1-analysis-2-v1.5.8-main_ID0000001.000/H1L1-INJECTION_MINIFOLLOWUP_BBH02_INJ-1127271617-1027800.000/H1L1-INJECTION_MINIFOLLOWUP_BBH02_INJ-1127271617-1027800.dag.dagman.out", {st_mode=S_IFREG|0644, st_size=1548814, ...}) = 0
      stat("/usr1/simonisa.selmon/pycbc-tmp.hcEy8AuGsb/work/o1-analysis-2-v1.5.8-main_ID0000001.000/H1L1-INJECTION_MINIFOLLOWUP_NSBH01_INJ-1127271617-1027800.000/H1L1-INJECTION_MINIFOLLOWUP_NSBH01_INJ-1127271617-1027800.dag.dagman.out", {st_mode=S_IFREG|0644, st_size=1577922, ...}) = 0
      stat("/usr1/simonisa.selmon/pycbc-tmp.hcEy8AuGsb/work/o1-analysis-2-v1.5.8-main_ID0000001.000/H1L1-INJECTION_MINIFOLLOWUP_NSBH02_INJ-1127271617-1027800.000/H1L1-INJECTION_MINIFOLLOWUP_NSBH02_INJ-1127271617-1027800.dag.dagman.out", {st_mode=S_IFREG|0644, st_size=1574259, ...}) = 0
      stat("/home/simonisa.selmon/.pegasus/workflow.db", {st_mode=S_IFREG|0644, st_size=95232, ...}) = 0
      open("/home/simonisa.selmon/.pegasus/workflow.db", O_RDWR|O_CREAT|O_CLOEXEC, 0644) = 3
      fstat(3, {st_mode=S_IFREG|0644, st_size=95232, ...}) = 0
      fstat(3, {st_mode=S_IFREG|0644, st_size=95232, ...}) = 0
      stat("/home/simonisa.selmon/.pegasus/workflow.db", {st_mode=S_IFREG|0644, st_size=95232, ...}) = 0
      lseek(3, 0, SEEK_SET) = 0
      read(3, "SQLite format 3\0\4\0\1\1\0@ \0\0\1\3\0\0\0]"..., 100) = 100
      fstat(3, {st_mode=S_IFREG|0644, st_size=95232, ...}) = 0
      stat("/home/simonisa.selmon/.pegasus/workflow.db", {st_mode=S_IFREG|0644, st_size=95232, ...}) = 0
      close(3) = 0
      select(0, NULL, NULL, NULL, {9, 997828}
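
      For illustration, the trace above looks like the classic tail/poll pattern sketched below (a minimal, hypothetical Python sketch, not monitord's actual code): stat() every tracked dagman.out roughly every ten seconds and re-read it only when its size changes. If a sub-workflow's completion is never recorded, the loop keeps polling files that will never change again.

      import os
      import time

      def poll_dagman_outs(dagman_outs, interval=10):
          # Last observed st_size for each sub-workflow's dagman.out file.
          sizes = dict.fromkeys(dagman_outs, -1)
          while dagman_outs:  # never empties if a completion is missed
              for path in list(dagman_outs):
                  size = os.stat(path).st_size  # the stat() calls seen in the trace
                  if size == sizes[path]:
                      continue  # file unchanged, keep polling
                  sizes[path] = size
                  # ... parse newly appended lines here; remove path from
                  # dagman_outs once its workflow is seen to finish ...
              time.sleep(interval)  # matches the ~10 s select() timeout in the trace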

      However, dagman is done and there will never be an update to these files:

      11/22/16 17:09:06 Of 255 nodes total:
      11/22/16 17:09:06      Done      Pre   Queued     Post    Ready Un-Ready   Failed
      11/22/16 17:09:06       ===      ===      ===      ===      ===      ===      ===
      11/22/16 17:09:06       255        0        0        0        0        0        0
      11/22/16 17:09:06 0 job proc(s) currently held
      11/22/16 17:09:06 Wrote metrics file H1L1-INJECTION_MINIFOLLOWUP_NSBH01_INJ-1127271617-1027800.dag.metrics.
      11/22/16 17:09:06 Reporting metrics to Pegasus metrics server(s); output is in H1L1-INJECTION_MINIFOLLOWUP_NSBH01_INJ-1127271617-1027800.dag.metrics.out.
      11/22/16 17:09:06 Running command </usr/libexec/condor/condor_dagman_metrics_reporter -f H1L1-INJECTION_MINIFOLLOWUP_NSBH01_INJ-1127271617-1027800.dag.metrics -t 100>
      11/22/16 17:09:06 Warning: mysin has length 0 (ignore if produced by DAGMan; see gittrac #4987, #5031)
      11/22/16 17:09:06 **** condor_scheduniv_exec.2990255.0 (condor_DAGMAN) pid 947281 EXITING WITH STATUS 0
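
      A hedged sketch of what a tailer could key off instead (again not Pegasus code): once dagman.out contains the "EXITING WITH STATUS" line shown above, DAGMan has exited and the file will not grow again, so polling that sub-workflow can stop:

      import re

      # Matches the DAGMan exit line from the log above, e.g.
      # "**** condor_scheduniv_exec.2990255.0 (condor_DAGMAN) pid 947281 EXITING WITH STATUS 0"
      EXIT_RE = re.compile(r"\(condor_DAGMAN\) pid \d+ EXITING WITH STATUS (\d+)")

      def dagman_exit_status(dagman_out):
          """Return DAGMan's exit status if it has already exited, else None."""
          with open(dagman_out) as fh:
              for line in fh:
                  m = EXIT_RE.search(line)
                  if m:
                      return int(m.group(1))
          return None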

      The only solution seems to be to kill monitord and re-run it with --replay to rebuild the database.

    People

    • Assignee: Karan Vahi (vahi)
    • Reporter: Duncan Brown (dbrown)
    • Watchers: 2
