- Type: Bug
- Resolution: Cannot Reproduce
- Priority: Major
- Affects Version/s: 4.8.0
- Component/s: Monitord
pegasus-monitord gets stuck on a large DAX with sub-workflows. The monitord process just sits in an endless loop calling stat() on the dagman.out files:
select(0, NULL, NULL, NULL, {9, 999984}) = 0 (Timeout)
stat("/usr1/simonisa.selmon/pycbc-tmp.hcEy8AuGsb/work/o1-analysis-2-v1.5.8-main_ID0000001.000/H1L1-INJECTION_MINIFOLLOWUP_BBH02_INJ-1127271617-1027800.000/H1L1-INJECTION_MINIFOLLOWUP_BBH02_INJ-1127271617-1027800.dag.dagman.out",
) = 0
stat("/usr1/simonisa.selmon/pycbc-tmp.hcEy8AuGsb/work/o1-analysis-2-v1.5.8-main_ID0000001.000/H1L1-INJECTION_MINIFOLLOWUP_NSBH01_INJ-1127271617-1027800.000/H1L1-INJECTION_MINIFOLLOWUP_NSBH01_INJ-1127271617-1027800.dag.dagman.out",
) = 0
stat("/usr1/simonisa.selmon/pycbc-tmp.hcEy8AuGsb/work/o1-analysis-2-v1.5.8-main_ID0000001.000/H1L1-INJECTION_MINIFOLLOWUP_NSBH02_INJ-1127271617-1027800.000/H1L1-INJECTION_MINIFOLLOWUP_NSBH02_INJ-1127271617-1027800.dag.dagman.out",
) = 0
stat("/home/simonisa.selmon/.pegasus/workflow.db",
open("/home/simonisa.selmon/.pegasus/workflow.db", O_RDWR|O_CREAT|O_CLOEXEC, 0644) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=95232, ...}
) = 0
fstat(3,
stat("/home/simonisa.selmon/.pegasus/workflow.db", {st_mode=S_IFREG|0644, st_size=95232, ...}
) = 0
lseek(3, 0, SEEK_SET) = 0
read(3, "SQLite format 3\0\4\0\1\1\0@ \0\0\1\3\0\0\0]"..., 100) = 100
fstat(3,
stat("/home/simonisa.selmon/.pegasus/workflow.db", {st_mode=S_IFREG|0644, st_size=95232, ...}
) = 0
close(3) = 0
select(0, NULL, NULL, NULL,
However, dagman is done and there will never be an update to these files:
11/22/16 17:09:06 Of 255 nodes total:
11/22/16 17:09:06  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
11/22/16 17:09:06   ===     ===      ===     ===     ===        ===      ===
11/22/16 17:09:06   255       0        0       0       0          0        0
11/22/16 17:09:06 0 job proc(s) currently held
11/22/16 17:09:06 Wrote metrics file H1L1-INJECTION_MINIFOLLOWUP_NSBH01_INJ-1127271617-1027800.dag.metrics.
11/22/16 17:09:06 Reporting metrics to Pegasus metrics server(s); output is in H1L1-INJECTION_MINIFOLLOWUP_NSBH01_INJ-1127271617-1027800.dag.metrics.out.
11/22/16 17:09:06 Running command </usr/libexec/condor/condor_dagman_metrics_reporter -f H1L1-INJECTION_MINIFOLLOWUP_NSBH01_INJ-1127271617-1027800.dag.metrics -t 100>
11/22/16 17:09:06 Warning: mysin has length 0 (ignore if produced by DAGMan; see gittrac #4987, #5031)
11/22/16 17:09:06 **** condor_scheduniv_exec.2990255.0 (condor_DAGMAN) pid 947281 EXITING WITH STATUS 0
The only solution seems to be to kill monitord and rerun it with --replay to rebuild the database.
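To help confirm that a given dagman.out file really is final, the termination line shown in the excerpt above can be checked mechanically. The following is a minimal sketch, not monitord's actual logic: the function name `dagman_exit_status` and the regular expression are assumptions inferred from that single log line, and the check only reports an exit if the termination line is the last non-blank line (so a later DAGMan run appended to the same file is not mistaken for a finished one).

```python
import re

# Pattern inferred from the one termination line in the excerpt above;
# it is an assumption, not taken from a DAGMan specification.
EXIT_RE = re.compile(r"\(condor_DAGMAN\) pid \d+ EXITING WITH STATUS (-?\d+)")


def dagman_exit_status(lines):
    """Return DAGMan's exit status if the last non-blank line is a
    termination line, otherwise None (hypothetical helper)."""
    last = ""
    for line in lines:
        if line.strip():
            last = line
    match = EXIT_RE.search(last)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    sample = [
        "11/22/16 17:09:06 0 job proc(s) currently held",
        "11/22/16 17:09:06 **** condor_scheduniv_exec.2990255.0 "
        "(condor_DAGMAN) pid 947281 EXITING WITH STATUS 0",
    ]
    print(dagman_exit_status(sample))  # prints 0
```

Running this over each dagman.out path that appears in the strace output would show which sub-workflows have already exited while monitord is still polling them. To check a real file, pass an open file object, for example `dagman_exit_status(open(path, errors="replace"))`.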