I have a relatively small workflow with 3 task types (A, B, C) and 64 tasks of each. The first two task types are independent and the third depends on both, so essentially A1 and B1 must both run before C1 can run. I'm running this with an LSF script from the command line, requesting 2 nodes (84 cores).
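For reference, the DAG file I hand to PMC looks roughly like this (the executable paths and arguments are simplified placeholders here, not my real commands):

TASK A1 /path/to/run_a --index 1
TASK B1 /path/to/run_b --index 1
TASK C1 /path/to/run_c --index 1
EDGE A1 C1
EDGE B1 C1

and the same pattern repeats for indices 2 through 64.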
The problem I'm encountering is that I'm getting seg faults on some (not all) of these tasks. When I run the tasks on the command line, I don't get these seg faults, so I think it has something to do with the PMC environment. I don't think it's a memory issue, as the nodes have 512 GB each and my largest jobs only use about 200 MB.
The core dumps I get all seem to point to the same place:
#0 0x00002000007563dc in __add_to_environ () from /lib64/power9/libc.so.6
#1 0x0000000010036978 in TaskHandler::child_process (this=0x7fffcdb53a88)
at worker.cpp:287
#2 0x0000000010038234 in TaskHandler::run_process (this=0x7fffcdb53a88)
at worker.cpp:494
#3 0x0000000010038f24 in TaskHandler::execute (this=0x7fffcdb53a88)
at worker.cpp:667
#4 0x000000001003a350 in Worker::run (this=0x7fffcdb54240) at worker.cpp:966
#5 0x0000000010007d04 in mpidag (argc=9, argv=0x7fffcdb54ad8, comm=...)
at pegasus-mpi-cluster.cpp:415
#6 0x00000000100088bc in main (argc=9, argv=0x7fffcdb54ad8)
at pegasus-mpi-cluster.cpp:442
That line is:
if (setenv("PMC_TASK", this->name.c_str(), 1) < 0) {
so I guess the issue is in the setenv() call.
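My working theory, and this is only a guess from the backtrace rather than anything I've confirmed in the PMC source, is that something corrupts the environment array before the child calls setenv(), since setenv() has to scan environ looking for an existing entry. A tiny standalone program (nothing to do with PMC, just simulating that kind of corruption) produces a very similar-looking crash inside glibc:

#include <cstdlib>

extern char **environ;

int main() {
    // Walk to the NULL terminator at the end of the environment array
    // and replace it with a bogus non-NULL pointer, simulating
    // something scribbling past the end of the array.
    char **end = environ;
    while (*end != nullptr) {
        ++end;
    }
    *end = reinterpret_cast<char *>(0x1);

    // setenv() scans environ to look for an existing PMC_TASK entry
    // and to find the end of the array, so it trips over the bogus
    // entry and seg faults inside glibc's environment-handling code.
    return setenv("PMC_TASK", "A1", 1);
}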
What's particularly odd about all this is that if I request 4 nodes (168 cores) in my batch script, the problem goes away. That's obviously just a workaround, and it makes me nervous, so I'd like to understand what's actually going on. Let me know if you have any ideas about what to check or how to debug this further. Thanks! I'm using the PMC that ships with Pegasus 5.0.1.
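If the corruption theory above is right, the 2-node vs. 4-node difference might just come down to how tasks get packed onto the workers rather than anything node-count-specific, but that's pure speculation on my part. In the meantime, one thing I'm planning to try is a purely local debug hack in my copy of worker.cpp (my own instrumentation idea, not anything from the PMC source): add a helper that walks environ and reports what it finds, and call it right before the failing setenv().

#include <cstdio>

extern char **environ;

// Count and report the environment entries.  The plan is to call this
// in TaskHandler::child_process, right before the failing setenv(),
// to see whether environ is still a sane NULL-terminated array at
// that point.  (If it's already corrupted, this walk will crash too,
// which at least pins the damage to an earlier point.)
static void dump_environ(const char *tag) {
    unsigned long n = 0;
    for (char **ep = environ; *ep != nullptr; ++ep) {
        ++n;
    }
    std::fprintf(stderr, "[%s] environ=%p entries=%lu\n",
                 tag, static_cast<void *>(environ), n);
}

I'd call it as dump_environ(this->name.c_str()) on the line just above the setenv() at worker.cpp:287.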