- Type: Improvement
- Resolution: Fixed
- Priority: Major
- Affects Version/s: master
- Component/s: Worker Tools
On Kraken, we noticed that the pegasus-mpi-cluster stdout/stderr was not streamed back to the submit host for one of the subworkflows.
Relevant email:
Scott,
In case you didn't know, we traced the root cause of the missing stdout/stderr files. It seems that the merge_5.out.000 job was held three times and was aborted with condor_rm. This caused a problem because the Condor job became detached from the underlying PBS job, which kept running. DAGMan then submitted the 001 job, which started running before the 000 job finished. When 000 finished, it unlinked the stdout/stderr files being accessed by 001, which caused the errors in 001.
If you upgrade Pegasus on the submit host (to get the latest pegasus-exitcode) and on Kraken (to get the latest pegasus-mpi-cluster), the problem should be reduced. pegasus-exitcode has been changed so that outputs with tasks=0 are no longer treated as failures, and pegasus-mpi-cluster now uses more unique stdout/stderr file names, so a conflict should be very unlikely.
In the event that two pegasus-mpi-cluster jobs start running the same DAG at the same time, we cannot guarantee that the workflow outputs are not corrupted, even if one or both jobs complete successfully. We need mutual exclusion to guarantee that.
I am going to implement a change to pegasus-mpi-cluster to try to flock() the DAG file to prevent races. I'm not sure whether that will work on Kraken's Lustre file system (it depends on how they configured it). I'll let you know when that is done and we can test it out. Hopefully this will guarantee that, if a pegasus-mpi-cluster job is running, no other pegasus-mpi-cluster job can use the same DAG.
We thought about using a lock file, but we can't guarantee that the lock file will be cleaned up if the pegasus-mpi-cluster job is aborted, killed, or runs out of wall time. flock() gives better guarantees in that regard, since the kernel releases the lock automatically when the process exits.
Gideon
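
A minimal sketch of the flock()-based mutual exclusion described in the email above, assuming a helper named try_lock_dag (a hypothetical name, not the actual pegasus-mpi-cluster code). It takes an exclusive, non-blocking lock on the DAG file and holds it for the lifetime of the process, so the kernel releases the lock even if the job is aborted, killed, or hits its wall-time limit.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

/* Try to take an exclusive, non-blocking lock on the DAG file.
 * Returns the open file descriptor on success; the lock is held
 * until the fd is closed or the process exits, so an aborted or
 * killed job cannot leave a stale lock behind. */
static int try_lock_dag(const char *dag_path)
{
    int fd = open(dag_path, O_RDONLY);
    if (fd < 0) {
        perror("open DAG file");
        return -1;
    }

    if (flock(fd, LOCK_EX | LOCK_NB) < 0) {
        if (errno == EWOULDBLOCK) {
            fprintf(stderr, "Another pegasus-mpi-cluster job is already "
                            "running DAG %s\n", dag_path);
        } else {
            /* e.g. the file system does not support flock() */
            perror("flock DAG file");
        }
        close(fd);
        return -1;
    }

    return fd;
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <dag-file>\n", argv[0]);
        return 1;
    }

    int dag_fd = try_lock_dag(argv[1]);
    if (dag_fd < 0)
        return 1;

    /* ... run the workflow while holding the lock ... */

    close(dag_fd);
    return 0;
}

Note that on Lustre, flock() only works when the clients are mounted with flock support enabled; otherwise the call fails with an error rather than silently succeeding, which is the configuration dependency mentioned in the email.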