Dear Pegasus support,
I am testing your pegasus-mpi-cluster tool, which is of great interest to us.
On a test case, I encountered a reproducible error.
The bug occurs whenever the mpirun -n value is >= 10.
$ mpirun -n 11 pegasus-mpi-cluster test.dag2C
Version: 4.5.0cvs
Compiled: Feb 24 2015 15:12:55
Compiler: 4.4.7 20120313 (Red Hat 4.4.7-11)
MPI: 3.0
OpenMPI: 1.8.1
[info] Setting max cached files = 256
[info] Master starting with 10 workers
[info] Starting workflow
[etna0:8937] *** An error occurred in MPI_Recv
[etna0:8937] *** reported by process [139861257027585,18446603344811130880]
[etna0:8937] *** on communicator MPI_COMM_WORLD
[etna0:8937] *** MPI_ERR_TRUNCATE: message truncated
[etna0:8937] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[etna0:8937] *** and potentially your MPI job)
The DAG file used is:
$ cat test.dag2C
TASK 1 -c 2 sh -c "uname -n && echo 1 && sleep 30"
TASK 2 -c 2 sh -c "uname -n && echo 2 && sleep 30"
TASK 3 -c 2 sh -c "uname -n && echo 3 && sleep 30"
TASK 4 -c 2 sh -c "uname -n && echo 4 && sleep 30"
TASK 5 -c 2 sh -c "uname -n && echo 5 && sleep 30"
TASK 6 -c 2 sh -c "uname -n && echo 6 && sleep 30"
TASK 7 -c 2 sh -c "uname -n && echo 7 && sleep 30"
TASK 8 -c 2 sh -c "uname -n && echo 8 && sleep 30"
TASK 9 -c 2 sh -c "uname -n && echo 9 && sleep 30"
TASK 10 -c 2 sh -c "uname -n && echo 10 && sleep 30"
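In case it helps you reproduce the issue at other worker counts, the DAG above can be regenerated with a short shell loop (the task-count variable N is my own addition; the emitted lines match the file listed above):

```shell
# Regenerate the test DAG for an arbitrary number of tasks.
# N=10 reproduces the file shown above exactly.
N=10
for i in $(seq 1 "$N"); do
  printf 'TASK %d -c 2 sh -c "uname -n && echo %d && sleep 30"\n' "$i" "$i"
done > test.dag2C
```

Each task just prints the hostname and its task number, then sleeps, so the failure should not depend on what the tasks actually do.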
I am on CentOS 6.6 with Open MPI 1.8.1.
Do you have any idea what causes this error?
Which version of MPI is recommended?
Thanks for your help,
David