Dear Pegasus support,
I am testing your pegasus-mpi-cluster tool, which is of great interest to us.
On a test case, I encountered a reproducible error.
The bug occurs whenever the mpirun -n value is >= 10.
$ mpirun -n 11 pegasus-mpi-cluster test.dag2C
Version: 4.5.0cvs
Compiled: Feb 24 2015 15:12:55
Compiler: 4.4.7 20120313 (Red Hat 4.4.7-11)
MPI: 3.0
OpenMPI: 1.8.1
[info] Setting max cached files = 256
[info] Master starting with 10 workers
[info] Starting workflow
[etna0:8937] *** An error occurred in MPI_Recv
[etna0:8937] *** reported by process [139861257027585,18446603344811130880]
[etna0:8937] *** on communicator MPI_COMM_WORLD
[etna0:8937] *** MPI_ERR_TRUNCATE: message truncated
[etna0:8937] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[etna0:8937] *** and potentially your MPI job)
The DAG file used is:
$ cat test.dag2C
TASK 1 -c 2 sh -c "uname -n && echo 1 && sleep 30"
TASK 2 -c 2 sh -c "uname -n && echo 2 && sleep 30"
TASK 3 -c 2 sh -c "uname -n && echo 3 && sleep 30"
TASK 4 -c 2 sh -c "uname -n && echo 4 && sleep 30"
TASK 5 -c 2 sh -c "uname -n && echo 5 && sleep 30"
TASK 6 -c 2 sh -c "uname -n && echo 6 && sleep 30"
TASK 7 -c 2 sh -c "uname -n && echo 7 && sleep 30"
TASK 8 -c 2 sh -c "uname -n && echo 8 && sleep 30"
TASK 9 -c 2 sh -c "uname -n && echo 9 && sleep 30"
TASK 10 -c 2 sh -c "uname -n && echo 10 && sleep 30"
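In case it helps you reproduce the issue at other worker counts, the DAG above can be regenerated with a short shell loop (the task-count variable N is my own addition; the emitted lines match the file listed above):

```shell
# Regenerate the test DAG for an arbitrary number of tasks.
# N=10 reproduces the file shown above exactly.
N=10
for i in $(seq 1 "$N"); do
  printf 'TASK %d -c 2 sh -c "uname -n && echo %d && sleep 30"\n' "$i" "$i"
done > test.dag2C
```

Each task just prints the hostname and its task number, then sleeps, so the failure should not depend on what the tasks actually do.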
I am on CentOS 6.6 with Open MPI 1.8.1.
Do you have any idea what causes this error?
Which version of MPI is recommended?
Thanks for your help,
David