When running multi-threaded codes under pegasus-mpi-cluster on Stampede, I observed all threads for a task being bunched together on a single core. I believe this is caused by Linux CPU affinity, because when I reset the affinity mask with:
taskset -pc 0-15 99657
The task spread out and the 4 threads used 4 cores.
My guess is that the affinity is inherited from the MPI launcher. I think we should reset the affinity with sched_setaffinity() inside pegasus-mpi-cluster. The easiest way is probably to pass 0 as the pid (which means the calling thread) together with a mask that includes every CPU, and to do this early in startup so that it applies to pegasus-mpi-cluster and, by inheritance, to the processes it forks.
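
Here is a minimal sketch of what that reset could look like, assuming Linux with glibc. The function name clear_cpu_affinity is hypothetical and is not taken from the pegasus-mpi-cluster source. It builds a mask covering every configured CPU and applies it to the calling thread, which should have the same effect as the taskset command above.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

/* Hypothetical helper: allow the calling thread to run on every
 * configured CPU. Returns 0 on success, -1 on failure. */
static int clear_cpu_affinity(void)
{
    cpu_set_t mask;
    long ncpus = sysconf(_SC_NPROCESSORS_CONF);

    if (ncpus < 1) {
        return -1;
    }
    /* A fixed-size cpu_set_t can only describe CPU_SETSIZE CPUs. */
    if (ncpus > CPU_SETSIZE) {
        ncpus = CPU_SETSIZE;
    }

    CPU_ZERO(&mask);
    for (long i = 0; i < ncpus; i++) {
        CPU_SET((int)i, &mask);
    }

    /* pid 0 means "the calling thread". */
    return sched_setaffinity(0, sizeof(mask), &mask);
}
```

If this is called once near the start of main(), before any tasks are forked, the children should pick up the widened mask, since an affinity mask is inherited across fork() and preserved across execve(). A failure here is probably best logged as a warning rather than treated as fatal.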