MVAPICH

MVAPICH2 is an open-source implementation of the Message Passing Interface (MPI) designed to deliver high performance, scalability, and fault tolerance on high-end computing systems and servers that use InfiniBand, 10GigE/iWARP, and RoCE networking technologies. MVAPICH2 simplifies porting MPI applications to clusters with NVIDIA GPUs by accepting GPU device memory buffers in standard MPI calls. It optimizes data movement between host and GPU, and between GPUs, while requiring little or no effort from the application developer.

Key Features:

High-performance RDMA-based inter-node MPI point-to-point communication from/to GPU device memory (GPU-GPU, GPU-Host and Host-GPU)
High-performance intra-node MPI point-to-point communication for nodes with multiple GPU adapters (GPU-GPU, GPU-Host and Host-GPU)
Optimized and tuned MPI collective communication from/to GPU device memory
MPI Datatype support in point-to-point and collective communication from/to GPU device memory
Use of CUDA IPC (available since CUDA 4.1) for intra-node communication on nodes with multiple GPU adapters
Efficient synchronization mechanism using CUDA Events for pipelined data transfers from/to GPU device memory

Performance:

The latest performance results using MVAPICH2 for MPI communication from/to/between GPU devices can be found on the OSU Microbenchmark Page.

Availability:

The latest version of MVAPICH2 can be downloaded from http://mvapich.cse.ohio-state.edu/register/. NVIDIA GPU-related features are available in MVAPICH2 releases starting with version 1.8.

MVAPICH Project Page:

http://mvapich.cse.ohio-state.edu/overview/mvapich2/features.shtml