MPI (Message Passing Interface) is a standardized, portable API for exchanging data through messages, both point-to-point and collective, between distributed processes. MPI is widely used in HPC to build applications that scale across multi-node computer clusters. In most MPI implementations, library routines are directly callable from C, C++, and Fortran, as well as from other languages able to interface with such libraries.

MPI is fully compatible with CUDA, CUDA Fortran, and OpenACC, all of which are designed for parallel computing on a single computer or node. There are several reasons to combine the complementary parallel programming approaches of MPI and CUDA (or CUDA Fortran or OpenACC):

  • To solve problems with a data size too large to fit into the memory of a single GPU
  • To solve problems that would require unreasonably long compute time on a single node
  • To accelerate an existing MPI application with GPUs
  • To enable a single-node multi-GPU application to scale across multiple nodes

Regular MPI implementations accept only pointers to host memory, so GPU buffers must first be staged through host memory with cudaMemcpy before they can be sent, and copied back to the GPU after they are received.
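The staging pattern described above can be sketched as follows. This is a minimal illustration rather than a reference implementation: the buffer size N, the names h_buf and d_buf, and the choice of ranks 0 and 1 are assumptions made for the example, and error checking is omitted for brevity.

```cuda
// Sketch: send a GPU buffer from rank 0 to rank 1 with a regular
// (non-CUDA-aware) MPI by staging it through host memory.
// N, h_buf and d_buf are illustrative names, not from the original text.
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)  /* hypothetical element count */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "Run with at least 2 ranks.\n");
        MPI_Finalize();
        return 1;
    }

    float *h_buf = (float *)malloc(N * sizeof(float));  /* host staging buffer */
    float *d_buf = NULL;                                /* device buffer */
    cudaMalloc((void **)&d_buf, N * sizeof(float));

    if (rank == 0) {
        cudaMemset(d_buf, 0, N * sizeof(float));
        /* Step 1: copy device -> host. */
        cudaMemcpy(h_buf, d_buf, N * sizeof(float), cudaMemcpyDeviceToHost);
        /* Step 2: send the host copy. */
        MPI_Send(h_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Step 1: receive into host memory. */
        MPI_Recv(h_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* Step 2: copy host -> device. */
        cudaMemcpy(d_buf, h_buf, N * sizeof(float), cudaMemcpyHostToDevice);
    }

    cudaFree(d_buf);
    free(h_buf);
    MPI_Finalize();
    return 0;
}
```

A program like this would typically be built with the MPI compiler wrapper, linked against the CUDA runtime, and launched with two or more ranks (for example, mpirun -np 2).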

With CUDA-aware MPI, the MPI library can send and receive GPU buffers directly, without having to first stage them in host memory. Implementation of CUDA-aware MPI was simplified by Unified Virtual Addressing (UVA), introduced in CUDA 4.0, which provides a single address space for all CPU and GPU memory. CUDA-aware implementations of MPI have several advantages:

  • CUDA-aware MPI is relatively easy to use
  • Applications run more efficiently with CUDA-aware MPI
    • Operations that carry out the message transfers can be pipelined
    • CUDA-aware MPI takes advantage of the best GPUDirect technology available
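
To illustrate the ease-of-use point, here is a minimal sketch of a transfer in which device pointers are passed straight to MPI_Send and MPI_Recv. It assumes the MPI library in use was built with CUDA support; with a regular MPI build, passing a device pointer this way is expected to fail. The buffer size N and the name d_buf are illustrative, and error checking is omitted.

```cuda
// Sketch: send a GPU buffer from rank 0 to rank 1 with a CUDA-aware MPI.
// Assumption: the MPI library was built with CUDA support.
// N and d_buf are illustrative names, not from the original text.
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define N (1 << 20)  /* hypothetical element count */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "Run with at least 2 ranks.\n");
        MPI_Finalize();
        return 1;
    }

    float *d_buf = NULL;  /* device buffer; no host staging buffer is needed */
    cudaMalloc((void **)&d_buf, N * sizeof(float));

    if (rank == 0) {
        cudaMemset(d_buf, 0, N * sizeof(float));
        /* The device pointer is handed to MPI directly. */
        MPI_Send(d_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* The message lands in GPU memory without an explicit cudaMemcpy. */
        MPI_Recv(d_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

Compared with host staging, the explicit cudaMemcpy calls and the host buffer disappear from the application code; how the data actually moves (pipelined copies, GPUDirect paths) is left to the MPI library.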

With Kepler-class and later GPUs and Hyper-Q, multiple MPI processes can share a GPU.
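
When several MPI processes run on one node, each process has to pick a GPU, and more than one process may end up on the same device. One common way to do this, sketched below, is to derive a node-local rank and map it onto the visible devices. The round-robin mapping and the variable names are assumptions made for illustration; some CUDA-aware MPI setups instead expect the device to be selected before MPI_Init, using a launcher-provided environment variable.

```cuda
// Sketch: map MPI processes on a node onto the GPUs visible to them.
// If there are more processes than GPUs, several processes share a device.
// The round-robin policy and all variable names are illustrative.
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Group the ranks that share a node (MPI-3 feature). */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED,
                        world_rank, MPI_INFO_NULL, &node_comm);

    int local_rank;
    MPI_Comm_rank(node_comm, &local_rank);

    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);

    if (num_devices > 0) {
        /* Round-robin assignment of node-local ranks to devices. */
        int device = local_rank % num_devices;
        cudaSetDevice(device);
        printf("world rank %d (local rank %d) uses GPU %d of %d\n",
               world_rank, local_rank, device, num_devices);
    } else {
        printf("world rank %d found no CUDA device\n", world_rank);
    }

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```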

Implementations of CUDA-aware MPI are available from several sources:

MVAPICH2 is an open source implementation of MPI that simplifies the task of porting MPI applications to clusters with NVIDIA GPUs by supporting standard MPI calls on buffers in GPU device memory.

IBM™ Spectrum MPI is a high-performance, production-quality implementation of MPI designed to accelerate application performance in distributed computing environments.

The Open MPI Project is an open source MPI-2 implementation that is developed and maintained by a consortium of academic, research, and industry partners. GPUs are supported by version 1.7 and later.
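
As a hedged example of checking for CUDA-aware support at build time and at run time, the sketch below uses the Open MPI extension header mpi-ext.h, the macro MPIX_CUDA_AWARE_SUPPORT, and the function MPIX_Query_cuda_support(). These are Open MPI specific and are available in Open MPI 2.0.0 and later, not in every release that supports GPUs; other MPI implementations may offer no equivalent, which the preprocessor guards account for.

```cuda
// Sketch: query whether the MPI library is CUDA-aware.
// The MPIX_* names are Open MPI extensions (Open MPI 2.0.0 and later);
// with other MPI libraries this program only reports "unknown".
#include <mpi.h>
#include <stdio.h>
#if defined(OPEN_MPI) && OPEN_MPI
#include <mpi-ext.h>  /* declares MPIX_Query_cuda_support when available */
#endif

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* Compile-time check: was the library built with CUDA support? */
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    printf("Compile time: CUDA-aware support is available.\n");
#elif defined(MPIX_CUDA_AWARE_SUPPORT) && !MPIX_CUDA_AWARE_SUPPORT
    printf("Compile time: CUDA-aware support is NOT available.\n");
#else
    printf("Compile time: CUDA-aware support is unknown.\n");
#endif

    /* Run-time check: is CUDA support enabled in the library being used? */
#if defined(MPIX_CUDA_AWARE_SUPPORT)
    if (MPIX_Query_cuda_support() == 1) {
        printf("Run time: CUDA-aware support is available.\n");
    } else {
        printf("Run time: CUDA-aware support is NOT available.\n");
    }
#else
    printf("Run time: CUDA-aware support is unknown.\n");
#endif

    MPI_Finalize();
    return 0;
}
```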
