NVSHMEM™ is a parallel programming interface based on OpenSHMEM that provides efficient and scalable communication for NVIDIA GPU clusters. NVSHMEM creates a global address space for data that spans the memory of multiple GPUs and can be accessed with fine-grained GPU-initiated operations, CPU-initiated operations, and operations on CUDA® streams.
Efficient, Strong Scaling
NVSHMEM enables long-running kernels that include both communication and computation, reducing overheads that can limit an application’s performance when strong scaling.
One-sided communication primitives reduce overhead by allowing the initiating process or GPU thread to specify all information required to complete a data transfer. This low-overhead model enables many GPU threads to communicate efficiently.
Asynchronous communications make it easier for programmers to interleave computation and communication, thereby increasing overall application performance.
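The one-sided model described above can be illustrated with a small host-only sketch. It does not call NVSHMEM. `SymmetricHeap`, `put`, `read_local`, and `ring_shift` are hypothetical names introduced for this illustration, and the "PEs" are just slices of one array. In real NVSHMEM code, the write would be issued from a GPU thread with a call such as `nvshmem_int_p`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only model (assumption: not the NVSHMEM API). Each PE owns a
// same-sized slice of the heap, so one offset names the same object on
// every PE.
class SymmetricHeap {
public:
    SymmetricHeap(int n_pes, std::size_t words_per_pe)
        : n_pes_(n_pes),
          words_(words_per_pe),
          data_(static_cast<std::size_t>(n_pes) * words_per_pe, 0) {}

    int n_pes() const { return n_pes_; }

    // One-sided put: the initiator supplies the value, the symmetric offset,
    // and the target PE. The target performs no matching receive.
    void put(std::size_t offset, int value, int target_pe) {
        assert(target_pe >= 0 && target_pe < n_pes_ && offset < words_);
        data_[static_cast<std::size_t>(target_pe) * words_ + offset] = value;
    }

    // Read a word from the slice owned by `pe`.
    int read_local(std::size_t offset, int pe) const {
        assert(pe >= 0 && pe < n_pes_ && offset < words_);
        return data_[static_cast<std::size_t>(pe) * words_ + offset];
    }

private:
    int n_pes_;
    std::size_t words_;
    std::vector<int> data_;
};

// Every PE writes its own rank into slot 0 of its right-hand neighbor.
void ring_shift(SymmetricHeap& heap) {
    for (int pe = 0; pe < heap.n_pes(); ++pe) {
        heap.put(0, pe, (pe + 1) % heap.n_pes());
    }
}
```

With four PEs, calling `ring_shift` leaves each PE holding the rank of its left-hand neighbor, and no PE ever posts a receive.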
What's new in NVSHMEM 2.4.1
- Added limited support for Multiple Processes per GPU (MPG) on x86 platforms.
  - The amount of support depends on the availability of CUDA MPS.
  - MPG support is currently not available on Power 9 platforms.
- Added a local buffer registration API that allows non-symmetric buffers to be used as local buffers in the NVSHMEM API.
- Added support for dynamic symmetric heap allocation, which eliminates the need to specify NVSHMEM_SYMMETRIC_SIZE.
  - On P9 platforms, this feature is disabled by default, and can be enabled by using the NVSHMEM_CUDA_DISABLE_VMM environment variable.
  - On x86 platforms, this feature is enabled by default, and is available with CUDA version 11.3 or later.
- Support for large RMA messages.
- To build NVSHMEM without ibrc support, set NVSHMEM_IBRC_SUPPORT=0 in the environment before you build.
  - This allows you to build and run NVSHMEM without the GDRCopy and OFED dependencies.
- Support for calling nvshmem_init/finalize multiple times with an MPI bootstrap.
- Improved testing coverage (large messages, exercising full GPU memory, and so on).
- Improved the default PE to NIC assignment for NVIDIA DGX-2™ systems.
- Optimized channel request processing by using the CPU proxy thread.
- Added support for the nvshmem_global_exit API.
- Removed redundant barriers to improve the collectives’ performance.
- Significant code refactoring to use templates instead of macros for internal functions.
- Improved performance for device-side blocking RMA and strided RMA APIs.
- Bug fix for buffers with large offsets into the NVSHMEM symmetric heap.
Key Features
- Combines the memory of multiple GPUs into a partitioned global address space that’s accessed through NVSHMEM APIs
- Includes a low-overhead, in-kernel communication API for use by GPU threads
- Includes stream-based and CPU-initiated communication APIs
- Supports x86 and POWER9 processors
- Is interoperable with MPI and other OpenSHMEM implementations
Convolution is a compute-intensive kernel that’s used in a wide variety of applications, including image processing, machine learning, and scientific computing. Spatial parallelization decomposes the domain into sub-partitions that are distributed over multiple GPUs with nearest-neighbor communications, often referred to as halo exchanges.
In the Livermore Big Artificial Neural Network (LBANN) deep learning framework, spatial-parallel convolution is implemented using several communication methods, including MPI and NVSHMEM. The MPI-based halo exchange uses the standard send and receive primitives, whereas the NVSHMEM-based implementation uses one-sided put, yielding significant performance improvements on Lawrence Livermore National Laboratory’s Sierra supercomputer.
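To make the halo-exchange pattern concrete, here is a minimal host-only sketch of a 1-D, 3-point convolution split across several partitions. It is not LBANN's implementation and does not call NVSHMEM. `Partition`, `halo_exchange`, `convolve_serial`, and `convolve_partitioned` are hypothetical names chosen for this illustration. Each partition writes its edge values directly into its neighbors' halo cells, which is the role a one-sided put (for example `nvshmem_double_put`) would play between GPUs.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// One partition of a 1-D domain, stored as [left halo | interior | right halo].
struct Partition {
    std::vector<double> cells;
    explicit Partition(std::size_t interior) : cells(interior + 2, 0.0) {}
    std::size_t interior() const { return cells.size() - 2; }
};

// Put-style exchange: each partition writes its edge values straight into the
// neighbors' halo cells. Halos on the outer domain boundary stay zero.
void halo_exchange(std::vector<Partition>& parts) {
    for (std::size_t p = 0; p < parts.size(); ++p) {
        const std::size_t n = parts[p].interior();
        if (p > 0) {
            Partition& left = parts[p - 1];
            left.cells[left.interior() + 1] = parts[p].cells[1];
        }
        if (p + 1 < parts.size()) {
            parts[p + 1].cells[0] = parts[p].cells[n];
        }
    }
}

// Reference: 3-point convolution over the whole domain with zero padding.
std::vector<double> convolve_serial(const std::vector<double>& in,
                                    const std::array<double, 3>& w) {
    std::vector<double> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const double left = (i > 0) ? in[i - 1] : 0.0;
        const double right = (i + 1 < in.size()) ? in[i + 1] : 0.0;
        out[i] = w[0] * left + w[1] * in[i] + w[2] * right;
    }
    return out;
}

// Same convolution, computed per partition after one halo exchange.
std::vector<double> convolve_partitioned(const std::vector<double>& in,
                                         const std::array<double, 3>& w,
                                         std::size_t n_parts) {
    assert(n_parts > 0 && !in.empty() && in.size() % n_parts == 0);
    const std::size_t chunk = in.size() / n_parts;
    std::vector<Partition> parts(n_parts, Partition(chunk));
    for (std::size_t p = 0; p < n_parts; ++p) {
        for (std::size_t i = 0; i < chunk; ++i) {
            parts[p].cells[i + 1] = in[p * chunk + i];
        }
    }

    halo_exchange(parts);

    std::vector<double> out(in.size());
    for (std::size_t p = 0; p < n_parts; ++p) {
        const std::vector<double>& c = parts[p].cells;
        for (std::size_t i = 1; i <= chunk; ++i) {
            out[p * chunk + i - 1] =
                w[0] * c[i - 1] + w[1] * c[i] + w[2] * c[i + 1];
        }
    }
    return out;
}
```

Once the halos are filled, each partition computes its interior without touching any other partition, so the partitioned result matches the serial reference.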
Efficient Strong-Scaling on Sierra Supercomputer
Efficient Strong-Scaling on NVIDIA DGX SuperPOD
Accelerate Time to Solution
Reducing the time to solution for high-performance scientific computing workloads generally requires an application that strong-scales well. QUDA is a library for lattice quantum chromodynamics (QCD) on GPUs, and it is used by the popular MIMD Lattice Computation (MILC) and Chroma codes.
NVSHMEM-enabled QUDA avoids CPU-GPU synchronization for communication, thereby reducing critical-path latencies and significantly improving strong-scaling efficiency.
The conjugate gradient (CG) method is a popular numerical approach to solving systems of linear equations, and CGSolve is an implementation of this method in the Kokkos programming model. The CGSolve kernel showcases the use of NVSHMEM as a building block for higher-level programming models like Kokkos.
NVSHMEM enables efficient multi-node and multi-GPU execution using Kokkos global array data structures without requiring explicit code for communication between GPUs. As a result, NVSHMEM-enabled Kokkos significantly simplifies development compared to using MPI and CUDA.
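For readers unfamiliar with the conjugate gradient method mentioned above, the following is a minimal single-process sketch in plain C++. It is not the Kokkos CGSolve kernel and uses neither Kokkos nor NVSHMEM. The names `Vec`, `Mat`, `dot`, `matvec`, and `conjugate_gradient` are hypothetical, and the matrix is assumed to be dense, symmetric, and positive definite. In a distributed version, the dot products and the matrix-vector product are the steps that would need data from other GPUs.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // dense rows; assumed symmetric positive definite

double dot(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
}

Vec matvec(const Mat& A, const Vec& x) {
    Vec y(A.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i) y[i] = dot(A[i], x);
    return y;
}

// Solve A x = b, starting from x = 0, until the residual norm drops below tol.
Vec conjugate_gradient(const Mat& A, const Vec& b,
                       double tol = 1e-10, std::size_t max_iters = 1000) {
    Vec x(b.size(), 0.0);
    Vec r = b;  // residual b - A x for the initial guess x = 0
    Vec p = r;  // first search direction
    double rs_old = dot(r, r);

    for (std::size_t k = 0; k < max_iters && std::sqrt(rs_old) > tol; ++k) {
        const Vec Ap = matvec(A, p);
        const double alpha = rs_old / dot(p, Ap);  // step length along p
        for (std::size_t i = 0; i < x.size(); ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        const double rs_new = dot(r, r);
        const double beta = rs_new / rs_old;  // keeps directions A-conjugate
        for (std::size_t i = 0; i < p.size(); ++i) {
            p[i] = r[i] + beta * p[i];
        }
        rs_old = rs_new;
    }
    return x;
}
```

For example, with A = [[4, 1], [1, 3]] and b = [1, 2], the solver converges to approximately x = [1/11, 7/11].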
Productive Programming of Kokkos CGSolve
- Introductory Webinar
- NVSHMEM Documentation
- NVSHMEM API Documentation
- OpenSHMEM Specification
- For questions or to provide feedback, please contact email@example.com