NVSHMEM
NVSHMEM™ is a parallel programming interface based on OpenSHMEM that provides efficient and scalable communication for NVIDIA GPU clusters. NVSHMEM creates a global address space for data that spans the memory of multiple GPUs and can be accessed with fine-grained GPU-initiated operations, CPU-initiated operations, and operations on CUDA® streams.
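As a concrete, minimal illustration of this model (a sketch under typical assumptions, not an excerpt from the NVSHMEM documentation), the example below allocates a symmetric buffer and has each PE write into its neighbor's copy with a fine-grained, GPU-initiated put. It assumes one process per GPU and compilation with nvcc -rdc=true, linked against the NVSHMEM host and device libraries.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// Each PE writes its ID into the next PE's copy of a symmetric buffer
// using a fine-grained, GPU-initiated, one-sided put.
__global__ void write_to_neighbor(int *sym_buf, int my_pe, int n_pes) {
    int peer = (my_pe + 1) % n_pes;
    // The initiating thread supplies everything needed to complete the
    // transfer: the symmetric address, the value, and the target PE.
    nvshmem_int_p(sym_buf, my_pe, peer);
}

int main(void) {
    nvshmem_init();                                        // one PE per process
    int my_pe = nvshmem_my_pe();
    int n_pes = nvshmem_n_pes();
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE)); // one GPU per PE on a node

    // Symmetric allocation: the same buffer exists on every PE, and together
    // these partitions form the global address space.
    int *sym_buf = (int *) nvshmem_malloc(sizeof(int));

    write_to_neighbor<<<1, 1>>>(sym_buf, my_pe, n_pes);
    cudaDeviceSynchronize();   // wait for the kernel to issue its put
    nvshmem_barrier_all();     // complete outstanding puts and sync all PEs

    int received = -1;
    cudaMemcpy(&received, sym_buf, sizeof(int), cudaMemcpyDeviceToHost);
    printf("PE %d received %d\n", my_pe, received);

    nvshmem_free(sym_buf);
    nvshmem_finalize();
    return 0;
}
```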
Efficient, Strong Scaling
NVSHMEM enables long-running kernels that include both communication and computation, reducing overheads that can limit an application’s performance when strong scaling.
Low Overhead
One-sided communication primitives reduce overhead by allowing the initiating process or GPU thread to specify all information required to complete a data transfer. This low-overhead model enables many GPU threads to communicate efficiently.
Naturally Asynchronous
Asynchronous communications make it easier for programmers to interleave computation and communication, thereby increasing overall application performance.
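A rough sketch of that interleaving pattern using the non-blocking device API is shown below; the kernel and buffer names are placeholders, and the compute step is elided.

```cuda
#include <nvshmem.h>

// Placeholder names: `halo` is a symmetric destination buffer on the peer,
// `local` is this PE's source data, and `peer` is the target PE.
__global__ void exchange_and_compute(float *halo, const float *local,
                                     size_t nelems, int peer) {
    // Non-blocking one-sided put: returns once the transfer is initiated.
    nvshmem_float_put_nbi(halo, local, nelems, peer);

    // ... independent computation can proceed here while data is in flight ...

    // Complete all outstanding puts issued by this PE before relying on them.
    nvshmem_quiet();
}
```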
What's New in NVSHMEM 3.3
- Enabled general availability (GA) platform support for Blackwell B200- and GB200 NVL72-based systems. Additionally, enabled SASS support for the Ada architecture.
- Added official Python language bindings (nvshmem4py) enabling symmetric memory management, on-stream RMA, and collective APIs to aid in development of custom kernels using symmetric memory and enable fine-grained communication in native Python. The nvshmem4py package is available via PyPI wheels/conda installers.
- Added support for CUTLASS (CUDA Templates for Linear Algebra Subroutines and Solvers)-compliant, tile-granular NVLS device-side collectives to aid development of fused, distributed GEMM kernels.
- Added support for a flexible team initialization API (nvshmemx_team_init) that accepts an arbitrary set of PEs, enabling non-linear, non-contiguous PE indexing.
- Added support for symmetric user-buffer registration (nvshmemx_buffer_register_symmetric) to enable ML frameworks to “bring-your-own-buffer” (BYOB) for zero-copy communication kernels.
- Added narrow-type (float16, bfloat16) precision support for the NVLS reducescatter collective, plus an LL8 fcollect algorithm for low-latency collectives.
- Added support for device-side nvshmem_broadcastmem and nvshmem_fcollectmem APIs in the library (see the sketch after this list).
- Added support for CUDA module-independent loading using nvshmemx_culibrary_init.
- Added support for leveraging multiple Queue Pairs (QPs) on LAG-bonded NICs for RDMA transports. You can use the NVSHMEM_IB_NUM_RC_PER_DEVICE environment variable to tune this value as desired.
- Added support for randomizing QP assignment for multiple GPU endpoints when communicating over IBGDA transport.
- Added CUDA graph capture capabilities to on-stream collectives’ performance benchmarks via the --cudagraph command-line parameter.
- Enabled clang compilation support for the NVSHMEM host library.
- Improved GPU thread occupancy by 30% for on-stream fcollect when using the NVLS and LL algorithms.
- Improved multi-SM NVLS on-stream collectives to adapt gridDim as a function of the NVLink domain size.
- Improved runtime detection of CUDA VMM support, falling back to legacy pinned memory allocation (cudaMalloc) when the platform does not support VMM.
- Improved resiliency of querying the Global Identifier (GID) via sysfs for RoCE transports in containerized environments.
- Improved the perftest presentation layer to provide an additional count column capturing the total number of elements per operation, independent of datatype size.
- Improved point-to-point signaling latency by 20% by always leveraging the CE-centric cuStreamWriteValue/cuStreamWaitValue APIs.
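To illustrate the new device-side byte-granular collectives mentioned in the list above, here is a hedged sketch. It assumes the device-side call mirrors the documented host-side nvshmem_broadcastmem(team, dest, source, nelems, PE_root) signature and that the kernel is launched with nvshmemx_collective_launch; consult the 3.3 API reference for the exact form.

```cuda
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// Assumption: the device-side call mirrors the host-side
// nvshmem_broadcastmem(team, dest, source, nelems, PE_root) API.
__global__ void bcast_kernel(void *dest, void *src, size_t nbytes) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        // Every PE calls the collective; PE 0 acts as the broadcast root.
        nvshmem_broadcastmem(NVSHMEM_TEAM_WORLD, dest, src, nbytes, 0);
    }
}

// Device-side collectives are typically launched with
// nvshmemx_collective_launch so that all PEs can make progress together.
void run_bcast(void *dest, void *src, size_t nbytes) {
    void *args[] = { &dest, &src, &nbytes };
    nvshmemx_collective_launch((const void *) bcast_kernel, 1, 1, args, 0, 0);
    cudaDeviceSynchronize();
}
```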
Key Features
- Combines the memory of multiple GPUs into a partitioned global address space that’s accessed through NVSHMEM APIs
- Includes a low-overhead, in-kernel communication API for use by GPU threads
- Includes stream-based and CPU-initiated communication APIs (see the sketch after this list)
- Supports x86 and Arm processors
- Is interoperable with MPI and other OpenSHMEM implementations
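As a small illustration of the stream-based, CPU-initiated path, the sketch below enqueues a put and a barrier on a CUDA stream so communication is ordered with the kernels already on that stream. The helper name enqueue_exchange is hypothetical, and dest is assumed to be a symmetric (or NVSHMEM-registered) buffer.

```cuda
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// Hypothetical helper: dest must be a symmetric (or NVSHMEM-registered)
// buffer on the peer PE; src is local device memory.
void enqueue_exchange(void *dest, const void *src, size_t nbytes,
                      int peer, cudaStream_t stream) {
    // Stream-ordered, one-sided put into the peer PE's symmetric buffer.
    nvshmemx_putmem_on_stream(dest, src, nbytes, peer, stream);
    // Stream-ordered barrier: completes outstanding puts and syncs all PEs.
    nvshmemx_barrier_all_on_stream(stream);
}
```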
NVSHMEM Advantages
Increase Performance
Convolution is a compute-intensive kernel that’s used in a wide variety of applications, including image processing, machine learning, and scientific computing. Spatial parallelization decomposes the domain into sub-partitions that are distributed over multiple GPUs with nearest-neighbor communications, often referred to as halo exchanges.
In the Livermore Big Artificial Neural Network (LBANN) deep learning framework, spatial-parallel convolution is implemented using several communication methods, including MPI and NVSHMEM. The MPI-based halo exchange uses the standard send and receive primitives, whereas the NVSHMEM-based implementation uses one-sided put, yielding significant performance improvements on Lawrence Livermore National Laboratory’s Sierra supercomputer.
Efficient Strong-Scaling on Sierra Supercomputer
Efficient Strong-Scaling on NVIDIA DGX SuperPOD
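For context, a simplified halo-exchange step in the spirit of that one-sided approach might look like the sketch below. This is illustrative only (it is not LBANN's code), and the buffer and neighbor names are placeholders.

```cuda
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// Illustrative only: halo_top/halo_bottom are symmetric halo buffers,
// top_row/bottom_row are this PE's boundary rows, and up_pe/down_pe are the
// neighboring PEs (or -1 at a domain boundary).
void halo_exchange(float *halo_top, float *halo_bottom,
                   const float *top_row, const float *bottom_row,
                   size_t row_elems, int up_pe, int down_pe,
                   cudaStream_t stream) {
    // Push my top boundary into the upper neighbor's bottom halo.
    if (up_pe >= 0)
        nvshmemx_float_put_on_stream(halo_bottom, top_row, row_elems, up_pe, stream);
    // Push my bottom boundary into the lower neighbor's top halo.
    if (down_pe >= 0)
        nvshmemx_float_put_on_stream(halo_top, bottom_row, row_elems, down_pe, stream);
    // Make the halos visible before the next convolution step.
    nvshmemx_barrier_all_on_stream(stream);
}
```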
Accelerate Time to Solution
Reducing the time to solution for high-performance, scientific computing workloads generally requires a strong-scalable application. QUDA is a library for lattice quantum chromodynamics (QCD) on GPUs, and it’s used by the popular MIMD Lattice Computation (MILC) and Chroma codes.
NVSHMEM-enabled QUDA avoids CPU-GPU synchronization for communication, thereby reducing critical-path latencies and significantly improving strong-scaling efficiency.
Simplify Development
The conjugate gradient (CG) method is a popular numerical approach to solving systems of linear equations, and CGSolve is an implementation of this method in the Kokkos programming model. The CGSolve kernel showcases the use of NVSHMEM as a building block for higher-level programming models like Kokkos.
NVSHMEM enables efficient multi-node and multi-GPU execution using Kokkos global array data structures without requiring explicit code for communication between GPUs. As a result, NVSHMEM-enabled Kokkos significantly simplifies development compared to using MPI and CUDA.
Productive Programming of Kokkos CGSolve
Resources
- Users of NVSHMEM:
- NVSHMEM Blogs:
- Enhancing Application Portability and Compatibility across New Platforms Using NVIDIA Magnum IO NVSHMEM 3.0
- IBGDA: Improving Network Performance of HPC Systems Using NVIDIA Magnum IO NVSHMEM and GPUDirect Async
- Scaling Scientific Computing with NVSHMEM
- Accelerating NVSHMEM 2.0 Team-Based Collectives Using NCCL
- Introductory Webinar
- NVSHMEM Documentation
- NVSHMEM Best Practices Guide
- NVSHMEM API Documentation
- OpenSHMEM Specification
- NVSHMEM Developer Forum
- For questions or to provide feedback, please contact nvshmem@nvidia.com
- Related libraries and software:
Ready to start developing with NVSHMEM?