NVSHMEM™ is a parallel programming interface based on OpenSHMEM that provides efficient and scalable communication for NVIDIA GPU clusters. NVSHMEM creates a global address space for data that spans the memory of multiple GPUs and can be accessed with fine-grained GPU-initiated operations, CPU-initiated operations, and operations on CUDA® streams.
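The global address space and GPU-initiated access described above can be sketched in a minimal program. This is an illustrative example, not from the NVSHMEM documentation: each PE (processing element) allocates a symmetric buffer and a GPU thread writes directly into a neighbor's copy of it.

```cuda
#include <nvshmem.h>
#include <nvshmemx.h>

// Each PE writes its rank into the symmetric buffer of its right
// neighbor with a one-sided, GPU-initiated put.
__global__ void put_to_neighbor(int *dst) {
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (mype + 1) % npes;
    nvshmem_int_p(dst, mype, peer);  // fine-grained element put
}

int main(void) {
    nvshmem_init();  // default bootstrap (e.g., via the launcher)
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

    // Symmetric allocation: the same buffer exists on every PE.
    int *dst = (int *)nvshmem_malloc(sizeof(int));

    put_to_neighbor<<<1, 1>>>(dst);
    nvshmem_barrier_all();  // complete and order the puts
    cudaDeviceSynchronize();

    nvshmem_free(dst);
    nvshmem_finalize();
    return 0;
}
```

Running this under an NVSHMEM launcher (for example, `nvshmrun -n 2 ./a.out`) requires no receive calls on the target PE; the put alone moves the data.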

Get Started

Existing communication models, such as Message-Passing Interface (MPI), orchestrate data transfers using the CPU. In contrast, NVSHMEM uses asynchronous, GPU-initiated data transfers, eliminating synchronization overheads between the CPU and the GPU.

Efficient, Strong Scaling

NVSHMEM enables long-running kernels that include both communication and computation, reducing overheads that can limit an application’s performance when strong scaling.

Low Overhead

One-sided communication primitives reduce overhead by allowing the initiating process or GPU thread to specify all information required to complete a data transfer. This low-overhead model enables many GPU threads to communicate efficiently.
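As a sketch of that model (illustrative code, with `N` and the kernel name chosen here for the example): every GPU thread can issue its own one-sided write, specifying the target PE, address, and value itself, so no matching receive is needed.

```cuda
#include <nvshmem.h>
#include <nvshmemx.h>

#define N 1024

// Many threads communicate concurrently: each thread issues a
// one-sided element put carrying all information needed to complete
// the transfer (destination PE, address, and value).
__global__ void scatter_elements(float *dst, const float *src, int peer) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        nvshmem_float_p(&dst[i], src[i], peer);
    // Alternatively, the threads of a block can cooperate on one
    // contiguous transfer:
    //   nvshmemx_float_put_block(dst, src, N, peer);
}
```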

Naturally Asynchronous

Asynchronous communications make it easier for programmers to interleave computation and communication, thereby increasing overall application performance.

What's New in NVSHMEM 2.11.0

This NVSHMEM release includes the following key features and enhancements:
  • Added experimental support for Multi-node NVLink (MNNVL) systems when all PEs are connected using the same NVLink network.
  • Added support for multiple ports for the same (or different) NICs for each PE in the IBGDA transport. This feature can be enabled using the NVSHMEM_IBGDA_ENABLE_MULTI_PORT runtime environment variable.
  • Added support for sockets-based bootstrapping of NVSHMEM jobs through the unique ID-based initialization routines nvshmemx_get_uniqueid and nvshmemx_set_attr_uniqueid_args, used with nvshmemx_init_attr(NVSHMEMX_INIT_WITH_UNIQUEID, &attr).
  • Added the nvshmemx_hostlib_init_attr API that allows the NVSHMEM host library-only initialization. This feature is useful for applications that use only the NVSHMEM host API and do not statically link the NVSHMEM device library.
  • Added support for dynamically loading the NVSHMEM host library using dlopen().
  • Introduced nvshmemx_vendor_get_version_info, a new API that queries the runtime library version and allows checks against the NVSHMEM_VENDOR_MAJOR_VERSION, NVSHMEM_VENDOR_MINOR_VERSION, and NVSHMEM_VENDOR_PATCH_VERSION compile time constants.
  • Added the NVSHMEM_IGNORE_CUDA_MPS_ACTIVE_THREAD_PERCENTAGE runtime environment variable, which enables full API support in Multi-Process per GPU (MPG) runs even when CUDA_MPS_ACTIVE_THREAD_PERCENTAGE is not set to 1/PEs.
  • Improved the throughput and bandwidth performance of the IBGDA transport.
  • Improved the performance of the nvshmemx_quiet_on_stream API with the IBGDA transport by leveraging multiple CUDA threads to perform the IBGDA quiet operation.
  • Enabled relaxed ordering by default for InfiniBand transports.
  • Added the NVSHMEM_IB_ENABLE_RELAXED_ORDERING runtime environment variable that can be set to 0 to disable relaxed ordering.
  • Increased the number of threads that are launched to execute the nvshmemx_<typename>_<op>_reduce_on_stream APIs.
  • Added the NVSHMEM_DISABLE_DMABUF runtime environment variable to disable dmabuf usage.
  • Added a fix in the IBGDA transport that allows message transfers that are larger than the maximum size that is supported by a NIC work request.
  • Includes code refactoring and bug fixes.
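The unique ID-based initialization added in this release can be sketched as follows. How the ID travels from PE 0 to the other PEs (sockets, MPI, a shared file) is application-specific; the broadcast_id helper below is a placeholder for that step, not an NVSHMEM function.

```cuda
#include <nvshmem.h>
#include <nvshmemx.h>

// Placeholder: application-specific distribution of the unique ID
// from rank 0 to all other ranks (e.g., over sockets or MPI).
extern void broadcast_id(nvshmemx_uniqueid_t *id, int my_rank);

void init_with_uniqueid(int my_rank, int nranks) {
    nvshmemx_uniqueid_t id    = NVSHMEMX_UNIQUEID_INITIALIZER;
    nvshmemx_init_attr_t attr = NVSHMEMX_INIT_ATTR_INITIALIZER;

    if (my_rank == 0)
        nvshmemx_get_uniqueid(&id);  // PE 0 creates the unique ID
    broadcast_id(&id, my_rank);      // share it with all other PEs

    nvshmemx_set_attr_uniqueid_args(my_rank, nranks, &id, &attr);
    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_UNIQUEID, &attr);
}
```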

Key Features

  • Combines the memory of multiple GPUs into a partitioned global address space that’s accessed through NVSHMEM APIs
  • Includes a low-overhead, in-kernel communication API for use by GPU threads
  • Includes stream-based and CPU-initiated communication APIs
  • Supports x86 and Arm processors
  • Is interoperable with MPI and other OpenSHMEM implementations
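The stream-based APIs in the list above queue communication onto a CUDA stream so it orders naturally with the kernels on that stream. A minimal sketch, with buffer sizes and the peer choice purely illustrative:

```cuda
#include <nvshmem.h>
#include <nvshmemx.h>

// CPU-initiated, stream-ordered communication: the put and the
// barrier execute on the given stream, after any work already
// enqueued on it, without blocking the host.
void exchange_on_stream(float *dst, const float *src, size_t nelems,
                        int peer, cudaStream_t stream) {
    nvshmemx_float_put_on_stream(dst, src, nelems, peer, stream);
    nvshmemx_barrier_all_on_stream(stream);  // data visible on all PEs
}
```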

NVSHMEM Advantages

Increase Performance

Convolution is a compute-intensive kernel that’s used in a wide variety of applications, including image processing, machine learning, and scientific computing. Spatial parallelization decomposes the domain into sub-partitions that are distributed over multiple GPUs with nearest-neighbor communications, often referred to as halo exchanges.

In the Livermore Big Artificial Neural Network (LBANN) deep learning framework, spatial-parallel convolution is implemented using several communication methods, including MPI and NVSHMEM. The MPI-based halo exchange uses the standard send and receive primitives, whereas the NVSHMEM-based implementation uses one-sided put, yielding significant performance improvements on Lawrence Livermore National Laboratory’s Sierra supercomputer.
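A put-based halo exchange of the kind described above might look like the following simplified 1-D sketch. The data layout and names are illustrative, not LBANN's actual implementation: each PE's field holds a top halo row, interior rows, and a bottom halo row, and each PE pushes its boundary rows directly into its neighbors' halo regions.

```cuda
#include <nvshmem.h>
#include <nvshmemx.h>

// Simplified 1-D halo exchange with one-sided puts. Rows: 0 is the
// top halo, 1..interior are interior rows, interior+1 is the bottom
// halo. Launch with a single cooperative block.
__global__ void halo_exchange(float *field, int interior, int width) {
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int up   = (mype - 1 + npes) % npes;
    int down = (mype + 1) % npes;

    // My first interior row fills the upper neighbor's bottom halo.
    nvshmemx_float_put_block(&field[(interior + 1) * width],
                             &field[1 * width], width, up);
    // My last interior row fills the lower neighbor's top halo.
    nvshmemx_float_put_block(&field[0],
                             &field[interior * width], width, down);
}
```

A barrier or quiet (for example, nvshmem_barrier_all) is still required after the kernel before the halo values are read, but no receive-side code is needed.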

Efficient Strong-Scaling on Sierra Supercomputer

Efficient Strong-Scaling on NVIDIA DGX SuperPOD

Accelerate Time to Solution

Reducing the time to solution for high-performance, scientific computing workloads generally requires a strong-scalable application. QUDA is a library for lattice quantum chromodynamics (QCD) on GPUs, and it’s used by the popular MIMD Lattice Computation (MILC) and Chroma codes.

NVSHMEM-enabled QUDA avoids CPU-GPU synchronization for communication, thereby reducing critical-path latencies and significantly improving strong-scaling efficiency.

Watch the GTC 2020 Talk

Simplify Development

The conjugate gradient (CG) method is a popular numerical approach to solving systems of linear equations, and CGSolve is an implementation of this method in the Kokkos programming model. The CGSolve kernel showcases the use of NVSHMEM as a building block for higher-level programming models like Kokkos.

NVSHMEM enables efficient multi-node and multi-GPU execution using Kokkos global array data structures without requiring explicit code for communication between GPUs. As a result, NVSHMEM-enabled Kokkos significantly simplifies development compared to using MPI and CUDA.

Productive Programming of Kokkos CGSolve

Ready to start developing with NVSHMEM?

Get Started