
Enhancing Application Portability and Compatibility across New Platforms Using NVIDIA Magnum IO NVSHMEM 3.0

NVSHMEM is a parallel programming interface that provides efficient and scalable communication for NVIDIA GPU clusters. Part of NVIDIA Magnum IO and based on OpenSHMEM, NVSHMEM creates a global address space for data that spans the memory of multiple GPUs and can be accessed with fine-grained GPU-initiated operations, CPU-initiated operations, and operations on CUDA streams.

Existing communication models, such as the Message Passing Interface (MPI), orchestrate data transfers using the CPU. In contrast, NVSHMEM uses asynchronous, GPU-initiated data transfers, eliminating synchronization overheads between the CPU and the GPU.
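To make the contrast concrete, the following minimal sketch follows the familiar NVSHMEM ring-shift pattern: each PE launches a CUDA kernel that writes its ID directly into its right neighbor's symmetric memory, so the CPU only launches the kernel and never touches the data path. Build flags and the one-GPU-per-PE device selection are assumptions for this sketch.

```cpp
// ring_shift.cu -- minimal GPU-initiated put over NVSHMEM symmetric memory.
// Example build (flags and paths may vary): nvcc -rdc=true ring_shift.cu -lnvshmem
#include <cstdio>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

__global__ void simple_shift(int *destination) {
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (mype + 1) % npes;
    // One-sided, GPU-initiated put: the GPU writes into the peer's symmetric
    // memory with no CPU involvement in the data path.
    nvshmem_int_p(destination, mype, peer);
}

int main(void) {
    int msg = -1;
    cudaStream_t stream;

    nvshmem_init();
    // Assumption: one GPU per PE within the node.
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));
    cudaStreamCreate(&stream);

    // Symmetric allocation: the same-sized buffer exists on every PE.
    int *destination = (int *)nvshmem_malloc(sizeof(int));

    simple_shift<<<1, 1, 0, stream>>>(destination);
    nvshmemx_barrier_all_on_stream(stream);   // order the put before the copy-back
    cudaMemcpyAsync(&msg, destination, sizeof(int), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    printf("PE %d received %d\n", nvshmem_my_pe(), msg);

    nvshmem_free(destination);
    nvshmem_finalize();
    return 0;
}
```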

This post dives into the details of the NVSHMEM 3.0 release, including new features and support that we are enabling across platforms and systems.

Two workflow diagrams depict the difference between an MPI workflow and an NVSHMEM workflow. The MPI workflow shows a single in and out data stream from the GPU to the network, while the NVSHMEM workflow shows multiple parallel streams directly from the GPU to the network.
Figure 1. NVSHMEM and MPI comparison

New features and interface support in NVSHMEM 3.0

NVSHMEM 3.0 introduces multi-node, multi-interconnect support, host-device ABI backward compatibility, and CPU-assisted InfiniBand GPU Direct Async (IBGDA).

Multi-node, multi-interconnect support 

Previously, NVSHMEM supported connectivity between multiple GPUs within a node over P2P interconnects (NVIDIA NVLink/PCIe) and between GPUs across nodes over RDMA interconnects, such as InfiniBand, RDMA over Converged Ethernet (RoCE), and Slingshot (Figure 2).

This is a topology diagram with four compute nodes on the bottom layer, each containing two GPUs connected by NVSwitch and a NIC, directly interconnected through the NICs in a criss-cross fashion to two InfiniBand switches above.
Figure 2. A topology view of multiple nodes connected over RDMA networks

NVSHMEM 2.11 added support for Multi-Node NVLink (MNNVL) systems, such as NVIDIA GB200 NVL72 (Figure 3). However, this support was limited to configurations where NVLink was the only inter-node interconnect.

This is an architectural diagram with 18 compute nodes arranged in a matrix format, each node containing 4 GPUs and 2 CPUs, interconnected in the middle by 9 NVLink Switch trays, each with 2 NVLink Switch chips, showing how all 72 GPUs are connected within a single NVIDIA GB200 NVL72 rack cabinet.
Figure 3. A topology view of a single NVIDIA GB200 NVL72 rack

To address this limitation, NVSHMEM 3.0 adds platform support for multiple NVIDIA GB200 NVL72 racks connected to each other through RDMA networks (Figure 4). With this support, when two GPUs are part of the same NVLink fabric (for example, within the same NVIDIA GB200 NVL72 rack), NVLink is used for communication.

In addition, when the GPUs are spread across NVLink fabrics (for example, across NVIDIA GB200 NVL72 racks), the remote RDMA network is used for communication between those GPUs. This release also enhances NVSHMEM_TEAM_SHARED to contain all GPUs that are part of the same NVLink clique, which may span one or more nodes.
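As a quick illustration, an application can query the predefined NVSHMEM_TEAM_SHARED team to discover how many PEs fall within its NVLink-connected clique. The host-side sketch below assumes NVSHMEM has already been initialized by the application:

```cpp
#include <cstdio>
#include <nvshmem.h>

// Host-side sketch: report how many PEs share this PE's NVLink-connected clique.
// Assumes nvshmem_init() has already been called by the application.
void report_shared_team(void) {
    int world_pe   = nvshmem_my_pe();                         // rank in NVSHMEM_TEAM_WORLD
    int shared_pe  = nvshmem_team_my_pe(NVSHMEM_TEAM_SHARED); // rank within the clique
    int shared_num = nvshmem_team_n_pes(NVSHMEM_TEAM_SHARED); // size of the clique

    printf("PE %d: NVSHMEM_TEAM_SHARED rank %d of %d\n",
           world_pe, shared_pe, shared_num);
}
```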

An architectural diagram showing many NVIDIA GB200 NVL72 racks (as shown in the Figure 3 diagram), all interconnected by multiple InfiniBand box switches above them.
Figure 4. A topology view of multiple NVIDIA GB200 NVL72 racks connected over RDMA networks

Host-device ABI backward compatibility 

Historically, NVSHMEM has not supported backward compatibility for applications or independently compiled bootstrap plug-ins. NVSHMEM 3.0 introduces backward compatibility across NVSHMEM minor versions. An ABI break is denoted by a change in the major version of NVSHMEM.

This feature enables the following use cases for libraries or applications that consume ABI-compatible versions of NVSHMEM:

  • Libraries linked against NVSHMEM minor version 3.X can run on a system with a newer version 3.Y of NVSHMEM installed (Y > X).
  • Multiple libraries shipped together in an SDK, which link to different minor versions of NVSHMEM, will be supported by a single newer version of NVSHMEM.
  • A CUDA binary (also referred to as cubin) statically linked to an older version of the NVSHMEM device library can be loaded by a library using a newer version of NVSHMEM.
| | NVSHMEM Host Library 2.12 | NVSHMEM Host Library 3.0 | NVSHMEM Host Library 3.1 | NVSHMEM Host Library 3.2 | NVSHMEM Host Library 4.0 |
|---|---|---|---|---|---|
| Application linked to NVSHMEM 3.1 | No | No | Yes | Yes | No |
| Cubin linked to NVSHMEM 3.0 | No | Yes | Yes | Yes | No |
| Multiple libraries (Lib 1: NVSHMEM 3.1, Lib 2: NVSHMEM 3.2) | No | No | No | Yes | No |
Table 1. Compatibility for NVSHMEM 3.0 and future versions
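At run time, a library can confirm that the installed NVSHMEM host library falls inside the compatibility window shown in Table 1. The sketch below uses the nvshmem_info_get_version query; the NVSHMEM_MAJOR_VERSION and NVSHMEM_MINOR_VERSION macro names are assumed here and should be checked against the headers of your release.

```cpp
#include <nvshmem.h>

// Sketch: verify that the NVSHMEM host library loaded at run time is ABI-compatible
// with the headers this binary was compiled against. Per the rules above, only a
// major-version change breaks the ABI, and the runtime minor version must not be
// older than the compile-time minor version.
// Note: the NVSHMEM_MAJOR_VERSION/NVSHMEM_MINOR_VERSION macro names are assumptions.
bool nvshmem_runtime_is_compatible(void) {
    int major = 0, minor = 0;
    nvshmem_info_get_version(&major, &minor);   // version of the loaded host library

    if (major != NVSHMEM_MAJOR_VERSION) return false;   // ABI break
    if (minor < NVSHMEM_MINOR_VERSION)  return false;   // runtime older than build
    return true;
}
```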

CPU-assisted InfiniBand GPU Direct Async 

In previous releases, NVSHMEM supported traditional InfiniBand GPU Direct Async (IBGDA), where the GPU directly drives the InfiniBand NIC, enabling massively parallel control plane operations. In this mode, the GPU is responsible for managing the network interface card (NIC) control plane, for example, ringing the doorbell when a new work request is posted to the NIC.

NVSHMEM 3.0 adds support for a new IBGDA modality called CPU-assisted IBGDA, which acts as an intermediate mode between proxy-based networking and traditional IBGDA. It splits the control plane responsibilities between the GPU and the CPU: the GPU generates the work requests (control plane operations), and the CPU rings the NIC doorbell for the submitted work requests. It also enables dynamic selection of the NIC assistant (CPU or GPU) at runtime.

CPU-assisted IBGDA relaxes existing administrative-level configuration constraints in IBGDA peer mapping, thereby helping to improve IBGDA adoption on non-coherent platforms where administrative level configuration constraints are harder to enforce in large-scale cluster deployments.
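Selection of the NIC assistant is runtime-configurable. A minimal sketch of opting in before initialization follows; the environment-variable names NVSHMEM_IB_ENABLE_IBGDA and NVSHMEM_IBGDA_NIC_HANDLER are assumptions based on the NVSHMEM environment-variable convention and should be confirmed against the release documentation.

```cpp
#include <stdlib.h>
#include <nvshmem.h>

// Sketch: enable IBGDA and request the CPU-assisted NIC handler before
// initializing NVSHMEM. The variable names below are assumptions; the same
// settings can equally be exported in the job launch environment instead.
int main(void) {
    setenv("NVSHMEM_IB_ENABLE_IBGDA", "1", 1);      // opt in to the IBGDA transport
    setenv("NVSHMEM_IBGDA_NIC_HANDLER", "cpu", 1);  // CPU rings the NIC doorbell

    nvshmem_init();
    // ... application communication as usual; the GPU-initiated data path is unchanged ...
    nvshmem_finalize();
    return 0;
}
```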

There are two block-based diagrams comparing traditional InfiniBand GPU Direct Async (IBGDA) communication with CPU-assisted IBGDA communication. The traditional IBGDA diagram shows a green block on top and a red block below it connected by a line, while the CPU-assisted IBGDA diagram shows the same two blocks with a blue CPU block drawn in the middle.
Figure 5. Comparison between traditional and CPU-assisted IBGDA

Non-interface support and minor enhancements

NVSHMEM 3.0 also introduces minor enhancements and non-interface support, as detailed in this section. 

Object-oriented programming framework for symmetric heap 

Historically, NVSHMEM has supported multiple symmetric heap kinds using a procedural programming model. This approach has limitations, such as a lack of namespace-based data encapsulation, code duplication, and data redundancy.

NVSHMEM 3.0 introduces support for an object-oriented programming (OOP) framework that can manage different kinds of symmetric heaps, such as static device memory and dynamic device memory, using multi-level inheritance. This enables easier extension to advanced features, such as on-demand registration of application buffers to the symmetric heap, in future releases.
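The class names below are purely illustrative, not NVSHMEM's internal types; they sketch how multi-level inheritance can let one base interface manage several symmetric-heap kinds without duplicating code.

```cpp
#include <cstddef>

// Illustrative only: hypothetical classes sketching a multi-level hierarchy
// in which a common base interface manages multiple symmetric-heap kinds.
class SymmetricHeap {                          // shared interface and bookkeeping
public:
    virtual ~SymmetricHeap() = default;
    virtual void *allocate(std::size_t bytes) = 0;
    virtual void release(void *ptr) = 0;
};

class DeviceHeap : public SymmetricHeap {      // behavior common to device-memory heaps,
                                               // e.g., registering the heap with transports
};

class StaticDeviceHeap : public DeviceHeap {   // fixed-size heap reserved at init time
public:
    void *allocate(std::size_t) override { return nullptr; }  // allocation policy elided
    void release(void *) override {}
};

class DynamicDeviceHeap : public DeviceHeap {  // heap that can grow on demand
public:
    void *allocate(std::size_t) override { return nullptr; }  // allocation policy elided
    void release(void *) override {}
};
```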

Diagram showing five boxes: two green boxes stacked on the left, two red boxes stacked on the right, and a single black box at the top middle. The green and red boxes are not directly connected to each other; instead, each connects to the black box at the top through bidirectional lines.
Figure 6. NVSHMEM 3.0 object-oriented hierarchy

Performance improvements and bug fixes

NVSHMEM 3.0 introduces various performance improvements and bug fixes for different components and scenarios. These include IBGDA setup, block-scoped on-device reductions, system-scoped atomic memory operations (AMOs), InfiniBand queue pair (QP) mapping, memory registration, team management, and PTX build testing.

Summary

The 3.0 release of the NVIDIA NVSHMEM parallel programming interface introduces several new features, including multi-node, multi-interconnect support, host-device ABI backward compatibility, CPU-assisted InfiniBand GPU Direct Async (IBGDA), and an object-oriented programming framework for the symmetric heap.

With host-device ABI backward compatibility, administrators can update to a newer version of NVSHMEM without breaking already compiled applications, eliminating the need to rebuild applications with each update.

CPU-assisted InfiniBand GPU Direct Async (IBGDA) enables users to benefit from the high message rate of the IBGDA transport on clusters where enforcing administrative level driver settings is not possible.

To learn more and get started, see the following resources:
