Simulation / Modeling / Design

Improving Network Performance of HPC Systems Using NVIDIA Magnum IO NVSHMEM and GPUDirect Async

Today’s leading-edge high performance computing (HPC) systems contain tens of thousands of GPUs. In NVIDIA systems, GPUs are connected on nodes through the NVLink scale-up interconnect, and across nodes through a scale-out network like InfiniBand. The software libraries that GPUs use to communicate, share work, and efficiently operate in parallel are collectively called NVIDIA Magnum IO, the architecture for parallel, asynchronous, and intelligent data center IO.

For many applications, scaling to such large systems requires high efficiency for fine-grain communication between GPUs. This is especially critical for workloads targeting strong scaling, where computing resources are added to reduce the time to solve a given problem.

NVIDIA Magnum IO NVSHMEM is a communication library that is based on the OpenSHMEM specification and provides a partitioned global address space (PGAS) data access model to the memory of all GPUs in an HPC system.

This library is an especially suitable and efficient tool for workloads targeting strong scaling because of its support for GPU-integrated communication. In this model, data is accessed through one-sided read, write, and atomic update communication routines.

This communication model achieves a high efficiency for fine-grain data access through NVLink because of the tight integration with the GPU architecture. However, high efficiency for internode data access has remained a challenge because of the need for the host CPU to manage communication operations.

This post introduces a new communication methodology in NVSHMEM called InfiniBand GPUDirect Async (IBGDA) built on top of the GPUDirect Async family of technologies. IBGDA was introduced in NVSHMEM 2.6.0 and significantly improved upon in NVSHMEM 2.7.0 and 2.8.0. It enables the GPU to bypass the CPU when issuing internode NVSHMEM communication without any changes to existing applications. As we show, this leads to significant improvements in throughput and scaling for applications using NVSHMEM.

Proxy-initiated communication

Control flow diagram with four components: GPU, host memory, CPU, and NIC. The diagram marks the control flow with numbers corresponding to the sequence of operations described in this section.
Figure 1. GPU communications using the CPU proxy to initiate NIC communications causes communication bottlenecks

Using NVLink for intranode communication can be achieved through GPU streaming multiprocessor (SM)–initiated load and store instructions. However, internode communication involves submitting a work request to a network interface controller (NIC) to perform an asynchronous data transfer operation.

Before the introduction of IBGDA, the NVSHMEM InfiniBand Reliable Connection (IBRC) transport used a proxy thread on the CPU to manage communication (Figure 1). When using a proxy thread, NVSHMEM performs the following sequence of operations:

  1. The application launches a CUDA kernel that produces data in GPU memory.
  2. The application calls an NVSHMEM operation (such as nvshmem_put) to communicate with another processing element (PE). This operation can be called from within a CUDA kernel when performing fine-grain or overlapped communication. The NVSHMEM operation writes a work descriptor to the proxy buffer, which is in the host memory.
  3. The NVSHMEM proxy thread detects the work descriptor and initiates the corresponding network operation.

The following steps describe the sequence of operations performed by the proxy thread when interacting with an NVIDIA InfiniBand host channel adapter (HCA), such as the ConnectX-6 HCA:

  1. The CPU creates a work descriptor and enqueues it on the work queue (WQ) buffer, which resides in the host memory.
  2. This descriptor indicates the requested operation such as an RDMA write, and contains the source address, destination address, size, and other necessary network information.
  3. The CPU updates the doorbell record (DBR) buffer in the host memory. This buffer is used in the recovery path in case the NIC drops the write to its doorbell (DB).
  1. The CPU notifies the NIC by writing to its DB, which is a register in the NIC hardware.
  2. The NIC reads the work descriptor from the WQ buffer.
  3. The NIC directly copies the data from the GPU memory using GPUDirect RDMA.
  4. The NIC transfers the data to the remote node.
  5. The NIC indicates that the network operation is completed by writing an event to the completion queue (CQ) buffer on the host memory.
  6. The CPU polls on the CQ buffer to detect completion of the network operation.
  7. The CPU notifies the GPU that the operation has completed. If GDRCopy is present, it writes a notification flag to the GPU memory directly. Otherwise, it writes that flag to the proxy buffer. The GPU polls on the corresponding memory for the status of the work request.

While this approach is portable and can provide high bandwidth for bulk data transfers, it has two major drawbacks:

  • CPU cycles are continuously consumed by the proxy thread.
  • You cannot reach the peak NIC throughput for fine-grain transfers because of a bottleneck at the proxy thread. Modern NICs can process hundreds of millions of communication requests per second. While the GPU can generate requests at this rate, the CPU proxy’s processing rate is orders of magnitude lower, creating a bottleneck for fine-grain communication patterns.

InfiniBand GPUDirect Async

Control flow diagram shows two components: GPU and NIC, and how a GPU SM submits work descriptors to the NIC. The control flow arrows are marked with numbers, which are described in this section.
Figure 2. GPU communications using IBGDA enables a direct control path from GPU SM to NIC and removes the CPU from the critical path

In contrast with proxy-initiated communication, IBGDA uses GPUDirect Async–Kernel-Initiated (GPUDirect Async–KI) to enable the GPU SM to interact directly with NIC. This is shown in Figure 2 and involves the following steps.

  1. The application launches a CUDA kernel that produces data in the GPU memory.
  2. The application calls an NVSHMEM operation (such as nvshmem_put) to communicate with another PE. The NVSHMEM operation uses an SM to create a NIC work descriptor and writes it directly to the WQ buffer. Unlike the CPU Proxy method, this WQ buffer resides in the GPU memory.
  3. The SM updates the DBR buffer, which is also located in the GPU memory.
  4. The SM notifies the NIC by writing to the NIC’s DB register.
  5. The NIC reads the work descriptor in the WQ buffer using GPUDirect RDMA.
  6. The NIC reads the data in the GPU memory using GPUDirect RDMA.
  7. The NIC transfers the data to the remote node.
  8. The NIC notifies the GPU that the network operation is completed by writing to the CQ buffer using GPUDirect RDMA.

As shown, IBGDA eliminates the CPU from the communication control path. When using IBGDA, GPU and NIC directly exchange information necessary for communication. The WQ and DBR buffers are also moved to the GPU memory to improve efficiency when accessed by the SM, while preserving access by the NIC through GPUDirect RDMA.

Magnum IO NVSHMEM evaluation

We compared the performance of the NVSHMEM IBGDA transport with the NVSHMEM IBRC transport, which uses a proxy thread to manage communication. Both transports are part of the standard NVSHMEM distribution. All benchmarks and the case study were run on four DGX-A100 servers connected through NVIDIA ConnectX-6 200 Gb/s InfiniBand networking and an NVIDIA Quantum HDR switch.

To highlight the effect of IBGDA, we disabled communication through NVLink. This forces all transfers to be performed through the InfiniBand network even when PEs are located on the same node.

One-sided put bandwidth

We first ran the shmem_put_bw benchmark, which is included in the NVSHMEM performance test suite and uses nvshmem_double_put_nbi_block to issue data transfers. This test measures the bandwidth achieved when using one-sided write operations to transfer a fixed amount of total data over a range of communication parameters.

For internode transfers, this operation uses one thread in the thread block when performing network communication, regardless of how many threads are in the thread block. This is known and also referred to as a cooperative thread array (CTA). Two PEs were launched on different DGX-A100 nodes. This was set with one thread per thread block and one QP (NIC queue pair, containing the WQ and CQ) per thread block.

Graph x-axis depicts message sizes, while the y-axis represents bandwidth. Both axes are in base-2 logarithmic scale. Each line portrays the bandwidth at various message sizes of a given number of CTAs and QPs, which results in a roofline pattern. The graph reveals that IBRC cannot increase the bandwidth of small messages as the number of CTAs and QPs increase beyond four.
Figure 3. The bandwidth of shmem_put_bw with IBRC shows the bandwidth cap caused by CPU Proxy for small message sizes as you scale to more QPs and CTAs
Graph x-axis depicts message sizes, while the y-axis represents bandwidth. Both axes are in base-2 logarithmic scale. Each line portrays the bandwidth at various message sizes of a given number of CTAs and QPs, which results in a roofline pattern. The graph shows that the bandwidth of small message sizes increases as you scale the number of CTAs and QPs.
Figure 4. The bandwidth of shmem_put_bw with IBGDA demonstrates that the bandwidth of small message sizes can scale as the number of CTAs and QPs increases

Figures 3 and 4 show the bandwidth of shmem_put_bw with IBRC and IBGDA at various numbers of CTAs and message sizes. As shown, for coarse-grain communication with large messages, both IBGDA and IBRC can reach peak bandwidth. IBRC can saturate the network with messages as small as 16 KiB when the application issues communication from at least four CTAs.

Increasing the number of CTAs further does not reduce the minimum message size at which we observed peak bandwidth. The bottleneck limiting bandwidth for smaller messages is in the CPU Proxy thread. Although not shown here, we also tried increasing the number of CPU proxy threads and observed similar behavior.

By removing the proxy bottleneck, IBGDA achieved the peak bandwidth with messages as small as 2 KiB when 64 CTAs issue communication. This result highlights IBGDA’s ability to support a higher level of communication parallelism and the resulting performance improvement.

For IBRC and IBGDA, only one thread in each CTA participates in the network operations. In other words, only 64 threads (not 1,024×64 threads) are required to achieve peak bandwidth at 2 KiB message size.

It is also shown that IBGDA bandwidth continues to scale with the number of CTAs performing communication, whereas the IBRC proxy reaches its scaling limit at four CTAs. As a result, IBGDA provides up to 9.5x higher throughput for NVSHMEM block-put operations with message sizes less than 1 KiB.

Boosting throughput for small messages

The shmem_p_bw benchmark uses the scalar nvshmem_double_p operation to send data directly from a GPU register to the remote PE. This operation is thread-scoped, which means that each thread that calls this operation transfers an 8-byte data word.

In the following experiment, we launched 1,024 threads per CTA and increased the number of CTAs while keeping the number of QPs equal to the number of CTAs.

Graph shows the put rate of shmem_p_bw. The x-axis represents the number of CTAs, while the y-axis represents the put rate in million operations per second (MOPS). As you scale the number of CTAs, the put rate of IBRC is capped around 1.7 MOPS.
Figure 5. The put rate comparison between IBRC and IBGDA of shmem_p_bw shows the performance advantage of IBGDA on sending millions of small messages per second

On the other hand, the put rate of IBGDA could reach up to 180 MOPS, approaching the peak NIC message rate limit at 215 MOPS. The graph also shows that IBGDA could reach almost 2000 MOPS if the coalescing conditions are satisfied.

Figure 5 shows that the put rate, in million operations per second (MOPS), of IBRC is capped at around 1.7 MOPS regardless of the number of CTAs and QPs. On the other hand, the message rate of IBGDA increases with the number of CTAs, approaching the 215 MOPS hardware limit of the NVIDIA ConnectX-6 InfiniBand NIC with just eight CTAs.

In this configuration, IBGDA issues one work request to the NIC per nvshmem_double_p operation. This highlights an advantage of IBGDA for fine-grain communication involving large numbers of small messages.

IBGDA also provides automatic data coalescing when the destination addresses are contiguous within the same warp. This feature enables sending one large message instead of 32 small messages. It is useful for applications that want to transfer scattered data directly from GPU registers to a contiguous buffer at the destination.

Figure 5 shows that the put rate could reach beyond the NIC peak message rate when the data coalescing conditions are satisfied.

Jacobi method case study

The performance of the NVSHMEM Jacobi benchmark was analyzed to demonstrate the performance of IBGDA in real applications compared with IBRC. This repository includes two NVSHMEM implementations of the Jacobi solver.

In the first implementation, each thread uses the scalar nvshmem_p operation to send data as soon as it becomes available. This implementation is known to work well with NVLink but not with IBRC.

The second implementation aggregates data into a contiguous GPU buffer before each CTA calls nvshmem_put_nbi_block to initiate the communication. This data aggregation technique works well with IBRC but adds overhead on NVLink, where nvshmem_p operations can directly store data from a register to the remote PE’s buffer. This mismatch highlights a challenge when optimizing a given code for both scaling up and scaling out.

Graph with a matrix size of 32k × 32k and an element size is 4 bytes. The P implementation with IBRC has the latency starting from 40 seconds and increasing as you scale to more PEs. On the other hand, the latency of the P implementation with IBGDA starts from around 5 seconds and decreases as you add more PEs. This is similar to the latency of the PUT implementation of both IBRC and IBGDA, as well as the P implementation with NVLink.
Figure 6. Latency comparison of Jacobi implementations using P and PUT with IBRC and IBGDA

Figure 6 shows that IBGDA’s improvements to small message communication efficiency can help address these challenges. This chart shows the latency of 1,000 iterations of the Jacobi kernel for a strong scaling experiment where the number of PEs is increased while maintaining a fixed matrix size. With IBRC, the nvshmem_p version of Jacobi has more than 8x the latency of the nvshmem_put version.

On the other hand, both the nvshmem_p and nvshmem_put versions scale with IBGDA and match the efficiency of nvshmem_p on NVLink. The IBGDA nvshmem_p version matches the latency of nvshmem_put with IBRC.

Results show that nvshmem_p with IBGDA has a slightly higher latency compared with nvshmem_put. This is because sending one large message incurs lower network overhead compared with sending many small messages.

While such overheads are fundamental to networking, IBGDA can enable applications to hide them by submitting many small message transfer requests to the NIC in parallel.

All-to-all latency case study

Graph shows that for message sizes less than 8 KiB, the all-to-all latency of IBRC fluctuates between 128 us to 256 us, while that of IBGDA is consistent around 64 us.
Figure 7. Latency comparison of 32 PE All-to-All transfer between IBRC and IBGDA

Figure 7 shows the latency of the NVSHMEM all-to-all collective operation for IBRC and IBGDA, highlighting the small message performance benefits of IBGDA.

With IBRC, the proxy thread is a serialization point for all operations coming from the device. The proxy thread processes requests in batches to reduce overheads. However, depending on when operations are submitted to the device, there is a possibility that operations submitted at nearly the same time on the device will be processed by a separate loop of the proxy thread.

The serialization of operations by the proxy creates additional latency and masks the internal parallelism of both the NIC and GPU. The IBGDA results show an overall latency that is more consistent, especially for message sizes lower than 16 KiB.

Magnum IO NVSHMEM improves network performance

In this blog, we’ve shown how Magnum IO improves small message network performance, especially for large applications deployed across hundreds or thousands of nodes in HPC data centers.  NVSHMEM 2.6.0 introduced InfiniBand GPUDirect Async, which enables the GPU’s SM to submit communication requests directly to the NIC, bypassing the CPU for network communication on NVIDIA InfiniBand networks.

In comparison with the proxy method for managing communication, IBGDA can sustain significantly higher throughput rates at much smaller message sizes. These performance improvements are especially critical to applications that require strong scaling and tend to have message sizes that shrink as the workload is scaled to larger numbers of GPUs.

IBGDA also closes the small message throughput gap between NVLink and network communication, making it easier for you to optimize code to both scale-up and scale-out on today’s GPU-accelerated HPC systems.

Discuss (2)