NVIDIA DOCA GPUNetIO is a library within the NVIDIA DOCA SDK, specifically designed for real-time inline GPU packet processing. It combines technologies like GPUDirect RDMA and GPUDirect Async to enable the creation of GPU-centric applications where a CUDA kernel can directly communicate with the network interface card (NIC) for sending and receiving packets, bypassing the CPU and excluding it from the critical path.
The core principles and uses of DOCA GPUNetIO have been discussed in the previous posts, Inline GPU Packet Processing with NVIDIA DOCA GPUNetIO and Realizing the Power of Real-Time Network Processing with NVIDIA DOCA GPUNetIO, as well as in the DOCA GPUNetIO programming guide.
Previously, DOCA GPUNetIO, alongside DOCA Ethernet and DOCA Flow, was limited to handling packet transmissions over the Ethernet transport layer. With the introduction of DOCA 2.7, an expanded set of APIs now enables DOCA GPUNetIO to support RDMA communications directly from a GPU CUDA kernel using RoCE or InfiniBand transport layers.
This post explores the new Remote Direct Memory Access (RDMA) functionalities controlled by a GPU CUDA kernel with DOCA GPUNetIO and presents a performance comparison with the performance test (perftest) microbenchmarks.
Note that the RDMA acronym describes the protocol to enable remote direct memory access from the memory of one computer into that of another without involving the operating system of either one. Examples of operations include RDMA Write and RDMA Read. It must not be confused with GPUDirect RDMA, which is not related to the RDMA protocol. GPUDirect RDMA is one of the technologies enabled by NVIDIA in the GPUDirect family of technologies. It enables the network card to send or receive data directly accessing the GPU memory bypassing the CPU memory copies and Operating System routines. GPUDirect RDMA can be enabled by any network framework working with Ethernet, InfiniBand, or RoCE.
RDMA GPU data path with GPUNetIO
RDMA provides direct access between the main memory of two hosts without involving the operating system, cache, or storage. This enables data transfers with high throughput, low latency, and low CPU utilization. This is accomplished by registering and sharing a local memory area with the remote host (or peer) so the remote host knows how to access it.
An application where two peers need to exchange data over RDMA typically follows three fundamental steps:
- Step 1 – Local configuration: Each peer creates locally the RDMA queue(s) and memory buffers to share with the other peer.
- Step 2 – Exchange info: Using an out-of-band (OOB) mechanism (for example, Linux sockets) peers exchange info about RDMA queue(s) and memory buffers to be accessed remotely.
- Step 3 – Data path: The two peers execute an RDMA Read/Write/Send/Recv to exchange data using remote memory addresses.
The DOCA RDMA library enables RDMA communications over InfiniBand or RoCE following the three steps listed above, all executed with the CPU. With the introduction of the new GPUNetIO RDMA functions, the application can manage the data path of the RDMA application on the GPU executing Step 3 with a CUDA kernel instead of CPU, while Steps 1 and 2 remain the same, as they are not relevant for the GPU data path.
The benefits of moving the RDMA data path onto the GPU are the same as in the Ethernet use case. In network applications where the data processing happens on the GPU, offloading the network communications from the CPU to the GPU enables it to be the main controller of the application, removing the extra latency needed to interact with the CPU to know when data is ready and where it’s located. This also frees the CPU cycles. Additionally, multiple RDMA queues can be managed in parallel by the GPU at the same time. For example, each CUDA block can post RDMA operations on a different RDMA queue.
IB Verbs and DOCA GPUNetIO perftests
In DOCA 2.7, a new DOCA GPUNetIO RDMA client-server code sample has been introduced to show the usage of the new API and evaluate correctness. This post analyzes the performance comparison between GPUNetIO RDMA functions and IB Verbs RDMA functions reproducing one of the microbenchmarks in the well-known perftest suite.
In a nutshell, perftest is a collection of microbenchmarks to measure RDMA bandwidth (BW) and latency between two peers (server and client) using basic RDMA operations. Although the network control part happens in the CPU, it’s possible to specify whether data resides in the GPU memory by enabling GPUDirect RDMA with the --use_cuda flag.
As a generic overview, the RDMA write unidirectional BW benchmark (namely ib_write_bw) posts, on every RDMA queue, a list of write requests for messages of the same size for a fixed number of iterations and commands the NIC to execute the posted writes. This is the so-called “ring the doorbell” procedure. To ensure all the writes have been issued, before moving to the next iteration, it polls the completion queue waiting for the confirmation that every write has been executed correctly. For every message size, it is then possible to retrieve the total time spent to post and poll and calculate the BW in MB/s.
Figure 1 shows the IB Verbs ib_write_bw perftest main loop. At each iteration, the CPU posts a list of RDMA write requests, commands the NIC to execute them (ringing the doorbell) and waits for the completion before moving to the next iteration. With the CUDA flag enabled, packets to write are fetched locally from GPU memory instead of CPU memory.

ib_write_bw perftest main loopThe experiment was to reproduce the ib_write_bw microbenchmark with the DOCA library, using DOCA RDMA for the control path on the CPU to establish the client-server connection and DOCA GPUNetIO RDMA for the data path, posting the writes within a CUDA kernel. The comparison is not apples-to-apples, as perftest uses GPUDirect RDMA to transfer data but network communications are controlled by the CPU, whereas DOCA GPUNetIO uses both GPUDirect RDMA and GPUDirect Async to control network communications and data transfers from the GPU. The goal is to prove DOCA GPUNetIO RDMA performance is comparable with IB Verbs perftest, which is considered the baseline.
To reproduce the ib_write_bw data path and measure the time taken to post the RDMA write operations for each message size, the CPU records a CUDA event, launches the rdma_write_bw CUDA kernel, and then records a second CUDA event. This should give a good approximation of the time in milliseconds taken by the CUDA kernel to post the RDMA writes using DOCA GPUNetIO functions, as shown in Code Snippet 1 below.
Int msg_sizes[MAX_MSG] = {....};
for (int msg_idx = 0; msg_idx < MAX_MSG; msg_idx++) {
         do_warmup();
         cuEventRecord(start_event, stream);
         rdma_write_bw<<<num_queue, msg_per_size, 0, stream>>>(msg_sizes[msg_idx], …);
         cuEventRecord(end_event, stream);
         cuEventSynchronize(end_event);
         cuEventElapsedTime(&total_ms, start_event, end_event);
         calculate_result(total_ms, msg_sizes[msg_idx], …)
}
In Code Snippet 2 below, the CUDA kernel rdma_write_bw, for a given number of iterations, posts in parallel a sequence of RDMA writes using DOCA GPUNetIO device functions following the weak pattern. Each CUDA thread in the CUDA block posts a write operation.
__global__ void rdma_write_bw(struct doca_gpu_dev_rdma *rdma_gpu,
               const int num_iter, const size_t msg_size,
               const struct doca_gpu_buf_arr *server_local_buf_arr,
               const struct doca_gpu_buf_arr *server_remote_buf_arr)
{
   struct doca_gpu_buf *remote_buf;
   struct doca_gpu_buf *local_buf;
   uint32_t curr_position;
   uint32_t mask_max_position;
   doca_gpu_dev_buf_get_buf(server_local_buf_arr, threadIdx.x, &local_buf);
   doca_gpu_dev_buf_get_buf(server_remote_buf_arr, threadIdx.x, &remote_buf);
   for (int iter_idx = 0; iter_idx < num_iter; iter_idx++) {
       doca_gpu_dev_rdma_get_info(rdma_gpu, &curr_position, &mask_max_position);
       doca_gpu_dev_rdma_write_weak(rdma_gpu,
                   remote_buf, 0,
                   local_buf, 0,
                   msg_size, 0,
                   DOCA_GPU_RDMA_WRITE_FLAG_NONE,
                   (curr_position + threadIdx.x) & mask_max_position);
       /* Wait all CUDA threads to post their RDMA Write */
       __syncthreads();
       if (threadIdx.x == 0) {
           /* Only 1 CUDA thread can commit the writes in the queue to execute them */
           doca_gpu_dev_rdma_commit_weak(rdma_gpu, blockDim.x);
                /* Only 1 CUDA thread can flush the RDMA queue waiting for the actual execution of the writes */
       doca_gpu_dev_rdma_flush(rdma_gpu);
       }
       __syncthreads();
   }
   return;
}
Figure 2 depicts Code Snippet 2. At every iteration, the GPU CUDA kernel posts in parallel a list of RDMA Write requests (one per CUDA thread in the CUDA block). After synchronizing all CUDA threads, only thread 0 commands the NIC to execute (commit) the writes and waits for the completion (flush the queue) before moving to the next iteration.

To compare the performance, the same parameters have been set for IB Verbs perftest and DOCA GPUNetIO perftest: 1 RDMA queue, 2,048 iterations, 512 RDMA writes per iteration, and testing message sizes from 64 to 4,096 bytes.
Benchmarks over RoCE have been executed on two systems with different PCIe topologies:
- System 1: HPE ProLiant DL380 Gen11 system with an NVIDIA GPU L40S and a BlueField-3 card in NIC mode, Intel Xeon Silver 4410Y CPU. GPU and NIC are attached to two different PCIe slots on the same NUMA node (no dedicated PCIe switch)
- System 2: Dell R750 system with an NVIDIA H100 GPU and ConnectX-7 network card, Intel Xeon Silver 4314 CPU. GPU and NIC are attached to two different PCIe slots on different NUMA nodes (worst scenario for GPUDirect applications).
As shown by the following figures, the peak BW (reported in MB/s) reached by the two implementations of perftest are totally comparable on both systems (Figure 3 and Figure 4). Specifically on Figure 3, the DOCA GPUNetIO perftest BW is better than the DOCA GPUNetIO perftest BW reported in Figure 4 due to the different topology on the system which not only affects the data movements from GPU memory to the network (GPUDirect RDMA) but also the internal communication between the GPU and the NIC control the RDMA communications (GPUDirect Async). Due to the nature of the different logic in the code, time (and thus the BW) is measured with different methodologies: the IB Verbs perftest uses the system clock, whereas the DOCA GPUNetIO perftest relies on CUDA events, which may have a different internal time measurement overhead.


Note that an application like perftest is not the best tool to show the benefit for GPU utilization, as the amount of achievable parallelization is quite low. DOCA GPUNetIO perftest RDMA writes are posted in parallel (512 writes, each by a different CUDA thread) in the queue, but the time taken by posting is trivial (~4 microseconds). The majority of perftest time is taken by the NIC actually executing the RDMA writes, sending data over the network, and returning positive feedback.
This experiment can be considered a success, as it proves that using the DOCA GPUNetIO RDMA API doesn’t introduce any relevant overhead in comparison to using regular IB Verbs and performance targets are met running the same type of workload and workflow. ISV developers and end users can use DOCA GPUNetIO RDMA, gaining the benefits of GPUDirect async technology offloading to the GPU the control of the communication.
This architectural choice offers the following benefits:
- A more scalable application, capable of managing at the same time multiple RDMA queues in parallel (typically one queue per CUDA block).
- Ability to take advantage of the high degree of parallelism offered by the GPU having several CUDA threads working in parallel on different data and posting RDMA operations on the same queue at the lowest latency possible.
- Lower CPU utilization, making the solution platform-independent (different CPU architectures don’t result in significant performance variations).
- Fewer internal bus transactions (PCIe, for example), as there is no need to synchronize the work on the GPU with the CPU activity. The CPU is no longer responsible for sending or receiving data the GPU must process.
Ready to dive deeper into DOCA GPUNetIO? Check out the official DOCA GPUNetIO Programmer’s Guide for comprehensive insights and technical details.
To learn more about the power of DOCA GPUNetIO, see Inline GPU Packet Processing with NVIDIA DOCA GPUNetIO. For information about how DOCA GPUNetIO revolutionizes GPU communication, see Realizing the Power of Real-Time Network Processing with NVIDIA DOCA GPUNetIO.
 
         
           
     
     
     
     
     
     
     
     
    