Magnum IO GDRCopy

Enable faster memory transfers between CPU and GPU with GDRCopy


GDRCopy is a low-latency GPU memory copy library based on GPUDirect RDMA technology that allows the CPU to directly map and access GPU memory. GDRCopy also provides optimized copy APIs and is widely used in high-performance communication runtimes such as UCX, OpenMPI, MVAPICH, and NVSHMEM.


[Figure: cudaMemcpy vs. GDRCopy]

cudaMemcpy uses the GPU DMA engines to move data between CPU and GPU memory. Triggering the DMA engines adds latency overhead, which lowers performance for small data sizes. GDRCopy instead lets the CPU access GPU memory directly through BAR mappings, enabling low-latency copies between GPU and CPU memory.



[Figure: H2D and D2H copy performance]

The benchmark was run on an NVIDIA DGX-1V machine with CUDA 10.1 and GPU driver 418. The process was pinned to a CPU core with affinity to the selected GPU. Pinned host memory served as the source for the H2D copy latency benchmark and as the destination for the D2H copy latency benchmark.




[Figure: UCX latency performance]

Results are from the osu_latency benchmark on two DGX-1V machines with CUDA 10.1 and GPU driver 418. The application processes were also pinned to CPU cores with affinity to the selected GPU. UCX picks between GDRCopy and cudaMemcpy to provide the best performance for all data sizes.
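The idea of picking a copy path by data size can be illustrated with a minimal C sketch. This is not UCX's actual selection logic: the function name `choose_copy_path`, the enum, and the 4096-byte threshold are all hypothetical, and a real crossover point would have to be measured on the target platform.

```c
#include <stddef.h>

/* Copy paths a communication runtime could choose between. */
typedef enum {
    COPY_PATH_GDRCOPY,     /* CPU-driven copy through a BAR mapping */
    COPY_PATH_CUDAMEMCPY   /* copy offloaded to the GPU DMA engines */
} copy_path_t;

/* Hypothetical crossover size; measure the real value per platform. */
#define SMALL_COPY_THRESHOLD_BYTES ((size_t)4096)

/* Prefer the mapped copy for small transfers when GDRCopy is usable,
 * and fall back to cudaMemcpy for everything else. */
copy_path_t choose_copy_path(size_t nbytes, int gdrcopy_available)
{
    if (gdrcopy_available && nbytes <= SMALL_COPY_THRESHOLD_BYTES)
        return COPY_PATH_GDRCOPY;
    return COPY_PATH_CUDAMEMCPY;
}
```

A caller would evaluate `choose_copy_path(nbytes, available)` once per transfer and dispatch to the matching copy routine. Keeping the threshold in one place makes it easy to tune after benchmarking.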



Key Features

Very low latency when transferring small amounts of data between host and device, e.g. around 1 µs vs. 7 µs with cudaMemcpy for host-to-device copies.

High host-to-device memory copy bandwidth, through write-combining (subject to NUMA effects) or cached mappings (on some POWER9-based platforms).


Caveats

GDRCopy APIs consume CPU resources, specifically CPU core cycles and hardware buffers, whereas cudaMemcpy may offload the copy to the GPU Copy Engines (CE).

GDRCopy requires an extra kernel-mode driver (KMD) to be installed and loaded on the target machine. This can add extra complexity, especially when deploying in container environments.


GDRCopy relies on GPUDirect RDMA, which is only available on Tesla and Quadro GPUs.
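Because the kernel-mode driver must be loaded before GDRCopy can be used, an application may want to check for it at startup, particularly inside a container. The C sketch below assumes the driver exposes a device node at `/dev/gdrdrv`; the helper name `device_node_usable` is hypothetical and is not part of the GDRCopy API.

```c
#include <unistd.h>

/* Assumed path of the device node created by the GDRCopy kernel module. */
#define GDRDRV_DEVICE_PATH "/dev/gdrdrv"

/* Return 1 if `path` exists and this process may read and write it,
 * 0 otherwise. Hypothetical helper, not a GDRCopy API call. */
int device_node_usable(const char *path)
{
    return access(path, R_OK | W_OK) == 0;
}
```

Calling `device_node_usable(GDRDRV_DEVICE_PATH)` before initializing the library lets the application fall back to cudaMemcpy with a clear message when the driver is missing or the device node has not been exposed to the container.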



Ready to start using GDRCopy?

GDRCopy is distributed as Linux packages and as an open-source library.

