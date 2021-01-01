cudaMemcpy uses the GPU DMA engines to move data between CPU and GPU memory. Triggering the DMA engines adds latency overhead, which lowers performance for small data sizes. GDRCopy instead lets the CPU access GPU memory directly through BAR mappings, enabling low-latency copies between GPU and CPU memory.

The benchmark was run on an NVIDIA DGX-1V machine with CUDA 10.1 and GPU driver 418. The process was pinned to the CPU core that had affinity with the selected GPU. Pinned host memory was used as the source for the H2D copy latency benchmark and as the destination for the D2H copy latency benchmark.
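
The CPU pinning described above can be reproduced on Linux with `sched_setaffinity`. The following is a minimal sketch, not the code used for the benchmark: the helper names `pin_to_core` and `first_allowed_cpu` are hypothetical, and the choice of core is an assumption left to the caller. On a real system the core should be one that shares affinity with the selected GPU, which `nvidia-smi topo -m` reports in its CPU affinity column.

```c
#define _GNU_SOURCE /* required for sched_setaffinity, sched_getcpu and the CPU_* macros */
#include <sched.h>

/* Hypothetical helper: return the lowest-numbered CPU the calling
 * thread is currently allowed to run on, or -1 on error. */
int first_allowed_cpu(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &mask))
            return cpu;
    }
    return -1;
}

/* Hypothetical helper: restrict the calling thread to a single core.
 * Returns 0 on success and -1 on error (including an out-of-range core). */
int pin_to_core(int core)
{
    if (core < 0 || core >= CPU_SETSIZE)
        return -1;

    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);

    /* A pid of 0 applies the mask to the calling thread. */
    return sched_setaffinity(0, sizeof(mask), &mask);
}
```

A benchmark would call `pin_to_core` once at startup, before allocating pinned host memory or timing any copies, so that every measured copy is issued from the same core. `first_allowed_cpu` is only a fallback for picking a valid core when the GPU affinity is not known.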