GTC 2020: Fast Distributed Joins with RAPIDS and UCX
After clicking “Watch Now” you will be prompted to login or join.
Click “Watch Now” to login or join the NVIDIA Developer Program.
Fast Distributed Joins with RAPIDS and UCX
Nikolay Sakharnykh, NVIDIA | Hao Gao, University of Illinois Urbana-Champaign
There are numerous optimized single-GPU join implementations (such as RAPIDS cuDF), but scaling out to multiple GPUs across multiple nodes is challenging. The repartitioned join approach is one of the most popular distributed join algorithms, featuring all-to-all exchange as the main communication pattern. We'll show how to leverage UCX for efficient all-to-all implementation and demonstrate various optimization strategies, such as reusing communication buffers to speed up GPU-to-GPU transfers and overlapping compute with communications. The implementation is designed to reuse RAPIDS components for single-GPU, and scales to NVLINK and Infiniband systems. Our latest performance results demonstrate that a single DGX-2 can achieve 220 GB/s throughput for joining 8B/8B key-value pairs, while 18 DGX-1V nodes (144 GPUs) connected over IB achieve 503 GB/s, which is comparable with 244 CPU nodes (2K cores) in the best-known distributed CPU implementation.