After clicking “Watch Now” you will be prompted to login or join.
Fast Distributed Joins with RAPIDS and UCX
Nikolay Sakharnykh, NVIDIA | Hao Gao, University of Illinois Urbana-Champaign
GTC 2020
There are numerous optimized single-GPU join implementations (such as RAPIDS cuDF), but scaling out to multiple GPUs across multiple nodes is challenging. The repartitioned join approach is one of the most popular distributed join algorithms, featuring all-to-all exchange as the main communication pattern. We'll show how to leverage UCX for efficient all-to-all implementation and demonstrate various optimization strategies, such as reusing communication buffers to speed up GPU-to-GPU transfers and overlapping compute with communications. The implementation is designed to reuse RAPIDS components for single-GPU, and scales to NVLINK and Infiniband systems. Our latest performance results demonstrate that a single DGX-2 can achieve 220 GB/s throughput for joining 8B/8B key-value pairs, while 18 DGX-1V nodes (144 GPUs) connected over IB achieve 503 GB/s, which is comparable with 244 CPU nodes (2K cores) in the best-known distributed CPU implementation.