NVIDIA Collective Communications Library (NCCL)
The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and networking.
How NCCL Works
NCCL provides routines such as all-gather, all-reduce, broadcast, reduce, reduce-scatter, and point-to-point send and receive. These routines are optimized to achieve high bandwidth and low latency over PCIe, NVIDIA NVLink™, and other high-speed interconnects within a node, and over NVIDIA networking across nodes.
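To make the semantics of the collectives named above concrete, here is a minimal host-only sketch in plain C. It does not call NCCL: the simulated "ranks" are rows of an ordinary array, and the function names (`sim_all_reduce_sum` and so on) and the sizes `NRANKS` and `COUNT` are hypothetical, chosen only for illustration. It shows what each rank's buffer should contain after the operation, not how NCCL moves the data.

```c
#include <assert.h>
#include <stddef.h>

#define NRANKS 4  /* hypothetical number of simulated GPUs (ranks) */
#define COUNT  4  /* elements per rank; chosen to be divisible by NRANKS */

/* All-reduce (sum): every rank ends with the element-wise sum of all inputs. */
void sim_all_reduce_sum(float in[NRANKS][COUNT], float out[NRANKS][COUNT]) {
    for (size_t i = 0; i < COUNT; ++i) {
        float sum = 0.0f;
        for (size_t r = 0; r < NRANKS; ++r) sum += in[r][i];
        for (size_t r = 0; r < NRANKS; ++r) out[r][i] = sum;
    }
}

/* Broadcast: every rank ends with a copy of the root rank's buffer. */
void sim_broadcast(float in[NRANKS][COUNT], float out[NRANKS][COUNT], size_t root) {
    assert(root < NRANKS);
    for (size_t r = 0; r < NRANKS; ++r)
        for (size_t i = 0; i < COUNT; ++i) out[r][i] = in[root][i];
}

/* Reduce-scatter (sum): the reduced result is split into NRANKS equal
 * chunks and rank r keeps only chunk r. */
void sim_reduce_scatter_sum(float in[NRANKS][COUNT],
                            float out[NRANKS][COUNT / NRANKS]) {
    const size_t chunk = COUNT / NRANKS;
    for (size_t r = 0; r < NRANKS; ++r) {
        for (size_t i = 0; i < chunk; ++i) {
            float sum = 0.0f;
            for (size_t src = 0; src < NRANKS; ++src) sum += in[src][r * chunk + i];
            out[r][i] = sum;
        }
    }
}

/* All-gather: every rank ends with all ranks' inputs, concatenated in rank order. */
void sim_all_gather(float in[NRANKS][COUNT], float out[NRANKS][NRANKS * COUNT]) {
    for (size_t r = 0; r < NRANKS; ++r)
        for (size_t src = 0; src < NRANKS; ++src)
            for (size_t i = 0; i < COUNT; ++i) out[r][src * COUNT + i] = in[src][i];
}
```

If rank r fills its buffer with the value r + 1, every element of every rank's all-reduce output is 1 + 2 + 3 + 4 = 10. In this sketch, composing `sim_reduce_scatter_sum` with an all-gather of the resulting chunks yields the same result as `sim_all_reduce_sum`.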
With its single-kernel implementation of communication and computation, NCCL ensures low-latency synchronization, making it ideal for both distributed training and real-time inference scenarios. Developers can scale across nodes without tuning for specific hardware configurations, thanks to NCCL's dynamic topology detection and streamlined C-based API.
NCCL can be built and installed from source on GitHub. NCCL is also available for download as part of the NVIDIA HPC SDK and as binaries from the NVIDIA Developer Zone.
Performance
NCCL removes the need for developers to optimize their applications for specific machines. NCCL provides fast collectives over multiple GPUs both within and across nodes.
Ease of Programming
NCCL uses a simple C API that can be easily accessed from a variety of programming languages. NCCL closely follows the popular collectives API defined by the Message Passing Interface (MPI).
Compatibility
NCCL is compatible with any multi-GPU parallelization model, including single-threaded, multi-threaded (using one thread per GPU), and multi-process (MPI combined with multi-threaded operation on GPUs).
Key Features
High-performance collective and point-to-point communication for faster multi-GPU and multi-node training
Device APIs that enable communication directly from CUDA kernels, unlocking lower latency and better compute–communication overlap
Automatic topology detection across PCIe, NVLink™, NVSwitch™, InfiniBand, RoCE, and other networks to maximize performance
Advanced graph search algorithms that build the most efficient rings and trees for peak bandwidth and minimal latency
Flexible plugin framework that extends NCCL to custom transports and next-generation interconnects
Full support for multi-threaded, multi-process, and MPI-driven distributed applications
Integrated profiling, reliability, and observability tools like NCCL RAS and NCCL Inspector to accelerate debugging and performance tuning
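The feature list above mentions rings as one of the communication patterns NCCL builds. As a rough illustration of the idea, the sketch below simulates the textbook ring all-reduce in plain host C: a reduce-scatter phase followed by an all-gather phase, each taking NRANKS - 1 steps in which every rank passes one chunk to its right-hand neighbor. This is the generic algorithm, not NCCL's implementation. The function name `ring_all_reduce_sum` and the sizes `NRANKS` and `CHUNK` are hypothetical, and no GPU or NCCL call is involved.

```c
#include <assert.h>
#include <stddef.h>

#define NRANKS 4                  /* hypothetical number of simulated ranks */
#define CHUNK  2                  /* elements per chunk */
#define COUNT  (NRANKS * CHUNK)   /* each buffer is split into NRANKS chunks */

/* Non-negative modulo, so rank and chunk indices wrap around the ring. */
static size_t wrap(long i) {
    return (size_t)(((i % NRANKS) + NRANKS) % NRANKS);
}

/* In-place ring all-reduce (sum) over simulated ranks.
 * Within a step, the chunk a rank receives always differs from the chunk
 * it sends, so the sequential loop over ranks matches simultaneous sends. */
void ring_all_reduce_sum(float buf[NRANKS][COUNT]) {
    assert(buf != NULL);

    /* Phase 1, reduce-scatter: in step s, rank r sends chunk (r - s) to
     * rank r + 1, which adds it to its own copy. Afterwards rank r holds
     * the fully reduced chunk (r + 1). */
    for (long step = 0; step < NRANKS - 1; ++step) {
        for (long r = 0; r < NRANKS; ++r) {
            size_t dst = wrap(r + 1);
            size_t c = wrap(r - step);
            for (size_t i = 0; i < CHUNK; ++i)
                buf[dst][c * CHUNK + i] += buf[r][c * CHUNK + i];
        }
    }

    /* Phase 2, all-gather: in step s, rank r forwards the finished chunk
     * (r + 1 - s) to rank r + 1, which overwrites its partial copy. */
    for (long step = 0; step < NRANKS - 1; ++step) {
        for (long r = 0; r < NRANKS; ++r) {
            size_t dst = wrap(r + 1);
            size_t c = wrap(r + 1 - step);
            for (size_t i = 0; i < CHUNK; ++i)
                buf[dst][c * CHUNK + i] = buf[r][c * CHUNK + i];
        }
    }
}
```

In this scheme each rank sends 2 × (NRANKS - 1) chunks in total, so the data sent per rank stays close to twice the buffer size however many ranks join the ring. That property is why ring schedules suit large messages, while tree schedules need fewer steps and tend to suit small, latency-bound ones.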