NVIDIA Collective Communications Library (NCCL)

The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and networking.



How NCCL Works

NCCL provides routines such as all-gather, all-reduce, broadcast, reduce, reduce-scatter, and point-to-point send and receive. These routines are optimized to achieve high bandwidth and low latency over PCIe, NVIDIA NVLink™, and other high-speed interconnects within a node, and over NVIDIA networking across nodes.
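To make the semantics of these routines concrete, here is a minimal pure-Python reference model. It is only a sketch of what each collective computes, not how NCCL implements it and not NCCL's API: each "rank" is a plain list, the function names (`all_reduce`, `reduce_to_root`, and so on) are hypothetical, and the reduction defaults to element-wise sum.

```python
from functools import reduce
from operator import add


def _reduce_columns(bufs, op):
    # Element-wise reduction across ranks: position i of every buffer is combined.
    return [reduce(op, column) for column in zip(*bufs)]


def all_reduce(bufs, op=add):
    """Every rank ends with the element-wise reduction of all buffers."""
    result = _reduce_columns(bufs, op)
    return [list(result) for _ in bufs]


def broadcast(bufs, root=0):
    """Every rank ends with a copy of the root rank's buffer."""
    return [list(bufs[root]) for _ in bufs]


def reduce_to_root(bufs, root=0, op=add):
    """Only the root receives the reduction; other ranks keep their input."""
    result = _reduce_columns(bufs, op)
    return [result if rank == root else list(buf) for rank, buf in enumerate(bufs)]


def all_gather(bufs):
    """Every rank ends with all buffers concatenated in rank order."""
    gathered = [x for buf in bufs for x in buf]
    return [list(gathered) for _ in bufs]


def reduce_scatter(bufs, op=add):
    """The reduction is split into equal chunks; rank r keeps chunk r.

    Assumes the buffer length is divisible by the number of ranks.
    """
    n = len(bufs)
    result = _reduce_columns(bufs, op)
    chunk = len(result) // n
    return [result[r * chunk:(r + 1) * chunk] for r in range(n)]


if __name__ == "__main__":
    ranks = [[1, 2], [3, 4]]  # two ranks, two elements each
    print(all_reduce(ranks))
    print(reduce_scatter(ranks))
    print(all_gather(ranks))
```

In this model, a reduce-scatter followed by an all-gather of the resulting chunks produces the same result as an all-reduce.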

NCCL implements each collective in a single kernel that handles both communication and computation. This keeps synchronization latency low, which suits both distributed training and real-time inference. Developers can scale across nodes without tuning for specific hardware configurations, thanks to NCCL's dynamic topology detection and streamlined C-based API.

NCCL can be built and installed from source on GitHub. It is also available for download as part of the NVIDIA HPC SDK and as binaries on the NVIDIA developer zone.

Performance

NCCL removes the need for developers to optimize their applications for specific machines. NCCL provides fast collectives over multiple GPUs both within and across nodes.

Ease of Programming

NCCL uses a simple C API that can be easily accessed from a variety of programming languages. NCCL closely follows the popular collectives API defined by Message Passing Interface (MPI).

Compatibility

NCCL is compatible with any multi-GPU parallelization model, including single-threaded, multi-threaded (using one thread per GPU), and multi-process (MPI combined with multi-threaded operation on GPUs).


Key Features

  • High-performance collective and point-to-point communication for faster multi-GPU and multi-node training

  • Device APIs that enable communication directly from CUDA kernels, unlocking lower latency and better compute–communication overlap

  • Automatic topology detection across PCIe, NVLink™, NVSwitch™, InfiniBand, RoCE, and other networks to maximize performance

  • Advanced graph search algorithms that build the most efficient rings and trees for peak bandwidth and minimal latency

  • Flexible plugin framework that extends NCCL to custom transports and next-generation interconnects

  • Full support for multi-threaded, multi-process, and MPI-driven distributed applications

  • Integrated profiling, reliability, and observability tools like NCCL RAS and NCCL Inspector to accelerate debugging and performance tuning
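As an illustration of why rings are a useful building block for collectives, the sketch below simulates the textbook ring all-reduce in pure Python: a reduce-scatter phase followed by an all-gather phase, each taking N-1 steps in which every rank sends one chunk to its right-hand neighbor. This is not NCCL's implementation, which selects rings and trees from the detected topology. The function name `ring_all_reduce` is hypothetical, the reduction is fixed to sum, and the buffer length is assumed to be divisible by the number of ranks.

```python
def ring_all_reduce(bufs):
    """Simulate a sum all-reduce over a logical ring of len(bufs) ranks."""
    n = len(bufs)
    size = len(bufs[0])
    assert size % n == 0, "sketch assumes the buffer splits evenly into n chunks"
    chunk = size // n
    data = [list(b) for b in bufs]  # work on copies, one buffer per rank

    def span(c):
        # Slice covering chunk index c.
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. In step s, rank r sends chunk (r - s) % n to
    # rank (r + 1) % n, which adds it into its own copy of that chunk.
    for s in range(n - 1):
        # Snapshot all outgoing chunks first so sends within a step are simultaneous.
        sends = [(r, (r - s) % n, data[r][span((r - s) % n)]) for r in range(n)]
        for r, c, payload in sends:
            dst = (r + 1) % n
            data[dst][span(c)] = [a + b for a, b in zip(data[dst][span(c)], payload)]
    # Now rank r holds the fully reduced chunk (r + 1) % n.

    # Phase 2: all-gather. In step s, rank r forwards chunk (r + 1 - s) % n,
    # and the receiver overwrites its copy with the finished values.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n, data[r][span((r + 1 - s) % n)]) for r in range(n)]
        for r, c, payload in sends:
            data[(r + 1) % n][span(c)] = payload

    return data


if __name__ == "__main__":
    ranks = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    print(ring_all_reduce(ranks))
```

In this scheme each rank sends 2(N-1) chunks of size S/N for a buffer of size S, so the data sent per rank stays below 2S regardless of N. The number of steps grows linearly with N, which is one reason tree-shaped schedules can be attractive when latency dominates.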



More Resources

Submit a Bug, RFE, or Question

Join the NVIDIA Developer Program

Get started with NCCL today.

Download NCCL