The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node collective communication primitives that are performance-optimized for NVIDIA GPUs. NCCL provides routines such as all-gather, all-reduce, broadcast, reduce, and reduce-scatter, all optimized to achieve high bandwidth and low latency over PCIe and NVLink high-speed interconnects.

NCCL conveniently removes the need for developers to optimize their applications for specific machines. NCCL provides fast collectives over multiple GPUs both within and across nodes.

Ease of Programming

NCCL uses a simple C API, which can be easily accessed from a variety of programming languages. NCCL closely follows the popular collectives API defined by MPI (Message Passing Interface).


NCCL is compatible with virtually any multi-GPU parallelization model, such as single-threaded, multi-threaded (one thread per GPU), and multi-process (e.g., MPI combined with multi-threaded operation on GPUs).
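To make the single-threaded model concrete, the sketch below drives an all-reduce across all local GPUs from one process using the public NCCL C API (`ncclCommInitAll`, `ncclAllReduce`, `ncclGroupStart`/`ncclGroupEnd`). It is an illustrative sketch, not a complete program: it requires CUDA-capable GPUs and the NCCL library to run, and error checking of each `ncclResult_t` and `cudaError_t` is elided for brevity. The buffer size `count` is an arbitrary example value.

```c
/* Sketch: single-process, multi-GPU all-reduce with NCCL.
 * Error handling elided; real code should check every return value. */
#include <nccl.h>
#include <cuda_runtime.h>

#define MAX_DEV 8

int main(void) {
  int nDev = 0;
  cudaGetDeviceCount(&nDev);
  if (nDev > MAX_DEV) nDev = MAX_DEV;

  ncclComm_t comms[MAX_DEV];
  cudaStream_t streams[MAX_DEV];
  float *sendbuff[MAX_DEV], *recvbuff[MAX_DEV];
  size_t count = 1 << 20; /* example: elements per GPU */

  /* One communicator per local GPU, created in a single call. */
  ncclCommInitAll(comms, nDev, NULL);

  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(i);
    cudaMalloc((void**)&sendbuff[i], count * sizeof(float));
    cudaMalloc((void**)&recvbuff[i], count * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  /* Group the per-GPU calls so NCCL can launch them together
   * without deadlocking when one thread drives all devices. */
  ncclGroupStart();
  for (int i = 0; i < nDev; ++i)
    ncclAllReduce(sendbuff[i], recvbuff[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();

  /* Wait for completion on each device's stream. */
  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
  }

  for (int i = 0; i < nDev; ++i) {
    cudaFree(sendbuff[i]);
    cudaFree(recvbuff[i]);
    ncclCommDestroy(comms[i]);
  }
  return 0;
}
```

In a multi-process (MPI) setup, each rank would instead create its own communicator with `ncclCommInitRank`, using a `ncclUniqueId` broadcast from rank 0.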

Key Features

  • Automatic topology detection for high-bandwidth paths on AMD, Arm, PCIe Gen4, and InfiniBand HDR systems
  • Up to 2x peak bandwidth with in-network all-reduce operations using SHARPv2
  • Graph search for the optimal set of rings and trees with the highest bandwidth and lowest latency
  • Support for multi-threaded and multi-process applications
  • Internode communication over InfiniBand verbs, libfabric, RoCE, and IP sockets
  • InfiniBand adaptive routing to reroute traffic and alleviate congested ports

What's New in NCCL 2.7

NCCL 2.7 highlights include:

  • Up to 2x bandwidth on NVIDIA A100 GPUs compared to V100 GPUs
  • Preview of point-to-point communication capability to enable model parallel training for models like Wide & Deep and DLRM
  • Compatible with CUDA 11
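The point-to-point preview adds `ncclSend` and `ncclRecv`, which follow the same stream-ordered pattern as the collectives. The fragment below is a hedged sketch of a ring exchange, where each rank sends to its successor and receives from its predecessor; it assumes `comm`, `stream`, `rank`, `nranks`, and the device buffers `sendbuff`/`recvbuff` already exist, and omits error checking.

```c
/* Sketch: ring exchange with the point-to-point primitives
 * previewed in NCCL 2.7. Grouping the send and receive lets
 * NCCL progress both without deadlock. */
ncclGroupStart();
ncclSend(sendbuff, count, ncclFloat, (rank + 1) % nranks, comm, stream);
ncclRecv(recvbuff, count, ncclFloat, (rank - 1 + nranks) % nranks, comm, stream);
ncclGroupEnd();
```

Pairs of such calls are the building block for the model-parallel communication patterns mentioned above, such as pipelined activations in Wide & Deep or DLRM training.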

Read the latest NCCL release notes for a detailed list of new features and enhancements.


Ready to get started developing with NCCL?

Get Started