The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and networking. NCCL provides collective routines such as all-gather, all-reduce, broadcast, reduce, and reduce-scatter, as well as point-to-point send and receive. These routines are optimized for high bandwidth and low latency over PCIe and NVLink high-speed interconnects within a node and over NVIDIA Mellanox networking across nodes.
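To make the collectives named above concrete, the following is a minimal pure-Python sketch of what each one computes, using summation as the reduction. It is not NCCL code and does not touch a GPU: each "rank" is just a Python list, and the function names (`all_reduce`, `broadcast`, `reduce`, `all_gather`, `reduce_scatter`) are hypothetical helpers chosen for illustration.

```python
from typing import List

# buffers[r] is the data held by rank r; all ranks hold equal-length lists.
Buffers = List[List[float]]


def _elementwise_sum(buffers: Buffers) -> List[float]:
    """Sum the buffers element by element across ranks."""
    return [sum(values) for values in zip(*buffers)]


def all_reduce(buffers: Buffers) -> Buffers:
    """Every rank ends up with the element-wise sum over all ranks."""
    total = _elementwise_sum(buffers)
    return [list(total) for _ in buffers]


def broadcast(buffers: Buffers, root: int) -> Buffers:
    """Every rank ends up with a copy of the root rank's buffer."""
    return [list(buffers[root]) for _ in buffers]


def reduce(buffers: Buffers, root: int) -> Buffers:
    """Only the root receives the sum; other ranks keep their own data."""
    total = _elementwise_sum(buffers)
    return [total if r == root else list(buf) for r, buf in enumerate(buffers)]


def all_gather(buffers: Buffers) -> Buffers:
    """Every rank ends up with all buffers concatenated in rank order."""
    gathered = [x for buf in buffers for x in buf]
    return [list(gathered) for _ in buffers]


def reduce_scatter(buffers: Buffers) -> Buffers:
    """The sum is split into equal chunks; rank r receives chunk r."""
    n = len(buffers)
    total = _elementwise_sum(buffers)
    chunk = len(total) // n  # assumes the length divides evenly by n
    return [total[r * chunk:(r + 1) * chunk] for r in range(n)]


if __name__ == "__main__":
    ranks = [[1, 2], [3, 4]]
    print("all_reduce    ", all_reduce(ranks))
    print("all_gather    ", all_gather(ranks))
    print("reduce_scatter", reduce_scatter(ranks))
```

In this model an all-reduce is equivalent to a reduce-scatter followed by an all-gather, which is the decomposition that ring-based implementations typically exploit.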

Leading deep learning frameworks such as Caffe2, Chainer, MXNet, PyTorch and TensorFlow have integrated NCCL to accelerate deep learning training on multi-GPU multi-node systems.

NCCL is available for download as part of the NVIDIA HPC SDK and as a separate package for Ubuntu and Red Hat.


NCCL removes the need for developers to optimize their applications for specific machines by providing fast collectives over multiple GPUs, both within and across nodes.

Ease of Programming

NCCL uses a simple C API, which can be easily accessed from a variety of programming languages. NCCL closely follows the popular collectives API defined by MPI (Message Passing Interface).


NCCL is compatible with virtually any multi-GPU parallelization model, including single-threaded, multi-threaded (using one thread per GPU), and multi-process (MPI combined with multi-threaded operation on GPUs).

Key Features

  • Automatic topology detection for high bandwidth paths on AMD, ARM, PCIe Gen4 and InfiniBand (IB) HDR
  • Up to 2x peak bandwidth with in-network all-reduce operations utilizing SHARPv2
  • Graph search for the optimal set of rings and trees with the highest bandwidth and lowest latency
  • Support for multi-threaded and multi-process applications
  • InfiniBand verbs, libfabric, RoCE and IP Socket internode communication
  • Reroute traffic and alleviate congested ports with InfiniBand Adaptive routing
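
The feature list mentions searching for rings and trees. As an illustration of why a ring is a useful shape for all-reduce, here is a minimal pure-Python simulation of the textbook ring all-reduce (a reduce-scatter phase followed by an all-gather phase). This is a sketch of the general technique, not NCCL's actual implementation; the function name `ring_all_reduce` is hypothetical, and it assumes the buffer length is divisible by the number of ranks.

```python
from typing import List


def ring_all_reduce(buffers: List[List[float]]) -> List[List[float]]:
    """Simulate a sum all-reduce where rank r only ever sends to rank r+1."""
    n = len(buffers)
    length = len(buffers[0])
    if any(len(b) != length for b in buffers) or length % n != 0:
        raise ValueError("buffers must be equal length and divisible by rank count")
    chunk = length // n
    data = [list(b) for b in buffers]  # work on copies, leave inputs untouched

    def span(c: int) -> slice:
        """Index range of chunk c within a buffer."""
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. At each step, rank r sends chunk (r - step) to
    # its right neighbor, which adds it into its own copy of that chunk.
    # After n-1 steps, rank r holds the fully reduced chunk (r + 1) mod n.
    for step in range(n - 1):
        # Snapshot all outgoing messages first so the sends are simultaneous.
        sends = [(r, (r - step) % n) for r in range(n)]
        payloads = [data[r][span(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % n
            data[dst][span(c)] = [a + b for a, b in zip(data[dst][span(c)], payload)]

    # Phase 2: all-gather. Each rank forwards the reduced chunk it most
    # recently completed or received; the neighbor overwrites its copy.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n) for r in range(n)]
        payloads = [data[r][span(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            data[(r + 1) % n][span(c)] = payload

    return data


if __name__ == "__main__":
    ranks = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    print(ring_all_reduce(ranks))
```

In this scheme each rank sends 2(n-1) chunks of size 1/n of the buffer, so the per-rank traffic stays close to twice the buffer size regardless of how many ranks participate. That property is why ring layouts favor bandwidth, while tree layouts are generally preferred when latency dominates.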

