NVIDIA Collective Communications Library (NCCL)

Multi-GPU and multi-node collective communication primitives

The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node collective communication primitives that are performance-optimized for NVIDIA GPUs. NCCL provides routines such as all-gather, all-reduce, broadcast, reduce, and reduce-scatter that are optimized to achieve high bandwidth over PCIe and NVLink high-speed interconnects.
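
To make the five collectives named above concrete, the following sketch models what each one computes, using plain Python lists in place of GPU buffers. It is not NCCL's API: the function names, the `Buffers` type, and the choice of summation as the reduction operator are assumptions made for illustration, and in this model ranks other than the root keep their input unchanged after `reduce`.

```python
from typing import List

# Illustrative model only: buffers[r] stands in for the data held by rank r.
Buffers = List[List[float]]


def _elementwise_sum(buffers: Buffers) -> List[float]:
    """Sum the buffers of all ranks element by element."""
    return [sum(values) for values in zip(*buffers)]


def broadcast(buffers: Buffers, root: int) -> Buffers:
    """Every rank ends up with a copy of the root's buffer."""
    return [list(buffers[root]) for _ in buffers]


def reduce(buffers: Buffers, root: int) -> Buffers:
    """Only the root receives the element-wise sum (others unchanged here)."""
    total = _elementwise_sum(buffers)
    return [total if r == root else list(buf) for r, buf in enumerate(buffers)]


def all_reduce(buffers: Buffers) -> Buffers:
    """Every rank receives the element-wise sum."""
    total = _elementwise_sum(buffers)
    return [list(total) for _ in buffers]


def all_gather(buffers: Buffers) -> Buffers:
    """Every rank receives the concatenation of all buffers, in rank order."""
    gathered = [x for buf in buffers for x in buf]
    return [list(gathered) for _ in buffers]


def reduce_scatter(buffers: Buffers) -> Buffers:
    """Rank r receives the r-th equal slice of the element-wise sum."""
    n = len(buffers)
    total = _elementwise_sum(buffers)
    chunk = len(total) // n
    return [total[r * chunk:(r + 1) * chunk] for r in range(n)]
```

For two ranks holding `[1, 2]` and `[3, 4]`, `all_reduce` leaves `[4, 6]` on both ranks, while `reduce_scatter` leaves `[4]` on rank 0 and `[6]` on rank 1.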

Developers of deep learning frameworks and HPC applications can rely on NCCL’s highly optimized, MPI-compatible, and topology-aware routines to take full advantage of all available GPUs within and across multiple nodes. This allows them to focus on developing new algorithms and software capabilities rather than on performance-tuning low-level communication collectives.

Leading deep learning frameworks such as Caffe2, Microsoft Cognitive Toolkit, MXNet, PyTorch and TensorFlow use NCCL to deliver near-linear scaling of deep learning training on multi-GPU systems. To download earlier (1.x) versions of NCCL, please visit NCCL’s GitHub page.


DOWNLOAD NCCL

What’s New in NCCL 2.2?

Deep learning frameworks using NCCL 2 or later can leverage the new features and performance of the Volta architecture to deliver high-performance, efficient multi-node, multi-GPU scaling of deep learning training. NCCL 2.2 highlights include:

  • Delivers faster multi-GPU training of deep neural networks, such as ResNet50 and other larger networks, with aggregated inter-GPU reduction operations.
NCCL 2.2 will be freely available in May for members of the NVIDIA Developer Program.

Key Features

  • Multi-GPU and multi-node communication collectives such as all-gather, all-reduce, broadcast, reduce, and reduce-scatter
  • Automatic topology detection to determine the optimal communication path
  • Optimized to achieve high bandwidth over PCIe and NVLink high-speed interconnects
  • Support for multi-threaded and multi-process applications
  • Multiple ring formations for high bus utilization
  • Support for InfiniBand verbs, RoCE and IP Socket internode communication
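
The ring formations mentioned in the list above can be illustrated with the textbook ring all-reduce algorithm, simulated here in pure Python on a single ring. This is a sketch of the general technique, not NCCL's implementation: the function name `ring_all_reduce`, the list-based buffers, and the use of summation are all assumptions, and real implementations run several rings concurrently on GPU memory.

```python
from typing import List


def ring_all_reduce(buffers: List[List[float]]) -> List[List[float]]:
    """Simulate a single-ring all-reduce (sum) over per-rank buffers.

    buffers[r] is the input of rank r. Every buffer must have the same
    length, and that length must be divisible by the number of ranks.
    """
    n = len(buffers)
    length = len(buffers[0])
    if any(len(b) != length for b in buffers) or length % n != 0:
        raise ValueError("buffers must be equal-length and divisible by rank count")
    chunk = length // n
    data = [list(b) for b in buffers]  # work on copies, leave inputs intact

    def span(c: int) -> slice:
        """Slice covering chunk c of a buffer."""
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1 (reduce-scatter): at step s, rank r sends chunk (r - s) % n to
    # its right neighbour, which adds it into its own copy of that chunk.
    for s in range(n - 1):
        # Snapshot all payloads first so every send in a step is "simultaneous".
        sends = [(r, (r - s) % n, data[r][span((r - s) % n)]) for r in range(n)]
        for r, c, payload in sends:
            dst = (r + 1) % n
            data[dst][span(c)] = [a + b for a, b in zip(data[dst][span(c)], payload)]

    # Rank r now holds the fully reduced chunk (r + 1) % n.
    # Phase 2 (all-gather): at step s, rank r forwards chunk (r + 1 - s) % n
    # to its right neighbour, which overwrites its copy with the final values.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n, data[r][span((r + 1 - s) % n)]) for r in range(n)]
        for r, c, payload in sends:
            data[(r + 1) % n][span(c)] = payload

    return data
```

In this simulation each rank sends 2 × (n − 1) chunks of size length / n, so the data sent per rank stays below twice the buffer size regardless of how many ranks join the ring.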

Learn More