NVIDIA NCCL

The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and networking. NCCL provides collective routines such as all-gather, all-reduce, broadcast, reduce, and reduce-scatter, as well as point-to-point send and receive. These routines are optimized for high bandwidth and low latency over PCIe and NVLink high-speed interconnects within a node and over NVIDIA Mellanox networking across nodes.
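To make the collective names above concrete, here is a minimal CPU-only sketch of what each routine computes. It is not NCCL code and does not call NCCL: each "rank" is modeled as a plain Python list, and the function names (`broadcast`, `reduce_to_root`, `all_reduce`, `all_gather`, `reduce_scatter`) are hypothetical helpers chosen for illustration. Summation is assumed as the default reduction.

```python
from functools import reduce
from operator import add

# Assumption: buffers[r] is the input buffer of rank r; all buffers have equal length.

def broadcast(buffers, root):
    """Every rank ends up with a copy of the root rank's buffer."""
    return [list(buffers[root]) for _ in buffers]

def reduce_to_root(buffers, root, op=add):
    """Only the root receives the element-wise reduction.

    None marks ranks whose output is not defined by the operation.
    """
    total = [reduce(op, column) for column in zip(*buffers)]
    return [total if rank == root else None for rank in range(len(buffers))]

def all_reduce(buffers, op=add):
    """Every rank receives the element-wise reduction of all buffers."""
    total = [reduce(op, column) for column in zip(*buffers)]
    return [list(total) for _ in buffers]

def all_gather(buffers):
    """Every rank receives the concatenation of all buffers, in rank order."""
    gathered = [value for buf in buffers for value in buf]
    return [list(gathered) for _ in buffers]

def reduce_scatter(buffers, op=add):
    """Reduce element-wise, then rank r keeps only the r-th equal-sized chunk."""
    n = len(buffers)
    total = [reduce(op, column) for column in zip(*buffers)]
    chunk = len(total) // n
    return [total[r * chunk:(r + 1) * chunk] for r in range(n)]

ranks = [[1, 2], [10, 20]]           # two simulated ranks, two elements each
print(all_reduce(ranks))             # [[11, 22], [11, 22]]
print(all_gather(ranks))             # [[1, 2, 10, 20], [1, 2, 10, 20]]
print(reduce_scatter(ranks))         # [[11], [22]]
print(broadcast(ranks, root=1))      # [[10, 20], [10, 20]]
print(reduce_to_root(ranks, root=0)) # [[11, 22], None]
```

The sketch only captures input/output semantics; the bandwidth and latency behavior described above comes from how the data actually moves between devices.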

Leading deep learning frameworks such as Caffe2, Chainer, MXNet, PyTorch, and TensorFlow have integrated NCCL to accelerate deep learning training on multi-GPU, multi-node systems.

NCCL is available for download as part of the NVIDIA HPC SDK and as a separate package for Ubuntu and Red Hat.


Performance

NCCL conveniently removes the need for developers to optimize their applications for specific machines. NCCL provides fast collectives over multiple GPUs both within and across nodes.

Ease of Programming

NCCL uses a simple C API that can be easily accessed from a variety of programming languages. NCCL closely follows the popular collectives API defined by MPI (Message Passing Interface).

Compatibility

NCCL is compatible with virtually any multi-GPU parallelization model, such as single-threaded, multi-threaded (using one thread per GPU), and multi-process (MPI combined with multi-threaded operation on GPUs).
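As an illustration of the multi-threaded model (one thread per GPU), the following CPU-only sketch starts one worker thread per simulated device and has the threads cooperate on a sum all-reduce. It does not use NCCL or any GPU: the function `threaded_all_reduce`, the shared `staged` list, and the use of `threading.Barrier` as the synchronization point are all assumptions made for this example.

```python
import threading

def threaded_all_reduce(inputs):
    """Run one worker thread per simulated device and sum-all-reduce their buffers."""
    n = len(inputs)
    staged = [None] * n             # shared staging area, one slot per device
    outputs = [None] * n            # one result slot per device
    barrier = threading.Barrier(n)  # every worker must arrive before any proceeds

    def worker(rank):
        staged[rank] = list(inputs[rank])  # each thread publishes its own buffer
        barrier.wait()                     # wait until all devices have contributed
        # Each thread computes the same element-wise sum for its own device.
        outputs[rank] = [sum(column) for column in zip(*staged)]

    threads = [threading.Thread(target=worker, args=(rank,)) for rank in range(n)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return outputs

print(threaded_all_reduce([[1, 2], [3, 4], [5, 6]]))  # [[9, 12], [9, 12], [9, 12]]
```

The point of the sketch is the shape of the program: every thread owns one device, contributes its buffer, and blocks until the collective can complete. In the single-threaded model one thread would drive all devices in turn, and in the multi-process model each worker would be a separate process rather than a thread.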

Key Features

Automatic topology detection for high-bandwidth paths on AMD, ARM, PCIe Gen4 and InfiniBand HDR
Up to 2x peak bandwidth with in-network all-reduce operations utilizing SHARPv2
Graph search for the optimal set of rings and trees with the highest bandwidth and lowest latency
Support for multi-threaded and multi-process applications
InfiniBand verbs, libfabric, RoCE and IP Socket internode communication
Traffic rerouting to alleviate congested ports with InfiniBand adaptive routing
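The feature list mentions rings as one of the communication patterns NCCL searches over. The following is a textbook ring all-reduce simulated on the CPU, shown only to illustrate the pattern; it is not NCCL's implementation. The function name `ring_all_reduce` and the list-based buffers are assumptions for this sketch, which also assumes a sum reduction and a buffer length divisible by the number of ranks.

```python
def ring_all_reduce(buffers):
    """Sum all-reduce over a logical ring.

    Returns (results, elements_sent_per_rank).
    """
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "sketch assumes the buffer splits evenly into n chunks"
    chunk = size // n
    data = [list(buf) for buf in buffers]  # working copy, one buffer per rank
    sent = 0                               # elements sent by each rank so far

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    def exchange(chunk_for_rank, combine):
        # Snapshot outgoing payloads first so all sends in a step are simultaneous.
        messages = []
        for rank in range(n):
            c = chunk_for_rank(rank)
            messages.append((c, [data[rank][i] for i in span(c)]))
        for rank, (c, payload) in enumerate(messages):
            neighbor = (rank + 1) % n      # each rank sends to its right neighbor
            for i, value in zip(span(c), payload):
                data[neighbor][i] = combine(data[neighbor][i], value)

    # Phase 1, reduce-scatter: after n - 1 steps rank r holds the fully
    # reduced chunk (r + 1) mod n.
    for step in range(n - 1):
        exchange(lambda r: (r - step) % n, lambda mine, incoming: mine + incoming)
        sent += chunk

    # Phase 2, all-gather: circulate the reduced chunks until every rank has all.
    for step in range(n - 1):
        exchange(lambda r: (r + 1 - step) % n, lambda mine, incoming: incoming)
        sent += chunk

    return data, sent

inputs = [[rank * 10 + i for i in range(8)] for rank in range(4)]
results, sent = ring_all_reduce(inputs)
print(results[0])  # [60, 64, 68, 72, 76, 80, 84, 88]
print(sent)        # 12
```

In this scheme each rank sends 2(n - 1)/n times the buffer size (12 elements for an 8-element buffer on 4 ranks), which approaches twice the buffer size as n grows. That factor is one plausible way to read the "up to 2x" figure for in-network reduction, where the reduction happens in the network instead of being passed around the ring, though the list above does not spell out the derivation.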

Resources

NVIDIA Deep Learning SDK documentation
Technical Blog: Massively Scale Your Deep Learning Training with NCCL 2.4
Technical Blog: Scaling Deep Learning Training with NCCL 2.3
Related libraries and software:
- HPC SDK
- cuDNN
- cuBLAS
- DALI
- NVIDIA GPU Cloud
- Magnum IO
To file bugs or report an issue, register on the NVIDIA Developer Zone.
