NVIDIA Collective Communications Library (NCCL)
The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node collective communication primitives that are performance-optimized for NVIDIA GPUs. NCCL provides routines such as all-gather, all-reduce, broadcast, reduce, and reduce-scatter that are optimized to achieve high bandwidth over PCIe and NVLink high-speed interconnects.
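To make the collectives named above concrete, here is a minimal pure-Python sketch of what each one computes. It does not call NCCL and moves no data between GPUs: each "rank" is just a Python list, the reduction is assumed to be a sum, and the function names are illustrative rather than NCCL's API.

```python
from functools import reduce
from operator import add


def _elementwise(buffers, op):
    """Combine the buffers position by position with `op`."""
    return [reduce(op, column) for column in zip(*buffers)]


def broadcast(buffers, root):
    """Every rank ends up with a copy of the root rank's buffer."""
    return [list(buffers[root]) for _ in buffers]


def reduce_to_root(buffers, root, op=add):
    """Only the root rank receives the reduced result (None elsewhere)."""
    total = _elementwise(buffers, op)
    return [list(total) if rank == root else None for rank in range(len(buffers))]


def all_reduce(buffers, op=add):
    """Every rank receives the full reduced result."""
    total = _elementwise(buffers, op)
    return [list(total) for _ in buffers]


def all_gather(buffers):
    """Every rank receives the concatenation of all ranks' buffers, in rank order."""
    gathered = [value for buf in buffers for value in buf]
    return [list(gathered) for _ in buffers]


def reduce_scatter(buffers, op=add):
    """The reduced result is split into equal chunks; rank r keeps chunk r."""
    n = len(buffers)
    total = _elementwise(buffers, op)
    chunk = len(total) // n
    return [total[r * chunk:(r + 1) * chunk] for r in range(n)]


# Two simulated ranks, four elements each.
ranks = [[1, 2, 3, 4], [10, 20, 30, 40]]

print(all_reduce(ranks))      # [[11, 22, 33, 44], [11, 22, 33, 44]]
print(all_gather(ranks))      # both ranks: [1, 2, 3, 4, 10, 20, 30, 40]
print(reduce_scatter(ranks))  # [[11, 22], [33, 44]]
print(broadcast(ranks, 1))    # [[10, 20, 30, 40], [10, 20, 30, 40]]
print(reduce_to_root(ranks, 0))  # [[11, 22, 33, 44], None]
```

Note that an all-reduce gives the same result as a reduce-scatter followed by an all-gather, which is why these primitives are often discussed together.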
Developers of deep learning frameworks can rely on NCCL’s highly optimized, MPI-compatible, and topology-aware routines to take full advantage of all available GPUs within and across multiple nodes. Leading deep learning frameworks such as Caffe, Caffe2, Chainer, MXNet, TensorFlow, and PyTorch have integrated NCCL to accelerate deep learning training on multi-GPU systems. To download earlier (1.x) versions of NCCL, please visit NCCL’s GitHub page.
What’s New in NCCL 2.3
Deep learning frameworks using NCCL 2.3 and later can leverage the new features and performance of the Volta and Turing architectures to deliver high-performance, efficient multi-node, multi-GPU scaling of deep learning training. NCCL 2.3 highlights include:
- Improved low-latency algorithms for small message sizes
- Finer control of when to use GPUDirect P2P and GPUDirect RDMA
- Support for multi-threaded and multi-process applications
- Faster training of newer and deeper models with aggregated inter-GPU reduction operations
- Multiple ring formations for high bus utilization
- Support for InfiniBand verbs, RoCE and IP Socket internode communication
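The ring formations mentioned in the highlights can be illustrated with a textbook ring all-reduce. The sketch below is a pure-Python simulation under stated assumptions (sum reduction, equal-length buffers whose length is divisible by the number of ranks, a single ring). It is not NCCL's implementation, and the page does not describe how NCCL builds or schedules its rings; the function name `ring_all_reduce` is hypothetical.

```python
def ring_all_reduce(buffers):
    """Simulate a sum all-reduce over a ring of len(buffers) ranks.

    Rank r only ever sends to rank (r + 1) % n, so each link in the ring
    carries one chunk per step and all links are busy at the same time.
    """
    n = len(buffers)
    length = len(buffers[0])
    if any(len(b) != length for b in buffers) or length % n != 0:
        raise ValueError("buffers must have equal length divisible by the rank count")
    size = length // n
    data = [list(b) for b in buffers]  # work on copies, leave inputs untouched

    def chunk(index):
        return slice(index * size, (index + 1) * size)

    # Phase 1 (reduce-scatter): after n - 1 steps, rank r holds the fully
    # reduced chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot all outgoing chunks first so the sends in one step are
        # effectively simultaneous.
        outgoing = [((r - step) % n, data[r][chunk((r - step) % n)]) for r in range(n)]
        for r, (index, payload) in enumerate(outgoing):
            dst = (r + 1) % n
            sl = chunk(index)
            data[dst][sl] = [a + b for a, b in zip(data[dst][sl], payload)]

    # Phase 2 (all-gather): circulate the reduced chunks until every rank
    # has all of them.
    for step in range(n - 1):
        outgoing = [((r + 1 - step) % n, data[r][chunk((r + 1 - step) % n)]) for r in range(n)]
        for r, (index, payload) in enumerate(outgoing):
            data[(r + 1) % n][chunk(index)] = payload

    return data


print(ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]]))
# [[111, 222, 333], [111, 222, 333], [111, 222, 333]]
```

In this scheme each rank sends 2 × (n − 1) chunks of size 1/n of the buffer, so the data sent per rank stays below twice the buffer size however many ranks join the ring. Running several rings in parallel is one plausible way to use more links at once.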
- NVIDIA Deep Learning SDK documentation
- Blogs on Programming Tensor Cores in cuDNN
- Tensor Ops Made Easier in cuDNN
- Programming Tensor Cores in CUDA 9
- Related libraries and software:
- Find other NCCL developers on NVIDIA Developer Forums
- For questions or to provide feedback, please contact firstname.lastname@example.org
- To file bugs or report an issue, register on NVIDIA Developer Zone