The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node collective communication primitives that are performance-optimized for NVIDIA GPUs. NCCL provides routines such as all-gather, all-reduce, broadcast, reduce, and reduce-scatter that are optimized to achieve high bandwidth over PCIe and the NVLink high-speed interconnect.
Developers of deep learning frameworks and HPC applications can rely on NCCL’s highly optimized, MPI-compatible, and topology-aware routines to take full advantage of all available GPUs within and across multiple nodes. This allows them to focus on developing new algorithms and software capabilities rather than performance-tuning low-level communication collectives.
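To make the routines mentioned above concrete, here is a minimal sketch of a single-process, multi-GPU all-reduce using NCCL's C API. This is an illustrative example, not code from the post; buffer sizes and the fixed-size arrays (capped at 8 devices here) are arbitrary choices, and error checking on the CUDA and NCCL calls is omitted for brevity.

```c
/* Minimal single-process, multi-GPU sum all-reduce with NCCL.
 * Illustrative sketch: error checking omitted; assumes <= 8 GPUs. */
#include <nccl.h>
#include <cuda_runtime.h>

#define MAX_DEV 8

int main(void) {
  int nDev = 0;
  cudaGetDeviceCount(&nDev);
  if (nDev > MAX_DEV) nDev = MAX_DEV;

  /* One communicator per GPU, all created in this process. */
  int devs[MAX_DEV];
  for (int i = 0; i < nDev; ++i) devs[i] = i;
  ncclComm_t comms[MAX_DEV];
  ncclCommInitAll(comms, nDev, devs);

  /* Allocate a send and receive buffer and a stream on each device. */
  const size_t count = 1 << 20;  /* elements per GPU (arbitrary) */
  float *sendbuf[MAX_DEV], *recvbuf[MAX_DEV];
  cudaStream_t streams[MAX_DEV];
  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(i);
    cudaMalloc((void**)&sendbuf[i], count * sizeof(float));
    cudaMalloc((void**)&recvbuf[i], count * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  /* Group the per-GPU calls so NCCL can launch them as one collective. */
  ncclGroupStart();
  for (int i = 0; i < nDev; ++i)
    ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();

  /* Wait for completion, then clean up. */
  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
    cudaFree(sendbuf[i]);
    cudaFree(recvbuf[i]);
    cudaStreamDestroy(streams[i]);
    ncclCommDestroy(comms[i]);
  }
  return 0;
}
```

After `ncclGroupEnd()` and stream synchronization, every `recvbuf[i]` holds the element-wise sum of all GPUs' `sendbuf` contents. Multi-node setups follow the same pattern but exchange a `ncclUniqueId` (for example via MPI) and call `ncclCommInitRank` on each rank instead of `ncclCommInitAll`.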
Leading deep learning frameworks such as Caffe2, Microsoft Cognitive Toolkit, MXNet, PyTorch and TensorFlow use NCCL to deliver near-linear scaling of deep learning training on multi-GPU systems.
NCCL 2 will be freely available for download in July to members of the NVIDIA Developer Program. Sign up below to be notified when NCCL 2 is ready for download.
Earlier versions of NCCL along with source code are freely available for download on NCCL’s GitHub page.