The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and networking. NCCL provides collective routines (all-gather, all-reduce, broadcast, reduce, and reduce-scatter) as well as point-to-point send and receive. These routines are optimized to achieve high bandwidth and low latency over PCIe and NVLink high-speed interconnects within a node and over NVIDIA Mellanox networking across nodes.
NCCL is available for download as part of the NVIDIA HPC SDK and as a separate package for Ubuntu and Red Hat.
NCCL conveniently removes the need for developers to optimize their applications for specific machines. NCCL provides fast collectives over multiple GPUs both within and across nodes.
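To make the collectives named above concrete, the following is a minimal CPU-only sketch of what each one computes, using a sum reduction. It is not NCCL code: the function names, the list-of-lists layout (one buffer per simulated rank), and the choice of sum are assumptions made for illustration, and it says nothing about how NCCL moves data between GPUs.

```python
from typing import List, Optional

# One buffer per simulated rank; bufs[r] stands in for rank r's send buffer.
Buffers = List[List[int]]


def _elementwise_sum(bufs: Buffers) -> List[int]:
    """Sum the buffers position by position (all buffers have equal length)."""
    return [sum(column) for column in zip(*bufs)]


def broadcast(bufs: Buffers, root: int) -> Buffers:
    """Every rank ends up with a copy of the root's buffer."""
    return [list(bufs[root]) for _ in bufs]


def reduce(bufs: Buffers, root: int) -> List[Optional[List[int]]]:
    """Only the root receives the element-wise sum; other ranks get nothing."""
    total = _elementwise_sum(bufs)
    return [total if rank == root else None for rank in range(len(bufs))]


def all_reduce(bufs: Buffers) -> Buffers:
    """Every rank receives the element-wise sum of all buffers."""
    total = _elementwise_sum(bufs)
    return [list(total) for _ in bufs]


def all_gather(bufs: Buffers) -> Buffers:
    """Every rank receives all buffers concatenated in rank order."""
    gathered = [value for buf in bufs for value in buf]
    return [list(gathered) for _ in bufs]


def reduce_scatter(bufs: Buffers) -> Buffers:
    """Rank r receives the r-th equal-sized chunk of the element-wise sum.

    Assumes the buffer length is divisible by the number of ranks.
    """
    n = len(bufs)
    total = _elementwise_sum(bufs)
    assert len(total) % n == 0, "buffer length must be divisible by rank count"
    chunk = len(total) // n
    return [total[r * chunk:(r + 1) * chunk] for r in range(n)]


if __name__ == "__main__":
    ranks = [[1, 2], [3, 4]]  # two simulated ranks, two elements each
    print(all_reduce(ranks))
    print(reduce_scatter(ranks))
```

In this model, all-reduce gives the same result as a reduce-scatter followed by an all-gather, which is one reason the three operations are usually discussed together.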
Ease of Programming
NCCL uses a simple C API, which can be easily accessed from a variety of programming languages. NCCL closely follows the popular collectives API defined by MPI (Message Passing Interface).
NCCL is compatible with virtually any multi-GPU parallelization model, such as single-threaded, multi-threaded (using one thread per GPU), and multi-process (MPI combined with multi-threaded operation on GPUs).
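As a rough analogy for the one-thread-per-GPU model, the sketch below runs one Python thread per simulated rank and has every thread take part in an all-reduce. It is a CPU-only illustration, not NCCL's API: `threaded_all_reduce`, the shared list, and the barrier are all assumptions chosen to show that each rank makes its own call and that no rank can finish until every rank has contributed.

```python
import threading
from typing import List, Optional


def threaded_all_reduce(inputs: List[List[int]]) -> List[List[int]]:
    """Simulate an all-reduce (sum) with one thread per rank."""
    n = len(inputs)
    shared: List[Optional[List[int]]] = [None] * n   # slot r is written only by rank r
    outputs: List[Optional[List[int]]] = [None] * n  # slot r is written only by rank r
    barrier = threading.Barrier(n)

    def worker(rank: int) -> None:
        # Each rank publishes its own contribution.
        shared[rank] = inputs[rank]
        # Wait until every rank has published; a collective cannot complete
        # while any participant is missing.
        barrier.wait()
        # Each rank computes the same element-wise sum for itself.
        outputs[rank] = [sum(column) for column in zip(*shared)]

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outputs


if __name__ == "__main__":
    print(threaded_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]]))
```

The same structure carries over to the multi-process model, with processes in place of threads and a launcher such as MPI handling startup and rank assignment.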
- Automatic topology detection for high bandwidth paths on AMD, ARM, PCI Gen4 and IB HDR
- Up to 2x peak bandwidth with in-network all-reduce operations using SHARPv2
- Graph search for the optimal set of rings and trees with the highest bandwidth and lowest latency
- Support for multi-threaded and multi-process applications
- InfiniBand verbs, libfabric, RoCE and IP Socket internode communication
- Reroute traffic and alleviate congested ports with InfiniBand Adaptive routing
- NVIDIA Deep Learning SDK documentation
- Technical Blog: Massively Scale Your Deep Learning Training with NCCL 2.4
- Technical Blog: Scaling Deep Learning Training with NCCL 2.3
- To file bugs or report an issue, register on NVIDIA Developer Zone