NCCL

Oct 20, 2025
Scaling Large MoE Models with Wide Expert Parallelism on NVL72 Rack Scale Systems
Modern AI workloads have moved well beyond single-GPU inference serving. Model parallelism, which efficiently splits computation across many GPUs, is now the...
10 MIN READ

Sep 09, 2025
How to Connect Distributed Data Centers Into Large AI Factories with Scale-Across Networking
AI scaling is incredibly complex, and new techniques in training and inference are continually demanding more out of the data center. While data center...
6 MIN READ

Jul 22, 2025
Understanding NCCL Tuning to Accelerate GPU-to-GPU Communication
The NVIDIA Collective Communications Library (NCCL) is essential for fast GPU-to-GPU communication in AI workloads, using various optimizations and tuning to...
14 MIN READ

Jul 18, 2025
Optimizing for Low-Latency Communication in Inference Workloads with JAX and XLA
Running inference with large language models (LLMs) in production requires meeting stringent latency constraints. A critical stage in the process is LLM decode,...
6 MIN READ

Jul 14, 2025
Enabling Fast Inference and Resilient Training with NCCL 2.27
As AI workloads scale, fast and reliable GPU communication becomes vital, not just for training, but increasingly for inference at scale. The NVIDIA Collective...
9 MIN READ

Jul 14, 2025
NCCL Deep Dive: Cross Data Center Communication and Network Topology Awareness
As the scale of AI training increases, a single data center (DC) is not sufficient to deliver the required computational power. Most recent approaches to...
9 MIN READ

Jun 18, 2025
Improved Performance and Monitoring Capabilities with NVIDIA Collective Communications Library 2.26
The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multinode communication primitives optimized for NVIDIA GPUs and networking. NCCL...
11 MIN READ

May 14, 2025
AI Fabric Resiliency and Why Network Convergence Matters
High-performance computing and deep learning workloads are extremely sensitive to latency. Packet loss forces retransmission or stalls in the communication...
7 MIN READ

Mar 13, 2025
Networking Reliability and Observability at Scale with NCCL 2.24
The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multinode (MGMN) communication primitives optimized for NVIDIA GPUs and networking....
14 MIN READ

Jan 31, 2025
New Scaling Algorithm and Initialization with NVIDIA Collective Communications Library 2.23
The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multinode communication primitives optimized for NVIDIA GPUs and networking. NCCL...
9 MIN READ

Oct 25, 2024
Advancing Performance with NVIDIA SHARP In-Network Computing
AI and scientific computing applications are great examples of distributed computing problems. The problems are too large and the computations too intensive to...
7 MIN READ

Sep 16, 2024
Memory Efficiency, Faster Initialization, and Cost Estimation with NVIDIA Collective Communications Library 2.22
For the past few months, the NVIDIA Collective Communications Library (NCCL) developers have been working hard on a set of new library features and bug fixes....
8 MIN READ

Sep 06, 2024
Enhancing Application Portability and Compatibility across New Platforms Using NVIDIA Magnum IO NVSHMEM 3.0
NVSHMEM is a parallel programming interface that provides efficient and scalable communication for NVIDIA GPU clusters. Part of NVIDIA Magnum IO and based on...
7 MIN READ

Apr 26, 2024
Perception Model Training for Autonomous Vehicles with Tensor Parallelism
Due to the adoption of multicamera inputs and deep convolutional backbone networks, the GPU memory footprint for training autonomous driving perception models...
10 MIN READ

Mar 06, 2024
CUDA Toolkit 12.4 Enhances Support for NVIDIA Grace Hopper and Confidential Computing
The latest release of CUDA Toolkit, version 12.4, continues to push accelerated computing performance using the latest NVIDIA GPUs. This post explains the new...
9 MIN READ

Oct 12, 2023
Networking for Data Centers and the Era of AI
Traditional cloud data centers have served as the bedrock of computing infrastructure for over a decade, catering to a diverse range of users and applications....
6 MIN READ