
Fusing Communication and Compute with New Device API and Copy Engine Collectives in NVIDIA NCCL 2.28

The latest release of the NVIDIA Collective Communications Library (NCCL) introduces a groundbreaking fusion of communication and computation for higher throughput, reduced latency, and maximized GPU utilization across multi-GPU and multi-node systems.

NCCL 2.28 focuses on GPU-initiated networking, device APIs for communication-compute fusion, copy-engine-based collectives, and new APIs for developers to build efficient, scalable distributed applications. Alongside these performance innovations, this release also enhances the developer experience with expanded APIs, improved tooling, and cleaner integration paths, empowering developers to write custom communication kernels and scale their applications with greater flexibility and efficiency.

Release highlights 

Improvements to performance, monitoring, reliability, and quality of service are supported through the following features: 

  • Device API: Enables development of custom device kernels that fuse communication and compute, including GPU-initiated networking so kernels can perform network operations directly.
  • Copy Engine (CE)-based collectives: Developers can use CEs to drive NVIDIA NVLink transfers, reducing contention for streaming multiprocessor (SM) compute resources.
  • NCCL Inspector: Provides a low-overhead profiling plugin that enables always-on observability and analysis of NCCL communication patterns.

How the NCCL device API enables direct kernel communication

NCCL 2.28 introduces a device-side communication API for direct communication within NVIDIA CUDA kernels. Previously, all NCCL operations were host-initiated, which introduced synchronization overhead. With the new API, kernels initiate data movement directly, integrating communication with compute operations for higher throughput and lower overhead. To use the new API, you must use data buffers with symmetric memory windows.
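
As a rough illustration of this prerequisite, the host-side sketch below allocates a buffer and registers it as a symmetric window before any device-side communication touches it. The call names and flag (ncclMemAlloc, ncclCommWindowRegister, NCCL_WIN_COLL_SYMMETRIC) follow the window registration interface from recent NCCL releases; verify them against the NCCL 2.28 headers and documentation.

#include <nccl.h>

// Sketch: allocate a buffer and register it as a symmetric memory window,
// the prerequisite for device API communication. Names and flags reflect
// the symmetric registration interface from recent NCCL releases -- verify
// against the NCCL 2.28 headers.
ncclResult_t registerSymmetricBuffer(ncclComm_t comm, size_t bytes,
                                     void** buf, ncclWindow_t* win) {
  // Allocate through NCCL so the buffer satisfies alignment and
  // registration requirements for window registration.
  ncclResult_t res = ncclMemAlloc(buf, bytes);
  if (res != ncclSuccess) return res;
  // Register the buffer as a collective symmetric window; every rank in
  // the communicator must make the matching call.
  return ncclCommWindowRegister(comm, *buf, bytes, win, NCCL_WIN_COLL_SYMMETRIC);
}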

Currently, three operation modes are supported: 

  • Load/Store Accessible (LSA): For communication between devices accessible via memory load/store operations, using CUDA P2P.
  • Multimem: For communication between devices using the hardware multicast feature provided by NVLink SHARP.
  • GPU Initiated Networking (GIN): For communication between devices initiated by the GPU using the network.
Figure 1. The three Device API operation modes and their underlying communication mechanisms: LSA uses load/store access to remote memory, Multimem uses NVLink SHARP multicast load/reduce operations, and GIN uses put, wait, and signal operations over network mechanisms such as GDAKI or a proxy

More information is available in the online documentation.

Special mention should be made of GPU-Initiated Networking (GIN), newly introduced in NCCL 2.28.7, which enables GPUs to manage their own network operations without CPU intervention. Kernels can directly enqueue data transfers, signals, and synchronization steps, removing bottlenecks caused by host-driven control paths.

Accelerating NCCL performance with copy engine offload

Achieving peak communication performance at scale requires dedicating more SMs to saturate NVLink bandwidth. Allocating more SMs to communication creates resource competition with compute kernels, which can reduce overall application performance.

CE collectives offload communication tasks within the NVLink domain from SMs to dedicated hardware CEs. In contrast to traditional NCCL collectives, CE-based collectives achieve zero-SM operation. This approach applies to collectives that require only data movement, such as AlltoAll and AllGather. This frees up SM resources for computational workloads and improves the overlap of communication and computation. Both can now execute concurrently without competing for the same hardware resources.

CE-based collectives employ several optimizations for enhanced performance. For instance, they utilize batched APIs (e.g., cudaMemcpyBatchAsync) to group multiple CE operations into single calls, reducing CUDA driver overhead. Additionally, they use NVLink multicast optimization to broadcast synchronization signals efficiently. 

These collectives achieve performance comparable to SM-based collectives without requiring SM resources. The following figure shows that CE-based AllGather achieves higher peak bandwidth than SM-based AllGather. This performance advantage also partially stems from CE-initiated NVLink transactions using larger transaction widths compared to SM-initiated transactions.

Figure 2. NCCL AllGather bandwidth versus message size (one node, 8 Blackwell GPUs) for CE-based (Multicast, Unicast) and SM-based (symmetric, non-symmetric) implementations; CE Multicast peaks around 780 GB/s, about a 1.25x gain over the SM-based implementations at 4 GB

Review the requirements and learn how to enable CE collectives in the NCCL documentation.

Profiling and observability made easy with NCCL Inspector

The NCCL Inspector is an observability, profiling, and analysis plugin that provides detailed per-communicator and per-collective performance and metadata logging. It is designed to run always-on, helping users analyze and debug NCCL collective operations by generating structured JSON output for each operation and enabling insights into communication patterns and performance characteristics during a distributed workload run using NCCL.

The NCCL Inspector uses the NCCL Profiler plugin interface architecture to integrate into NCCL use cases. Key features include:

  • Per-communicator tracking: The inspector tracks each NCCL communicator individually, enabling users to analyze performance patterns across different communication contexts. This is particularly valuable in complex distributed applications where multiple communicators may be used for different purposes.
  • Always-on: The low overhead of the plugin means that the inspector can be used in production workloads to provide observability into NCCL without diminished performance. 
  • Event tracing: The plugin captures detailed event traces, including collective start/stop events, kernel channel operations, and timing information.
  • Performance metrics: NCCL Inspector calculates and reports key performance metrics, including algorithmic bandwidth, bus bandwidth, execution time, timing source information (GPU vs CPU timing), message sizes, and collective types.
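
For reference, the bandwidth metrics are typically derived as in the nccl-tests convention: algorithmic bandwidth is the message size divided by execution time, and bus bandwidth rescales it by a per-collective factor. The following is a minimal sketch assuming those formulas; the Inspector's exact definitions may differ slightly.

#include <stddef.h>

// Sketch of the usual NCCL bandwidth metrics (nccl-tests convention).
// algbw: bytes moved per second from the caller's perspective.
// busbw: algbw scaled by a per-collective factor so numbers are comparable
//        across collectives and rank counts.
double algBandwidthGBps(size_t bytes, double seconds) {
  return bytes / seconds / 1e9;
}

// The factor for AllGather and ReduceScatter is (n - 1) / n for n ranks;
// AllReduce uses 2 * (n - 1) / n.
double busBandwidthGBps(double algbw, int nRanks) {
  return algbw * (double)(nRanks - 1) / (double)nRanks;
}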

The plugin is designed to work alongside other NCCL plugins and can provide valuable data for tuner plugins that use performance feedback to optimize communication patterns. While the current design doesn’t provide direct shared context between the NCCL Inspector and the tuner plugins, the detailed performance data generated by the inspector can be used by external analysis tools to inform tuning decisions.

Figure 3. Elastic and Kibana dashboard visualization of NCCL Inspector data: collective bus bandwidth versus collective sequence number for three HCA-only ReduceScatter runs

The plugin operates independently of network-specific implementations, making it compatible with various network technologies supported by NCCL. This ensures that the inspector can provide insights regardless of the underlying network infrastructure used.

More information can be found in ext-profiler/inspector/README.md.

Improved developer experience with NCCL 2.28

NCCL 2.28 also extends beyond core communication and compute fusion by delivering a suite of enhancements that improve flexibility, performance tuning, and the developer experience. New APIs, kernel orchestration, configuration, profiling, and build systems allow developers to gain greater control over communication workflows with improved observability and streamlined deployments across diverse hardware and environments. The following quick run-through highlights these improvements.

New host APIs for AlltoAll, Gather, and Scatter

Introducing native host-level APIs for AlltoAll, Gather, and Scatter operations enables NCCL to apply advanced optimizations. For example, NCCL can use copy engines to reduce SM usage for these communication patterns, an optimization also introduced in this release. These performance enhancements are only possible with dedicated native APIs.
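
As a rough illustration of how these host APIs fit into application code, the sketch below issues an all-to-all exchange. The function name and signature here are assumptions that mirror the existing NCCL collectives; take the exact prototypes from the NCCL 2.28 headers.

#include <nccl.h>
#include <cuda_runtime.h>

// Sketch: all-to-all exchange through the new host API. The name
// ncclAlltoAll and its argument order are assumed to follow the existing
// collective convention (per-peer element count, datatype, communicator,
// stream) -- verify against nccl.h.
// sendbuff and recvbuff each hold nRanks * countPerPeer elements.
void exchange(ncclComm_t comm, cudaStream_t stream,
              const float* sendbuff, float* recvbuff, size_t countPerPeer) {
  ncclAlltoAll(sendbuff, recvbuff, countPerPeer, ncclFloat, comm, stream);
}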

Symmetric kernels group call support

Single symmetric kernels already provide excellent latency, bandwidth, and resource usage when used within an NVLink domain. NCCL 2.28 adds support for grouped symmetric kernels for improved performance and resource usage. Users can register window buffers as usual and call multiple NCCL collectives between ncclGroupStart() and ncclGroupEnd(). During launch, NCCL automatically detects collectives that can use symmetric kernels and groups and schedules them into a single kernel for higher efficiency. For window buffer registration, refer to NCCL Window Buffer Registration.
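
For illustration, a grouped launch might look like the following sketch. It assumes both buffers were previously registered as symmetric windows (see the registration sketch earlier); whether the two collectives are fused into a single symmetric kernel is an internal NCCL decision.

#include <nccl.h>
#include <cuda_runtime.h>

// Sketch: two collectives issued in one group so NCCL can fuse them into a
// single symmetric kernel when their buffers are registered as symmetric
// windows. The calling code is just a regular grouped launch.
void groupedCollectives(ncclComm_t comm, cudaStream_t stream,
                        const float* sendA, float* recvA, size_t countA,
                        const float* sendB, float* recvB, size_t countB) {
  ncclGroupStart();
  ncclAllGather(sendA, recvA, countA, ncclFloat, comm, stream);
  ncclReduceScatter(sendB, recvB, countB, ncclFloat, ncclSum, comm, stream);
  ncclGroupEnd();
}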

Flexible config management with NCCL environment plugin

Optimizing communication parameters is crucial for performance, and NCCL offers robust mechanisms for this through configuration files and environment variables. When different jobs require different NCCL versions and configurations, however, the current file-based methods demand manual version matching and don't integrate well with modern deployment workflows that keep configuration in databases or cloud schedulers.

The new NCCL environment plugin API offers a flexible, programmatic alternative with key advantages:

  • Programmatic version matching automatically applies the correct configuration for each NCCL version.
  • Storage agnostic configuration uses settings from files, databases, or environment variables.
  • Enhanced flexibility and control by overriding some parameters programmatically while preserving others for fine-grained control.
  • Initialization and resource management start once, stay active for the duration of the run, and cleanly release resources at shutdown.
  • Future-proof and ready for new per-communicator configurations.

When enabled, the plugin integrates with the NCCL parameter subsystem and overrides lower-priority configuration mechanisms, ensuring a consistent, version-aware setup. It simplifies large-scale, multi-environment deployments, freeing users from the limitations of static file-based systems.

Shared context for plugins

AI model training increasingly spans multiple data centers, creating major challenges for communication libraries. A redesigned plugin system supporting shared contexts and per-communicator tuning replaces global initialization and enables context‑aware optimizations across diverse network environments.

The network plugin API has been updated as follows:

// Network plugin v11
ncclResult_t (*init)(void** ctx, uint64_t commId, ncclNetCommConfig_v11_t* config, ...)
ncclResult_t (*finalize)(void* ctx);
ncclResult_t (*listen)(void* ctx, ...);
ncclResult_t (*connect)(void* ctx, int dev, void* handle, ...);

Each communicator now initializes with a commId and network config, returning a per-communicator context handle (ctx) for isolation and fine‑grained tuning. This mechanism, already used in tuner and profiler plugins, now extends to the network plugin. Developers can also place code for network, tuner, and profiler plugins in one .so library via the NCCL_NET_PLUGIN variable to enable shared contexts and inter‑plugin communication.
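
As a sketch of what the per-communicator context enables, a plugin can allocate its own state in init and release it in finalize. The signatures below are simplified stand-ins for the elided v11 prototypes above; consult the NCCL plugin headers for the full argument lists.

#include <stdint.h>
#include <stdlib.h>

// Sketch: per-communicator plugin state, keyed off the ctx handle returned
// from init. Simplified signatures; the real v11 prototypes carry more
// arguments (see the declarations above).
typedef struct {
  uint64_t commId;   // Communicator this context belongs to.
  int      numConns; // Example of per-communicator bookkeeping.
} pluginCtx;

// Called once per communicator: allocate isolated state and return it.
static int pluginInit(void** ctx, uint64_t commId) {
  pluginCtx* c = (pluginCtx*)calloc(1, sizeof(pluginCtx));
  if (c == NULL) return 1;
  c->commId = commId;
  *ctx = c;
  return 0;
}

// Called when the communicator shuts down: release only this context.
static int pluginFinalize(void* ctx) {
  free(ctx);
  return 0;
}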

NCCL profiler API events

Previously, the NCCL profiler plugin interface didn’t capture NCCL API calls or CUDA kernel launch events, making it unable to correlate launches, collective operations, or point-to-point events with the corresponding NCCL API calls.

The profiler now enables:

  • Displaying API events in user order. This ensures correct sequencing, even if operations are grouped with ncclGroupStart/ncclGroupEnd, which may alter scheduling order.
  • Measuring NCCL operation overhead on the CPU, i.e., the time taken by NCCL to schedule an operation after the user’s API call.
  • Correlating CUDA kernel launches directly with the originating NCCL API calls.
  • Linking collective and point-to-point tasks scheduled by NCCL to their original API calls. API events persist across graph launches, allowing lower-level tasks for each graph launch to be correlated with the user’s original API call.

CMake-based build system

NCCL now also supports CMake for Linux builds, offering a modernized alternative to Make. The CMake system simplifies integration into larger build pipelines and cross-platform projects while maintaining compatibility with legacy systems. With standardized CMake support, projects can now leverage familiar workflows used across the broader CUDA and HPC ecosystem, improving reproducibility and reducing maintenance overhead. The result is a more flexible, modular, and developer-friendly build experience.

Get started with NCCL 2.28

Discover what’s new in NCCL 2.28 and push the boundaries of distributed training with enhanced scalability, optimized collective performance, and improved cross-node efficiency.

For detailed documentation, source code, and community support, visit the NVIDIA/nccl GitHub repository and check out the provided examples. To learn more about tuning NCCL for your system architecture, see the NCCL documentation.
