In LLM training, Expert Parallel (EP) communication for hyperscale mixture-of-experts (MoE) models is challenging. EP communication is essentially all-to-all, but because it is dynamic and sparse (each token activates only its topk experts rather than all of them), it is difficult to implement and optimize efficiently.
This post details an efficient MoE EP communication solution, Hybrid-EP, and its use in the NVIDIA Megatron family of frameworks on NVIDIA Quantum InfiniBand and NVIDIA Spectrum-X Ethernet platforms. It also dives into the effectiveness of Hybrid-EP in real-world model training.
Efficiency challenges of hyperscale MoE model training
DeepSeek-V3 is a representative model of the new generation of large-scale fine-grained MoE models. Such models balance computational overhead with model performance through very large parameter counts combined with sparse activation, but they also pose serious challenges for existing large-model training frameworks:
- Communication efficiency bottlenecks: The MoE model relies on parallel experts and requires frequent all-to-all communication. As the number of experts increases, the burden of EP communication increases. In DeepSeek-V3, communication time may account for more than 50% of overall training time without optimization.
- Load imbalance: Dynamic routing mechanisms cause some “hot experts” to receive more tokens than average, while “cold experts” are underutilized, resulting in uneven computing load across devices and wasted computing power. This problem becomes more pronounced in a fine-grained scenario where the number of experts and the number of activated experts continue to increase.
- Framework adaptability challenges: Today’s MoE models impose higher and more complex requirements for parallel strategies, low-precision computing, and dynamic resource scheduling. They also need optimization to maximize the potential of next-generation hardware architectures such as NVIDIA Blackwell, NVIDIA Quantum InfiniBand, and NVIDIA Spectrum-X Ethernet.
MoE training framework optimization and communication solution
NVIDIA Megatron Core, an open source library for large-scale model training, is a key foundation for training hyperscale MoE models. Its core benefits include:
- Multidimensional parallelism strategies with support for tensor parallelism (TP), sequence parallelism, pipeline parallelism (PP), MoE expert parallelism (EP), and other strategies that can be flexibly combined to accommodate diverse and complex training workloads.
- Resource and efficiency optimization integrates FP8 mixed-precision training, activation offloading, a distributed optimizer, and fine-grained recomputation to reduce GPU memory consumption and fully support large-model training. It also integrates multiple efficient operators (such as MLA, attention, and MLP) and provides various fusion optimizations and pipeline scheduling strategies to improve computing performance.
- MoE-specific adaptation provides complete support for mainstream MoE models, such as DeepSeek, Mixtral, and Qwen, with efficient, scalable training.
Hybrid-EP: An efficient communication optimization solution
Hybrid-EP is a newly designed MoE EP communication library. It uses hardware and software advancements on the NVIDIA platform to achieve near-hardware-limit communication bandwidth while minimizing GPU hardware resource usage in RDMA-NVLink hybrid network architectures.
It implements two core operators in MoE EP communication: dispatch, which routes the tokens output by the attention operator to the corresponding experts, and combine, which routes the tokens output by the experts back to the attention operator. Routing and information-processing support is also added to enable complete EP communication.
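To make these semantics concrete, here is a minimal single-process PyTorch sketch of what dispatch and combine compute for a topk routing map. It is an illustrative reference, not the Hybrid-EP API; routing weights and all communication are omitted.

```python
import torch

def reference_dispatch(tokens, topk_ids):
    """Naive dispatch reference: copy each token to every expert it routes to.

    tokens:   [num_tokens, hidden_dim]  attention output
    topk_ids: [num_tokens, topk]        expert index chosen for each token
    Returns per-expert token batches plus the source indices needed by combine.
    """
    num_experts = int(topk_ids.max().item()) + 1
    per_expert_tokens, per_expert_src = {}, {}
    for e in range(num_experts):
        token_idx, _slot = (topk_ids == e).nonzero(as_tuple=True)
        per_expert_tokens[e] = tokens[token_idx]   # gathered token copies for expert e
        per_expert_src[e] = token_idx              # which token each copy came from
    return per_expert_tokens, per_expert_src

def reference_combine(expert_outputs, per_expert_src, num_tokens, hidden_dim):
    """Naive combine reference: accumulate expert outputs back per source token.

    Accumulation is done in FP32 (gating weights omitted for brevity)."""
    out = torch.zeros(num_tokens, hidden_dim, dtype=torch.float32)
    for e, y in expert_outputs.items():
        out.index_add_(0, per_expert_src[e], y.float())
    return out.to(next(iter(expert_outputs.values())).dtype)

# Tiny usage example: 16 tokens, 4 experts, top-2 routing
tokens = torch.randn(16, 32, dtype=torch.bfloat16)
topk_ids = torch.randint(0, 4, (16, 2))
routed, src = reference_dispatch(tokens, topk_ids)
outputs = routed  # identity "experts" for the demo
combined = reference_combine(outputs, src, tokens.shape[0], tokens.shape[1])
```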

The design goals and core optimization directions of Hybrid-EP include:
- Leveraging the latest communication technologies on the NVIDIA platform, such as TMA instructions for data movement on NVLink scale-up networks and low-level IBGDA technology for RDMA networks.
- Hybrid RDMA and NVLink communication that maximizes cross-domain bandwidth by combining intra-node NVLink with inter-node RDMA to improve algorithmic bandwidth.
- A data pipeline that masks most of the latency of communication and dynamic routing by cutting data into fine-grained chunks and streaming them through multiple levels of communication pipelines, making EP bandwidth comparable to a highly optimized standard static all-to-all.
- Minimized GPU streaming multiprocessor (SM) usage to maximize communication-computation overlap. Hybrid-EP achieves peak communication bandwidth with fewer SMs, leaving more SMs available for computation.
- Native low-precision support, with FP8/BF16 dispatch operators and BF16 combine operators.
Hybrid-EP designs each CUDA block as an independent data channel that occupies an SM to run a complete data pipeline. Different warp groups within a CUDA block handle different pipeline stages. CUDA blocks run in parallel and process different data chunks, with no synchronization or communication between blocks.

The dotted boxes in Figure 2 represent pipeline stages used only for RDMA network communication. The RDMA warp group is responsible for transmitting network traffic to RDMA network interface cards (NICs) using IBGDA technology, completing network communication and token data transmission between GPUs on the same rail across different nodes (for example, GPU 0 on every node).
The G2S warp group is responsible for reading the token data owned by the local GPU, as well as the token data transmitted by GPUs of other nodes on the same rail, into the shared memory First-In, First-Out (FIFO) queue inside the SM. The S2G warp group writes the token data from the shared memory FIFO queue inside the SM to the corresponding locations in the output buffers of all GPUs in the node (including the GPU itself).
During this process, tokens are routed and transported according to the information in the routing map to avoid transmitting unwanted token data. Each CUDA block uses this data pipeline to process the token data in the data chunk in the order it is assigned. Different CUDA blocks handle different data chunks, using the same data pipeline.
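The chunk-and-channel idea can be pictured with a small host-side sketch. The sizes and helper names below are hypothetical, not the kernel-side warp-group pipeline: each "channel" stands in for one CUDA block, and the routing map is used so only the tokens each destination actually needs are forwarded.

```python
import torch

def chunked_dispatch_plan(num_tokens, chunk_tokens, num_channels, routing_mask):
    """Illustrative chunk/channel assignment for dispatch.

    routing_mask: [num_tokens, num_ranks] bool, True where a token must go to a rank.
    Chunks are assigned to channels round-robin; within a chunk, only tokens marked
    in the routing map for a destination rank are listed for transmission.
    """
    plan = []
    for i, start in enumerate(range(0, num_tokens, chunk_tokens)):
        end = min(start + chunk_tokens, num_tokens)
        channel = i % num_channels                    # round-robin chunk -> channel
        chunk_mask = routing_mask[start:end]          # [chunk, num_ranks]
        per_rank = [start + chunk_mask[:, r].nonzero(as_tuple=True)[0]
                    for r in range(chunk_mask.shape[1])]
        plan.append({"channel": channel, "range": (start, end),
                     "per_rank_tokens": per_rank})
    return plan

# Example: 4,096 tokens, 128-token chunks, 8 channels, 4 destination ranks
mask = torch.rand(4096, 4) < 0.25
plan = chunked_dispatch_plan(4096, 128, 8, mask)
```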

As with the dispatch operator, the dotted box section is used only for RDMA network communication. Because the combine operator performs high-precision accumulation on tokens, and these accumulations are currently performed only on the CUDA cores within the SM, the accumulation must be done hierarchically.
In the multi-node case, the relevant intra-node warp groups first complete part of the accumulation work for each token within the node, and the RDMA warp group then sends the partially accumulated tokens to the GPUs on the same rail in other nodes. Finally, the inter-node-related warp groups complete the global accumulation to produce the final result.
In the single-node case, the accumulation within the node is done directly by the inter-node-related warp groups. During this process, the input token is read by the corresponding G2S warp group into the shared memory G2S FIFO queue inside the SM, and the corresponding reduction warp group then accumulates tokens on the CUDA cores. The result is stored in the shared memory S2G FIFO queue within the SM and handed over to the TMA unit for writing to GPU memory.
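A simplified PyTorch reference of this hierarchical accumulation is shown below, assuming made-up shapes and a two-node layout; in Hybrid-EP the actual accumulation runs on CUDA cores inside the warp groups.

```python
import torch

def hierarchical_combine(expert_partials, node_of_rank):
    """Two-level combine reference: intra-node partial sums, then an inter-node sum.

    expert_partials: list indexed by rank, each [num_tokens, hidden] BF16 contribution.
    node_of_rank:    list giving the node id of each rank.
    Accumulation is done in FP32 and the result is cast back to BF16.
    """
    num_nodes = max(node_of_rank) + 1
    # Stage 1: each node reduces the contributions of its local ranks (over NVLink).
    node_partials = []
    for n in range(num_nodes):
        local = [p.float() for r, p in enumerate(expert_partials) if node_of_rank[r] == n]
        node_partials.append(torch.stack(local).sum(dim=0))
    # Stage 2: the partially accumulated tokens cross the RDMA network and are
    # reduced again to produce the global result.
    return torch.stack(node_partials).sum(dim=0).to(torch.bfloat16)

# Example: 8 ranks on 2 nodes, 16 tokens, hidden dim 32
partials = [torch.randn(16, 32, dtype=torch.bfloat16) for _ in range(8)]
out = hierarchical_combine(partials, node_of_rank=[0, 0, 0, 0, 1, 1, 1, 1])
```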
Hybrid-EP was tested across many hardware platforms with the following test conditions:
- HIDDEN_DIM is 8,192
- DATA_TYPE is BF16; only token data is transferred.
- NUM_OF_ATTN_TOKENS_PER_RANK is 4,096. NUM_OF_EXPERTS_PER_RANK is 2.
- The routing map is generated randomly from a uniform distribution.
- TOPK is 8.
- Use the NVIDIA Quantum InfiniBand network.
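As a quick back-of-the-envelope check on these settings, the per-rank dispatch payload works out roughly as follows (ignoring routing metadata and any locality savings):

```python
# Per-token payload: HIDDEN_DIM x 2 bytes (BF16)
hidden_dim, bytes_per_elem = 8192, 2
token_bytes = hidden_dim * bytes_per_elem            # 16 KiB per token
tokens_per_rank, topk = 4096, 8
# Each token is replicated to its topk experts, so a rank emits up to:
dispatch_bytes = tokens_per_rank * topk * token_bytes
print(f"{dispatch_bytes / 2**20:.0f} MiB per rank per dispatch")  # ~512 MiB
```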
The first test was run on an NVIDIA DGX Hopper platform with 8 H100 GPUs. Hybrid-EP fills NVLink bandwidth with only eight SMs.

Then, Hybrid-EP was tested across four NVIDIA DGX Hopper platforms, a total of 8×4 = 32 GPUs on the cluster. Each of the four DGX H100 GPUs on the same rail had an NVIDIA ConnectX-7 NIC at 400 Gbps, connected via the NVIDIA Quantum InfiniBand network.
Because Hybrid-EP performs hierarchical communication over the NVLink-RDMA hybrid network, two sets of data were collected during the test:
- The real speed achieved on the ConnectX-7 NIC is calculated using the amount of data passing through it and the total communication time, i.e., the NIC bus bandwidth.
- Global bandwidth is calculated at the algorithm level and measures the bandwidth that dispatch and combine operators use in a hybrid network, i.e., the algorithm bandwidth.
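The distinction between the two figures can be illustrated with a small sketch using hypothetical numbers:

```python
def bandwidths(nic_bytes, total_bytes_moved, elapsed_s):
    """Hypothetical bandwidth accounting for one dispatch/combine call.

    nic_bytes:         bytes that actually crossed the ConnectX-7 NIC (RDMA traffic only)
    total_bytes_moved: bytes the operator moved end to end (NVLink + RDMA, algorithm view)
    """
    bus_bw = nic_bytes / elapsed_s / 1e9           # GB/s seen at the NIC
    algo_bw = total_bytes_moved / elapsed_s / 1e9  # GB/s credited to the operator
    return bus_bw, algo_bw

# Example: 0.4 GB over the NIC, 1.6 GB moved in total, in 10 ms
print(bandwidths(0.4e9, 1.6e9, 0.010))             # -> (40.0, 160.0) GB/s
```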
Hybrid-EP requires only about 4 SMs to approach the NIC’s maximum bandwidth.

Finally, Hybrid-EP performance in large-scale NVLink networks on the NVIDIA Grace Blackwell platform was tested. The NVLink domain size was 36 GPUs (GB200 NVL36). Hybrid-EP requires only 16 SMs to fill NVLink bandwidth.

Practical cases: Verification on popular models and hardware
Hybrid-EP is implemented as templated CUDA C code that takes input and output buffer addresses. Some additional integration work is therefore required to use Hybrid-EP in the PyTorch-based Megatron Core framework. This integration is now available in the DeepEP/Hybrid-EP branch, which provides directly callable PyTorch operators, making it convenient for users to quickly complete integration and testing.
Since the Hybrid-EP kernel only accepts pointer parameters and isn’t responsible for memory management, it’s necessary to design a reasonable buffer management and allocation mechanism. Depending on the usage scenario, Hybrid-EP buffers can be roughly divided into two categories:
- Registered buffer: Refers to specially registered GPU memory that can be accessed by kernels on other ranks. It is the only global static buffer. Registration depends on the scenario: cross-node communication registers GPU memory with the communication memory region, while non-cross-node communication uses a driver-API handle resolvable by other ranks.
- Normal buffer: Refers to GPU memory allocated with cudaMalloc, which can be managed by PyTorch’s allocator and is usually not globally unique.
Because buffer allocation and registration are time-consuming, they should ideally be completed only during the Hybrid-EP initialization phase in Python. However, the MoE model is dynamic: the number of tokens received by the current rank varies from iteration to iteration, and so does the required buffer size. To handle this, Hybrid-EP uses a worst-case preallocation strategy and allocates a buffer at the upper limit, large enough for all tokens to converge to the same rank. Because this buffer is globally unique, overall GPU memory usage remains controllable.
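One way to picture the worst-case sizing is the hypothetical helper below; the exact bound used here (at most `min(topk, experts_per_rank)` copies of every token in the EP group) is an assumption for illustration, not Hybrid-EP's formula.

```python
import torch

def worst_case_recv_capacity(global_tokens, topk, experts_per_rank, hidden_dim,
                             dtype=torch.bfloat16):
    """Upper-bound receive size for one rank: every token in the EP group routes as many
    of its topk copies as this rank can host (at most experts_per_rank of them)."""
    copies_per_token = min(topk, experts_per_rank)
    elem_size = torch.empty((), dtype=dtype).element_size()
    return global_tokens * copies_per_token * hidden_dim * elem_size

# Example: EP group of 32 ranks x 4,096 tokens, topk 8, 2 experts per rank, hidden 8,192
nbytes = worst_case_recv_capacity(32 * 4096, 8, 2, 8192)
print(f"worst-case receive buffer: {nbytes / 2**30:.1f} GiB")
# Allocated (and registered) once during initialization, then reused every iteration:
# recv_buffer = torch.empty(nbytes, dtype=torch.uint8, device="cuda")
```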



In the PyTorch environment, Hybrid-EP's workflow is shown in Figure 10. After preprocessing, a synchronization is required because Torch needs the GPU-side results to determine subsequent tensor sizes, while Hybrid-EP computes buffer sizes in the preprocessing kernel. This sync can be avoided if the host predefines a sufficiently large buffer size.
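In code, the two paths look roughly like the sketch below, with placeholder callables standing in for the preprocessing and dispatch operators; the actual operator names in the DeepEP/Hybrid-EP branch may differ.

```python
import torch

def dispatch_with_sync(preprocess_kernel, dispatch_kernel, tokens, topk_ids):
    """Device-computed sizes: a preprocessing kernel counts the tokens this rank will
    receive on the GPU, so the host must synchronize before Torch can size the outputs."""
    recv_counts = preprocess_kernel(topk_ids)   # GPU tensor of per-rank receive counts
    torch.cuda.synchronize()                    # host blocks until the counts are ready
    num_recv = int(recv_counts.sum().item())
    return dispatch_kernel(tokens, topk_ids, num_recv)

def dispatch_without_sync(dispatch_kernel, tokens, topk_ids, max_recv_tokens):
    """Host-predefined capacity: size the outputs to the preallocated worst case and skip
    the synchronization; the valid length is recovered later from the routing metadata."""
    return dispatch_kernel(tokens, topk_ids, max_recv_tokens)
```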

Optimization practices on Grace Blackwell
Megatron Core has integrated Hybrid-EP on the Grace Blackwell platform, where it can be used to optimize different types of MoE models.
| Model | Precision | Dispatcher | TFLOPS/GPU | Speedup |
| --- | --- | --- | --- | --- |
| DeepSeek-V3 | MXFP8 | DeepEP | 829 | 1x |
| DeepSeek-V3 | MXFP8 | Hybrid-EP | 943 | 1.14x |
| DeepSeek-V3-FSDP | MXFP8 | A2A | 597 | 1x |
| DeepSeek-V3-FSDP | MXFP8 | Hybrid-EP | 645 | 1.08x |
| Qwen 3 235B | BF16 | A2A | 665 | 1x |
| Qwen 3 235B | BF16 | Hybrid-EP | 698 | 1.05x |
| Qwen 3 235B | MXFP8 | A2A | 728 | 1x |
| Qwen 3 235B | MXFP8 | Hybrid-EP | 800 | 1.10x |
The results include:
- In the DeepSeek-V3 scenario (256 experts, topk-8, without MTP), Hybrid-EP achieves about a 14% performance improvement over DeepEP.
- With Megatron-FSDP, Hybrid-EP still delivers about an 8% performance improvement.
- In the Qwen 3 235B scenario, there is a 5.5% improvement with BF16 and about a 9.9% improvement with MXFP8.
Learn more about how NVIDIA is enabling 10x performance and 1/10 cost for deploying MoE models.