
Advancing Emerging Optimizers for Accelerated LLM Training with NVIDIA Megatron

Higher-order optimization algorithms such as Shampoo have been effectively applied in neural network training for at least a decade. These methods have achieved significant success more recently when applied to leading LLMs. In particular, Muon (MomentUm Orthogonalized by Newton-Schulz) was used to train some of today’s best open source models, including Kimi K2 and GLM-5.

This post explains how NVIDIA provides comprehensive support for Muon and other emerging optimizers, and the technologies that enable them to train large-scale models.

Muon training performance on NVIDIA GB300 NVL72 

Table 1 summarizes training throughput of the Kimi K2 and Qwen3 30B models with the Muon and AdamW optimizers on the NVIDIA GB300 NVL72 system. With the technologies introduced in the next section, the results show only a very small training throughput loss with the Muon optimizer compared to AdamW. Model FLOPs utilization (MFU) is higher with Muon when counting the FLOPs of the matrix multiplications in the Newton-Schulz iterations.

These measurements were achieved using NVIDIA NeMo Megatron Bridge 26.02, a PyTorch-native library within the NeMo Framework that provides pretraining, SFT, and LoRA for popular LLMs and VLMs. The NVIDIA team used 256 NVIDIA GB300 GPUs with PP4DP64EP64 for Kimi K2 training and eight NVIDIA GB300 GPUs with DP8EP8 for Qwen3 30B-A3B training.

NVIDIA GB300 MXFP8 (TFLOP/s/GPU)

Model           AdamW   Muon
Kimi K2         1,051   1,080 (1,029 model plus 51 Muon)
Qwen3 30B-A3B   713     721 (686 model plus 35 Muon)
Table 1. Training throughput measured with NVIDIA Megatron-Bridge 26.02

For more detailed experimental settings and steps to reproduce the measurement, see the Quickstart section below.

Enabling technologies for large-scale Muon training

While higher-order optimizers like Muon promise to improve LLM training efficiency, their practical deployment at scale faces several hurdles. The preconditioning step, using Newton-Schulz iteration or eigen decomposition, significantly increases computational cost and memory consumption. In addition, numerical instability can arise during mixed-precision training and gradient accumulation. Efficiently distributing the synchronized, orthogonalized updates across thousands of GPUs may introduce communication bottlenecks that can diminish the overall efficiency benefits.

This section introduces technologies to overcome these hurdles and scale to thousands of GPUs. The selected technologies are balanced among generality, throughput, and implementation complexity. We favor technologies that can be generalized to Muon, SOAP (ShampoO with Adam in the Preconditioner’s eigenbasis), and other complex optimizers instead of solely focusing on Muon, for which more optimizations would apply because of its simplicity.

Layer-wise distributed optimizer

A traditional element-wise distributed optimizer (commonly applied to element-wise optimizers like AdamW at scale) provides the following:

  • Partitioning states: Instead of every GPU storing everything, the optimizer states are evenly sliced up across all available GPUs.
  • Reduce-scatter gradients: A reduce-scatter is performed over all gradients and each GPU gets a portion of gradients corresponding to the parameters it “owns.”
  • Local updates: Each GPU updates only the specific portion of the model parameters it “owns.”
  • All-gather parameters: After the update, GPUs communicate to ensure all devices have the updated weights of the full model for the next forward pass.
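
The four steps above can be sketched in a single-process simulation. `simulate_zero1_step`, its plain-SGD update rule, and the explicit sums and concatenations standing in for reduce-scatter and all-gather are illustrative only, not the Megatron Core implementation:

```python
import torch

def simulate_zero1_step(params, per_rank_grads, lr=0.1):
    """Single-process sketch of an element-wise distributed optimizer step.

    Each entry of per_rank_grads plays the role of one DP rank's gradient.
    """
    world_size = len(per_rank_grads)
    n = params.numel()
    shard = n // world_size  # assume n is divisible by world_size

    # Reduce-scatter: sum gradients; each rank keeps only its own shard.
    total_grad = torch.stack(per_rank_grads).sum(dim=0)
    shards = []
    for rank in range(world_size):
        sl = slice(rank * shard, (rank + 1) * shard)
        # Local update: each rank updates only the parameters it "owns".
        shards.append(params[sl] - lr * total_grad[sl] / world_size)

    # All-gather: every rank reassembles the full updated parameter vector.
    return torch.cat(shards)

torch.manual_seed(0)
p = torch.randn(16)
grads = [torch.randn(16) for _ in range(4)]
updated = simulate_zero1_step(p, grads)
# Matches a plain single-device SGD step on the averaged gradient.
reference = p - 0.1 * torch.stack(grads).mean(dim=0)
assert torch.allclose(updated, reference, atol=1e-6)
```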

A more advanced version breaks down operations and overlaps communication with computations. 

Element-wise distributed optimizers work well with Adam-style optimizers. However, many emerging optimizers, including Muon, require the gradients of an entire layer to calculate updates to the weights of that layer. If weights and optimizer states are evenly distributed among data parallel (DP) ranks, updates can’t be calculated from the data available on each GPU, and additional communication is needed to collect the data required for the full update.

A layer-wise distributed optimizer is introduced to support this type of optimizer. Parameters of different layers are distributed to different DP ranks. Each GPU holds the full parameters of the layers it owns, so the preconditioner can be calculated locally.

One change that comes with layer-wise distribution is variable-size communication. Because layers vary in size, each GPU needs to collect differently sized parameter updates from different GPUs through all_gatherv.
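
One common way to implement a variable-size gather on top of a fixed-size collective is to exchange sizes first and then pad each contribution to the maximum. The sketch below simulates that pattern in a single process; `all_gatherv_sim` is a hypothetical helper, not the Megatron Core code path:

```python
import torch

def all_gatherv_sim(rank_tensors):
    """Single-process sketch of a variable-size all-gather (all_gatherv).

    In a real run, the two rounds below would be torch.distributed calls;
    here "communication" is just reading the other entries of the list.
    """
    # Round 1: exchange sizes (a small fixed-size all-gather).
    sizes = [t.numel() for t in rank_tensors]
    max_size = max(sizes)

    # Round 2: pad every contribution to the max size so a fixed-size
    # all-gather can be used, then strip the padding on receipt.
    padded = [torch.cat([t, t.new_zeros(max_size - t.numel())])
              for t in rank_tensors]
    gathered = [p[:n] for p, n in zip(padded, sizes)]
    return torch.cat(gathered)

# Layers of different sizes owned by 3 DP ranks.
parts = [torch.arange(4.0), torch.arange(7.0), torch.arange(2.0)]
full = all_gatherv_sim(parts)
assert full.numel() == 13
assert torch.equal(full[4:11], torch.arange(7.0))
```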

This layer-wise distributed optimizer is fully integrated into NVIDIA Megatron Core, an open source library for building and training large-scale models with advanced parallelism, mixed precision, and optimized GPU kernels. The implementation can be found in layer_wise_optimizer.py.

Distributed Newton-Schulz

LLM training at scale often uses tensor parallelism (TP) together with other parallelisms, which introduces unique challenges. TP splits individual weight matrices across multiple GPUs along specific dimensions, meaning no single device holds the full parameter tensor. The critical orthogonalization step of the Muon optimizer requires access to the entire momentum matrix. When this momentum buffer is sharded across devices, additional communication is necessary to handle the sharded momentum and weights.
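
The orthogonalization itself can be sketched as a quintic Newton-Schulz iteration with three matrix multiplications per step. The coefficients below are the ones published with the open source Muon reference implementation and are shown for illustration only:

```python
import torch

# Quintic Newton-Schulz coefficients commonly used for Muon
# (from the public Muon reference implementation; illustrative).
A_COEF, B_COEF, C_COEF = 3.4445, -4.7750, 2.0315

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G (map it toward U @ V^T of its SVD)
    using the quintic Newton-Schulz iteration: three matmuls per step."""
    X = G / (G.norm() + eps)      # normalize so singular values are <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                # iterate on the smaller Gram-matrix side
        X = X.T
    for _ in range(steps):
        A = X @ X.T                        # matmul 1 (SYRK-shaped)
        B = B_COEF * A + C_COEF * (A @ A)  # matmul 2 (also SYRK-shaped)
        X = A_COEF * X + B @ X             # matmul 3
    return X.T if transposed else X

torch.manual_seed(0)
G = torch.randn(32, 64)
O = newton_schulz(G)
# After 5 steps, singular values land in a band around 1 (not exactly 1).
s = torch.linalg.svdvals(O)
assert s.min() > 0.2 and s.max() < 1.6
```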

Two methods are provided to handle such scenarios in TensorParallelMuon: duplicated mode and distributed mode.

Duplicated mode

Momentum buffers are first all-gathered within the TP domain so that each GPU has all the data for the Newton-Schulz (NS) iteration. Then each GPU performs full NS iterations and updates the weights it owns with the corresponding portion of the orthogonalized momentum. This mode is network latency optimized at the cost of each GPU performing the same NS iterations. For each update, regardless of how many NS iterations are used (commonly five), only one all-gather communication is needed upfront.
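
A single-process sketch of duplicated mode, with exact SVD orthogonalization standing in for the NS iteration and list entries standing in for TP ranks:

```python
import torch

def orthogonalize(M):
    # Stand-in for the Newton-Schulz iteration: exact U @ V^T via SVD.
    U, _, Vh = torch.linalg.svd(M, full_matrices=False)
    return U @ Vh

def duplicated_mode_step(momentum_shards):
    """Sketch of duplicated mode: one all-gather of the momentum shards,
    then every TP rank runs the same (full) orthogonalization and keeps
    only the rows of the result matching the weights it owns."""
    # All-gather: every rank reconstructs the full momentum matrix.
    full = torch.cat(momentum_shards, dim=0)
    updates = []
    row = 0
    for shard in momentum_shards:       # one loop iteration per TP rank
        ortho = orthogonalize(full)     # identical, redundant compute
        updates.append(ortho[row:row + shard.shape[0]])
        row += shard.shape[0]
    return updates

torch.manual_seed(0)
shards = [torch.randn(8, 16) for _ in range(4)]  # momentum over 4 TP ranks
updates = duplicated_mode_step(shards)
full_ref = orthogonalize(torch.cat(shards, dim=0))
assert torch.allclose(torch.cat(updates, dim=0), full_ref, atol=1e-5)
```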

Distributed mode

As the name suggests, computation of the NS iterations is distributed among GPUs in the TP domain. Each NS iteration performs three matrix multiplications, and an all-reduce is needed after the first matrix multiplication of each iteration. This mode is computation optimized at the cost of more frequent communication.
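
Because a column-sharded X gives X Xᵀ = Σᵢ Xᵢ Xᵢᵀ, only the first matmul needs a cross-rank sum; the other two stay local. A single-process sketch of one such step, where list entries stand in for TP ranks and the Python `sum` plays the role of the all-reduce:

```python
import torch

COEFS = (3.4445, -4.7750, 2.0315)  # published Muon quintic coefficients

def distributed_ns_step(shards):
    """One sketched Newton-Schulz step with X column-sharded over TP ranks.

    X = [X_0 | X_1 | ...], so X @ X^T = sum_i X_i @ X_i^T: only the first
    matmul needs an all-reduce; the other two stay local to each rank.
    """
    a, b, c = COEFS
    # Matmul 1 + all-reduce: each rank contributes its partial Gram matrix.
    A = sum(x @ x.T for x in shards)   # all-reduce(sum) in a real run
    # Matmuls 2 and 3: computed locally on every rank.
    B = b * A + c * (A @ A)
    return [a * x + B @ x for x in shards]

torch.manual_seed(0)
X = torch.randn(16, 32) / 40.0
shards = list(X.chunk(4, dim=1))       # column shards for 4 TP ranks
out = torch.cat(distributed_ns_step(shards), dim=1)
# Equivalent to one full (unsharded) Newton-Schulz step.
a, b, c = COEFS
A = X @ X.T
ref = a * X + (b * A + c * (A @ A)) @ X
assert torch.allclose(out, ref, atol=1e-5)
```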

In addition, a blockwise mode is also supported, which performs orthogonalization and updates using only the momentum and weights owned by each GPU. Compared to the other two modes, it is cheaper to compute and requires no communication. However, it is not equivalent to orthogonalizing the entire momentum matrix together. Conceptually, it is a block orthogonalization (hence the name blockwise), similar to the blocking approach first introduced in Scalable Second Order Optimization for Deep Learning.
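
A tiny check makes the approximation concrete (exact SVD orthogonalization again stands in for the NS iteration):

```python
import torch

def orth(M):
    # Stand-in for Newton-Schulz: exact U @ V^T via SVD.
    U, _, Vh = torch.linalg.svd(M, full_matrices=False)
    return U @ Vh

torch.manual_seed(0)
full = torch.randn(16, 16)
blocks = list(full.chunk(2, dim=0))  # each "GPU" owns half the rows
# Blockwise: orthogonalize each owned block independently, no communication.
blockwise = torch.cat([orth(b) for b in blocks], dim=0)
# Not the same as orthogonalizing the whole matrix together.
assert not torch.allclose(blockwise, orth(full), atol=1e-3)
```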

Additional optimizations

Other optimizations can further boost training throughput. Some important ones are briefly discussed in this section and will be gradually introduced through the NVIDIA Emerging Optimizers research project.

Communication hiding

In the current release of the NVIDIA Emerging Optimizers project, parameters are gathered immediately after the optimizer step and fully exposed, which can become a bottleneck under certain circumstances. The same technique used in element-wise distributed optimizers also applies: parameter gathering can be delayed to the forward step of the next batch and overlapped with computation.

Load balancing

Because attention and MLP weight matrices vary in size, the preconditioning cost of the optimizer also varies across layers. Perfect load balancing can’t be achieved even in theory. Currently, layers are distributed among GPUs in the same DP domain in a round-robin fashion. Methods exist that can achieve better load balancing based on estimates of computational and communication cost.
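
A sketch of the two policies, with hypothetical per-layer costs; `greedy_lpt` is one simple estimate-based improvement (longest-processing-time-first), not the planned Megatron implementation:

```python
def round_robin(costs, num_ranks):
    """Current scheme: layer i goes to rank i % num_ranks."""
    ranks = [[] for _ in range(num_ranks)]
    for i in range(len(costs)):
        ranks[i % num_ranks].append(i)
    return ranks

def greedy_lpt(costs, num_ranks):
    """Longest-processing-time-first: place each layer, heaviest first,
    on the least-loaded rank. 'costs' would come from an estimate of
    per-layer preconditioning cost."""
    ranks = [[] for _ in range(num_ranks)]
    loads = [0.0] * num_ranks
    for i in sorted(range(len(costs)), key=lambda i: -costs[i]):
        r = loads.index(min(loads))
        ranks[r].append(i)
        loads[r] += costs[i]
    return ranks

costs = [9.0, 1.0, 8.0, 2.0, 7.0, 3.0]   # hypothetical per-layer costs
rr_max = max(sum(costs[i] for i in r) for r in round_robin(costs, 2))
lpt_max = max(sum(costs[i] for i in r) for r in greedy_lpt(costs, 2))
# LPT never produces a worse maximum per-rank load than round-robin here.
assert lpt_max <= rr_max
```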

SYRK and fused all-reduce

The first two of the three matrix multiplications of an NS iteration can be mapped to SYRK (SYmmetric Rank-K update), so that close to half of the floating point operations can be saved. The saving is slightly less than half because tiles along the diagonal need to be fully computed. Triton kernels are provided in the current release.
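
The shape of the saving can be illustrated with a plain PyTorch tile loop; the actual release uses Triton kernels, and `syrk_tiled` here is only a functional sketch:

```python
import torch

def syrk_tiled(X, tile=8):
    """Sketch of a SYRK-style computation of A = X @ X^T.

    Only tiles on or above the diagonal are computed with a matmul;
    strictly-lower tiles are filled by transposing their mirror. Diagonal
    tiles are still computed in full, so slightly more than half the
    matmul work remains.
    """
    m = X.shape[0]
    assert m % tile == 0
    A = torch.empty(m, m, dtype=X.dtype)
    for i in range(0, m, tile):
        for j in range(i, m, tile):
            block = X[i:i + tile] @ X[j:j + tile].T
            A[i:i + tile, j:j + tile] = block
            if i != j:
                A[j:j + tile, i:i + tile] = block.T
    return A

torch.manual_seed(0)
X = torch.randn(32, 64)
assert torch.allclose(syrk_tiled(X), X @ X.T, atol=1e-4)
```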

SYRK outputs a full matrix rather than the upper (or lower) triangle in compact format. In distributed mode, the all-reduce therefore still needs to communicate the entire matrix instead of half of it. A more advanced version can fuse communication into the SYRK kernel, which not only saves bandwidth but can also hide communication at a finer granularity (that is, at the tile level). A CuTe DSL implementation is planned for a future release.

What other optimizers does NVIDIA support for research?

In addition to Muon, NVIDIA also supports many other optimizers for the research community to explore through the NVIDIA Emerging Optimizers project.

Reproducing Muon training results 

This section explains how to reproduce the throughput results from Table 1 and how Muon is integrated into the NVIDIA software stack.

Quickstart

Follow the instructions in the performance recipes in the NVIDIA-NeMo/Megatron-Bridge GitHub repo.

As an example, to run Kimi on 256 NVIDIA GB300 GPUs, start from the Megatron Bridge repo root:

CONTAINER="nvcr.io/nvidia/nemo:26.04"

python scripts/performance/setup_experiment.py \
  --account <slurm_account> \
  -i ${CONTAINER} \
  --partition <slurm_partition> \
  -m kimi \
  -mr kimi_k2 \
  --log_dir <result_dir> \
  --num_gpus 256 \
  --gpus_per_node 4 \
  -t "00:15:00" \
  -g gb300 \
  -c fp8_mx \
  -hf <HF_TOKEN>

Integrations

Muon is integrated into Megatron Core, including a Muon class and a layer-wise distributed wrapper.

Note that for Muon, enabling use_distributed_optimizer automatically dispatches to the layer-wise distributed optimizer instead of its element-wise counterpart.

Get started with emerging optimizers for LLM training

Higher-order optimizers like Muon are proving essential for pushing the boundaries of LLM training efficiency. Through layer-wise distributed optimization, tailored distributed Newton-Schulz iterations, and ongoing work on communication hiding and fused SYRK/all-reduce kernels, these powerful methods can now be deployed effectively at massive scale. With Megatron Core, you can start using them now.

When choosing a configuration, keep these considerations in mind:

  • Distributed NS mode: Use duplicated mode when network latency dominates. Use distributed mode when computation is the bottleneck. Blockwise mode avoids communication entirely but approximates the full orthogonalization.
  • Scale: The layer-wise distributed optimizer enables Muon at large DP scales with minimal overhead. Table 1 shows near-parity with AdamW throughput on NVIDIA GB300.
  • Future improvements: Communication hiding, better load balancing, and fused SYRK/all-reduce kernels will further close any remaining throughput gap in upcoming releases.

Ready to get started? Check out the Megatron Bridge performance recipes.

  

  
