The journey to create a state-of-the-art large language model (LLM) begins with a process called pretraining. Pretraining a state-of-the-art model is computationally demanding, with popular open-weights models featuring tens to hundreds of billions of parameters and trained on trillions of tokens. As model intelligence grows with increasing parameter count and training dataset size, so does the compute required to train the model, and higher-performance training clusters are needed to unlock smarter, more capable models while keeping training times in check.
After a model is pretrained, it can then be post-trained to further enhance its capabilities. For example, an enterprise may customize a pretrained model with its own proprietary datasets to improve knowledge and response accuracy for that organization’s specific use cases. Other post-training techniques can be applied to extend supported context length or to augment a model with reasoning capabilities. In aggregate, while post-training a single model may be less computationally intensive than pretraining a model today, it is growing quickly as researchers find new ways to increase model capabilities, and it can be performed by many organizations to customize models.
MLPerf Training v5.0 is the latest version of the long-running MLPerf Training series of benchmarks, which measure how quickly a platform can train models to predetermined quality thresholds. The benchmark suite currently consists of seven benchmarks that span a range of domains—LLM pretraining, LLM fine-tuning, text-to-image generation, recommender systems, graph neural networks, natural language processing, and object detection.
In this latest round of MLPerf Training, the NVIDIA platform delivered the fastest time to train across all seven benchmarks.
| Benchmark | Time to Train (minutes) |
| --- | --- |
| LLM Pretraining (Llama 3.1 405B) | 20.8 |
| LLM Fine-Tuning (Llama 2 70B-LoRA) | 0.56 |
| Text-to-Image (Stable Diffusion v2) | 1.04 |
| Graph Neural Network (R-GAT) | 0.84 |
| Recommender (DLRM-DCNv2) | 0.7 |
| Natural Language Processing (BERT) | 0.3 |
| Object Detection (RetinaNet) | 1.4 |
MLPerf Training v5.0 results retrieved from www.mlcommons.org on June 4, 2025, from the following entries: 5.0-0010 (NVIDIA), 5.0-0074 (NVIDIA), 5.0-0076 (NVIDIA), 5.0-0077 (NVIDIA), 5.0-0087 (SuperMicro). The MLPerf name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
This round also marked the first MLPerf Training submissions using the NVIDIA GB200 NVL72 rack-scale system, with excellent results from both NVIDIA as well as many NVIDIA partners. This post provides a closer look at these results and how they were achieved.
NVIDIA Blackwell delivers a big boost for LLM pretraining
NVIDIA Blackwell incorporates many architectural innovations compared to the prior-generation NVIDIA Hopper architecture. These advancements include significant increases in per-GPU compute performance, as well as technologies, like fifth-generation NVLink and NVLink Switch, that increase bandwidth between GPUs and significantly expand the NVLink domain size, enabling model developers to train models more quickly.
The breakthroughs include the new second-generation Transformer Engine, faster and wider NVIDIA NVLink interconnects, and higher-bandwidth, higher-capacity HBM3e memory. These architectural capabilities are activated by many innovations across the NVIDIA software stack and allowed GB200 NVL72 to train 2.2x faster than Hopper when 512 GPUs are used to run the Llama 3.1 405B benchmark. GB200 NVL72 achieves up to 1,960 TFLOPS of training throughput on the Llama 3.1 405B pretraining benchmark.
| Benchmark | # GPUs | Hopper | Blackwell | Blackwell Speedup |
| --- | --- | --- | --- | --- |
| Llama 3.1 405B | 512 | 269.12 min. | 121.09 min. | 2.2x |
MLPerf Training v5.0 Closed. Results retrieved on June 4, 2025, from www.mlcommons.org, from the following entries: 5.0-0014 and 5.0-0076. Results verified by MLCommons Association. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.
The GB200 NVL72 system features 72 Blackwell GPUs connected on a single NVLink domain through NVLink connections on the GPU and NVLink Switch chips present in the GB200 NVL72 racks. NVIDIA and partner submissions on GB200 NVL72 used a model parallelism mapping optimized for the GB200 NVL72 system topology to maximize training throughput.
Blackwell also features dramatically higher peak compute performance compared to Hopper across popular AI data formats. To make use of this increased compute performance, the NVIDIA cuBLAS library—which features key linear algebra operations, including general matrix multiply (GEMM)—has been optimized for the Blackwell architecture, with additional tuning specific to GB200 NVL72.
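To put GEMM performance in context, achieved throughput can be measured directly from PyTorch, which dispatches dense matrix multiplies to cuBLAS under the hood. The following is a minimal sketch for measuring BF16 GEMM throughput; the matrix sizes, iteration count, and helper name are illustrative choices for this post, not those used in the submission.

```python
import torch

def gemm_tflops(m=8192, n=8192, k=8192, iters=20, dtype=torch.bfloat16):
    a = torch.randn(m, k, device="cuda", dtype=dtype)
    b = torch.randn(k, n, device="cuda", dtype=dtype)
    # Warm up so cuBLAS heuristics select a kernel before timing starts
    for _ in range(5):
        torch.matmul(a, b)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.matmul(a, b)
    end.record()
    torch.cuda.synchronize()
    secs_per_iter = start.elapsed_time(end) / 1e3 / iters   # elapsed_time is in ms
    return 2 * m * n * k / secs_per_iter / 1e12              # 2*M*N*K FLOPs per GEMM

print(f"BF16 GEMM throughput: {gemm_tflops():.1f} TFLOPS")
```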
Another significant optimization is the use of CUDA Graphs for processing the full forward-backward graph on each GPU for LLMs, rather than using one graph per transformer layer. This significantly reduces the memory footprint associated with using CUDA Graphs by allowing GPU memory to be reused across transformer layers. It also helps to minimize host CPU overhead during execution, a critical optimization when driving blazingly fast Blackwell GPUs. As training is scaled out across many thousands of GPUs and the amount of work performed by each GPU shrinks, elimination of host overheads by CUDA Graphs also significantly improves the scalability of LLMs.
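The following is a minimal sketch of whole-iteration CUDA Graph capture in PyTorch, following the standard torch.cuda.graph pattern: the full forward-backward pass is captured once into a single graph and then replayed each step. The toy model, shapes, and variable names are placeholders and not the NeMo/Megatron-Core implementation.

```python
import torch

# Illustrative model and static input buffer; real LLM training would capture
# the transformer forward-backward with its own static buffers.
model = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(8)]).cuda()
static_input = torch.randn(32, 1024, device="cuda")

# Warm up on a side stream so workspaces and kernel selections settle before capture
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input).sum().backward()
        model.zero_grad(set_to_none=True)
torch.cuda.current_stream().wait_stream(s)

# Capture the entire forward-backward pass as one graph, not one graph per layer
graph = torch.cuda.CUDAGraph()
model.zero_grad(set_to_none=True)
with torch.cuda.graph(graph):
    static_loss = model(static_input).sum()
    static_loss.backward()

# Each training step: refresh the static input in place, then replay with a single launch
static_input.copy_(torch.randn(32, 1024, device="cuda"))
graph.replay()
```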
Next, to increase GPU utilization on GB200 NVL72, the NVIDIA submission this round featured optimized, overlapped execution of GEMM and GPU-to-GPU communication operations. These optimizations include using the CUDA stream priority feature to raise the priority of communication kernels at the scheduler level when they run concurrently with math kernels, and using copy-engine-based implementations for both the reduce-scatter and all-gather collectives used for tensor parallelism (TP) to minimize the SM requirements of TP collectives when math operations are on the critical path. These optimizations are available through the NVIDIA software stack, which includes the NeMo, Megatron-Core, Transformer Engine, and cuBLAS libraries.
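The stream-priority portion of this can be illustrated with a small PyTorch sketch: a tensor-parallel all-gather is issued on a high-priority stream so the GPU scheduler favors it while a GEMM runs concurrently on the default stream. The function, shapes, and group handling below are hypothetical and assume torch.distributed is already initialized; the copy-engine-based collective implementations mentioned above are internal to the NVIDIA stack and are not shown.

```python
import torch
import torch.distributed as dist

def overlapped_tp_allgather_gemm(a, b, shard, gathered, tp_group):
    """Hypothetical sketch: overlap a GEMM with a TP all-gather on a
    high-priority stream. Assumes an initialized process group."""
    comm_stream = torch.cuda.Stream(priority=-1)        # lower value = higher priority
    comm_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(comm_stream):
        # Communication kernel launched at higher priority, so the scheduler
        # favors it when it contends with the concurrent math kernel
        handle = dist.all_gather_into_tensor(gathered, shard, group=tp_group, async_op=True)
    c = a @ b                                            # math kernel on the default stream
    handle.wait()                                        # `gathered` is safe to consume after this
    return c, gathered
```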
The GB200 NVL72 system features 9x the NVLink domain size of the prior-generation Hopper architecture. To optimize performance and provide excellent scalability, NVIDIA has implemented a capability in the Megatron-Core training library that allows for flexible ordering of parallel mappings. In particular, in addition to the existing Tensor Parallel-Context Parallel-Data Parallel-Pipeline Parallel (TP-CP-DP-PP) mapping supported by Megatron-Core, TP-CP-PP-DP (or “DP-Last”) is now also supported. This was found to be optimal for the GB200 NVL72-based system when running the Llama 3.1 405B benchmark at the scales used for this submission (512 to 2,496 GPUs).
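To make the ordering concrete, the sketch below maps a global rank to its (TP, CP, DP, PP) coordinates under both orderings; with the DP-last mapping, data parallelism becomes the slowest-varying dimension, so a rank's TP/CP/PP neighbors stay close together while data-parallel peers are placed farthest apart. The 512-GPU decomposition used here is a hypothetical example, not necessarily the one used in the submission.

```python
def rank_to_coords(rank, sizes, order):
    """Map a global rank to per-dimension coordinates for a given ordering.

    `order` lists the parallel dimensions from fastest-varying to slowest,
    e.g. ("tp", "cp", "dp", "pp") for the default mapping or
    ("tp", "cp", "pp", "dp") for the DP-last mapping."""
    coords = {}
    for dim in order:
        coords[dim] = rank % sizes[dim]
        rank //= sizes[dim]
    return coords

# Hypothetical decomposition of 512 GPUs: TP=4, CP=2, PP=8, DP=8
sizes = {"tp": 4, "cp": 2, "pp": 8, "dp": 8}
print(rank_to_coords(37, sizes, ("tp", "cp", "dp", "pp")))  # default: DP varies before PP
print(rank_to_coords(37, sizes, ("tp", "cp", "pp", "dp")))  # DP-last: DP is the slowest dimension
```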
And, finally, the submission this round makes use of enhanced Flash Attention kernels in the backward pass that more carefully manage GPU register usage to minimize register spills. The optimization is available directly through cuDNN, starting with v9.9.0. Due to two-way context parallelism, attention execution is split across two GPUs, each processing up to 4,096 tokens of sequence length. With this optimization, the attention backward kernel using a causal mask and a sequence length of 4,096 speeds up by 1.3x.
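For reference, the per-GPU attention workload after the two-way context-parallel split looks roughly like the sketch below: a causal attention call over 4,096 tokens whose backward pass exercises the kind of kernel this optimization targets. Head count and head dimension are illustrative, and the cross-GPU key/value exchange that context parallelism also requires is not shown.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes for the local attention shard after the 2-way CP split
batch, heads, seqlen, head_dim = 1, 16, 4096, 128
q, k, v = [
    torch.randn(batch, heads, seqlen, head_dim, device="cuda",
                dtype=torch.bfloat16, requires_grad=True)
    for _ in range(3)
]

# Forward with a causal mask; PyTorch dispatches to a fused attention backend
# (cuDNN or FlashAttention) when one is available for these shapes.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Backward pass over the 4,096-token causal attention shard
out.sum().backward()
```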
Blackwell accelerates LLM fine-tuning
Many organizations will customize existing pretrained models to deliver high accuracy for specific tasks or application domains. MLPerf Training v5.0 incorporates an LLM fine-tuning benchmark that applies the low-rank adaptation (LoRA) technique to Llama 2 70B. Faster model fine-tuning enables organizations to more quickly deploy models customized for their specific use cases, speeding time to deployment.
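As a refresher on the technique, LoRA freezes the pretrained weights and trains only a pair of low-rank matrices whose product is added to the frozen layer's output. The minimal sketch below shows the idea for a single linear layer; the rank, scaling, and dimensions are illustrative, not the benchmark's configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: the pretrained weight stays frozen and only the
    low-rank factors A and B are trained, so the effective update is
    W + (alpha / r) * B @ A. Hyperparameters here are illustrative."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # freeze the pretrained layer
            p.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen path plus the trainable low-rank update
        return self.base(x) + self.scaling * (x @ self.lora_a.T) @ self.lora_b.T

layer = LoRALinear(nn.Linear(4096, 4096)).cuda()
y = layer(torch.randn(8, 4096, device="cuda"))
```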
Compared to the NVIDIA submission using an NVIDIA DGX H100 system with eight NVIDIA H100 Tensor Core GPUs in the prior round, eight Blackwell GPUs running as part of a GB200 NVL72 system deliver 2.5x faster time to train.
| Benchmark | # GPUs | Hopper | Blackwell | Blackwell Speedup |
| --- | --- | --- | --- | --- |
| Llama 2 70B LoRA | 8 | 27.93 min. | 11.14 min. | 2.51x |
MLPerf Training v5.0 Closed. Results retrieved on June 4, 2025, from www.mlcommons.org, from entry 5.0-0071. H100 result from MLPerf Training v4.1 Closed from entry 4.1-0050. Results verified by MLCommons Association. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.
In addition to the large speedups enabled by the higher compute performance of each Blackwell GPU compared to Hopper, the larger memory capacity of Blackwell allowed the entire Llama 2 70B model to fit on a single GPU. This reduced the amount of model-parallel communication, in turn increasing per-GPU throughput.
The SwiGLU input in the NVIDIA submission is also stored in FP8 rather than the larger BF16 format, further reducing memory footprint. This optimization, coupled with the larger memory capacity described above, enables training with data parallelism alone, avoiding all model-parallel communication overhead.
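A rough sketch of the idea is shown below, assuming a custom autograd function that quantizes the saved SwiGLU input to FP8 (E4M3) and dequantizes it in the backward pass. Per-tensor scaling, which a production implementation such as Transformer Engine would track, is omitted for brevity.

```python
import torch
import torch.nn.functional as F

class FP8CachedSwiGLU(torch.autograd.Function):
    """Sketch: cache the SwiGLU input for backward in FP8 (E4M3) instead of
    BF16, roughly halving the memory held for this activation. Scaling is
    omitted; values outside the FP8 range would saturate."""

    @staticmethod
    def forward(ctx, x):                                  # x: [..., 2 * hidden]
        gate, up = x.chunk(2, dim=-1)
        ctx.save_for_backward(x.to(torch.float8_e4m3fn))  # FP8 copy kept for backward
        return F.silu(gate) * up

    @staticmethod
    def backward(ctx, grad_out):
        (x_fp8,) = ctx.saved_tensors
        x = x_fp8.to(torch.bfloat16)                      # dequantize
        gate, up = x.chunk(2, dim=-1)
        sig = torch.sigmoid(gate)
        d_gate = grad_out * up * sig * (1 + gate * (1 - sig))  # d/dgate of silu(gate)*up
        d_up = grad_out * gate * sig                            # silu(gate)
        return torch.cat([d_gate, d_up], dim=-1)

x = torch.randn(8, 8192, device="cuda", dtype=torch.bfloat16, requires_grad=True)
FP8CachedSwiGLU.apply(x).sum().backward()
```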
This submission also benefited from the enhanced Root Mean Square Layer Normalization (RMSNorm) kernels that are part of cuDNN. RMSNorm is a crucial operation employed in most recent LLMs for improved stability as the model grows.
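For reference, the sketch below is a plain PyTorch definition of RMSNorm: activations are scaled by the reciprocal root-mean-square of the features and a learned per-channel weight, with no mean subtraction or bias as in LayerNorm. The cuDNN kernels used in the submission are fused, optimized versions of this computation.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Reference RMSNorm: normalize by the root-mean-square of the last
    dimension (computed in FP32 for stability), then apply a learned scale."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.float().pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (self.weight * (x.float() * rms)).type_as(x)

norm = RMSNorm(4096).cuda()
y = norm(torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16))
```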
Finally, after results submission, NVIDIA implemented additional optimizations that increased performance on both Hopper and Blackwell GPUs. These optimizations are planned for the NVIDIA NeMo Framework 25.07 release.
| Llama 2 70B LoRA | # GPUs | June 2025 Unverified Result | Speedup versus Verified Result |
| --- | --- | --- | --- |
| NVIDIA H200 | 8 | 21.84 min. | 10% |
| Blackwell (GB200 NVL72) | 8 | 10.34 min. | 8% |
MLPerf Training v5.0 Closed. Results retrieved on June 4, 2025, from www.mlcommons.org, from entry 5.0-0071. H200 result from MLPerf Training v4.1 Closed from entry 4.1-0047. June 2025 performance results not verified by MLCommons Association. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.
Blackwell boosts text-to-image pretraining
On the Stable Diffusion v2 pretraining benchmark, compared to NVIDIA submissions in the prior round using the H100 Tensor Core GPU, GB200 NVL72 delivered 2.6x higher performance per GPU and set a new performance record at scale.
| Benchmark | # GPUs | Hopper | Blackwell | Blackwell Speedup |
| --- | --- | --- | --- | --- |
| Stable Diffusion v2 | 8 | 33.97 min. | 12.86 min. | 2.64x |
MLPerf Training v5.0 Closed. Results retrieved on June 4, 2025, from www.mlcommons.org, from entry 5.0-0071. H100 result from MLPerf Training v4.1 Closed from entry 4.1-0050. Results verified by MLCommons Association. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.
Behind these great results were several key optimizations.
First is an improved Apex GroupNorm kernel that reduces memory footprint and also increases performance. Next, the NVIDIA submission this round improved data-parallel communications by pipelining the reduce-scatter and all-reduce operations in the Apex DistributedAdam kernel. And, finally, by increasing the distributed optimizer group size to make use of the 72 GPUs within an NVLink domain, the NVIDIA submission this round achieved higher performance both in the 72-GPU submissions and at the maximum scale of 512 GPUs.
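The pipelining idea can be sketched as follows: each flat gradient bucket is reduce-scattered across the distributed-optimizer group asynchronously, and as soon as a bucket's reduce-scatter completes, its all-reduce across replica groups starts while later reduce-scatters are still in flight. The function, group layout, and bucket handling below are simplified placeholders, not the Apex DistributedAdam implementation, and they assume torch.distributed is already initialized with buckets evenly divisible across the shard group.

```python
import torch
import torch.distributed as dist

def pipelined_grad_reduction(grad_buckets, shard_group, replica_group):
    """Hypothetical sketch of pipelining reduce-scatter and all-reduce across
    gradient buckets so the two collectives overlap instead of serializing."""
    shard_world = dist.get_world_size(shard_group)
    rs_handles, shards = [], []
    for grad in grad_buckets:                              # grad: flat 1-D bucket
        shard = torch.empty(grad.numel() // shard_world,
                            device=grad.device, dtype=grad.dtype)
        rs_handles.append(dist.reduce_scatter_tensor(shard, grad,
                                                     group=shard_group, async_op=True))
        shards.append(shard)
    ar_handles = []
    for h, shard in zip(rs_handles, shards):
        h.wait()                                           # this bucket's reduce-scatter is done...
        # ...so its all-reduce can start while later reduce-scatters are still in flight
        ar_handles.append(dist.all_reduce(shard, group=replica_group, async_op=True))
    for h in ar_handles:
        h.wait()
    return shards
```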
Blackwell speeds up graph neural network training performance
On the graph neural network (GNN) benchmark, which is based on R-GAT, NVIDIA submissions using GB200 NVL72 delivered a 2.2x improvement in per-GPU performance compared to NVIDIA submissions using the H100 Tensor Core GPU.
| Benchmark | # GPUs | Hopper | Blackwell | Blackwell Speedup |
| --- | --- | --- | --- | --- |
| GNN | 8 | 11.18 min. | 4.97 min. | 2.25x |
MLPerf Training v5.0 Closed. Results retrieved on June 4, 2025, from www.mlcommons.org, from entry 5.0-0071. H100 result from MLPerf Training v4.1 Closed from entry 4.1-0048. Results verified by MLCommons Association. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.
These results were enabled by many optimizations, including extending the scope of CUDA Graphs to include the optimizer, which reduced CPU overhead.
We also fused several small, latency-limited copy operations needed to set up data buffers for CUDA Graphs into a single Triton kernel, significantly reducing the overhead of launching these copies.
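The sketch below shows the general technique, assuming three small staging buffers: a single Triton kernel uses one grid axis to select the buffer and another to tile within it, so all three copies complete with one launch instead of three. The kernel and wrapper names are hypothetical, and the submission's actual kernel handles the benchmark's real buffer layout.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_copy3_kernel(dst0, src0, n0,
                       dst1, src1, n1,
                       dst2, src2, n2,
                       BLOCK: tl.constexpr):
    # Grid axis 0 picks which buffer to copy, axis 1 picks the block within it,
    # so three small copies run from a single kernel launch.
    buf = tl.program_id(0)
    offs = tl.program_id(1) * BLOCK + tl.arange(0, BLOCK)
    if buf == 0:
        mask = offs < n0
        tl.store(dst0 + offs, tl.load(src0 + offs, mask=mask), mask=mask)
    elif buf == 1:
        mask = offs < n1
        tl.store(dst1 + offs, tl.load(src1 + offs, mask=mask), mask=mask)
    else:
        mask = offs < n2
        tl.store(dst2 + offs, tl.load(src2 + offs, mask=mask), mask=mask)

def fused_copy3(dsts, srcs, block=1024):
    """Copy three flat tensors with one kernel launch instead of three."""
    grid = (3, triton.cdiv(max(t.numel() for t in srcs), block))
    fused_copy3_kernel[grid](dsts[0], srcs[0], srcs[0].numel(),
                             dsts[1], srcs[1], srcs[1].numel(),
                             dsts[2], srcs[2], srcs[2].numel(),
                             BLOCK=block)

srcs = [torch.randn(n, device="cuda") for n in (1000, 2000, 3000)]
dsts = [torch.empty_like(s) for s in srcs]
fused_copy3(dsts, srcs)
```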
Key takeaways
Submissions this round using NVIDIA GB200 NVL72-based systems, which are powered by the NVIDIA Blackwell platform, delivered outstanding performance in MLPerf Training v5.0. In addition to providing up to 2.6x more performance per GPU compared to Hopper, GB200 NVL72 submissions also delivered excellent performance at scale, demonstrating near-linear scaling efficiency on the demanding LLM pretraining benchmark based on Llama 3.1 405B.
These performance increases can enable shorter time to solution and ultimately time to value, as AI models move from training and post-training to deployment. More performance can enable training of larger and more complex base models, laying the foundation for even more capable reasoning models.
To reproduce these results for NVIDIA MLPerf v5.0 submissions of Llama 2 70B LoRA fine-tuning and Llama 3.1 405B pretraining, see Reproducing NVIDIA MLPerf v5.0 Training Scores for LLM Benchmarks. Submission repositories also include README files to reproduce the scores for all benchmarks. See, for example, those for the Llama 2 70B LoRA fine-tuning benchmark and the Llama 3.1 405B benchmark.