The NVIDIA Blackwell architecture powered the fastest time to train across every MLPerf Training v5.1 benchmark, marking a clean sweep in the latest round of results. As developers experiment with new architectures, and models continue to grow in size, more training compute is essential. Meeting this need for delivered compute requires innovation across every layer of the AI stack—from chips and systems to software—advancing performance at an unprecedented pace.
MLPerf Training v5.1 is the latest in the long-running series of industry benchmarks designed to measure AI training performance. This version measures the time to train seven models, representing a wide range of use cases, each to a specified target accuracy. The Blackwell architecture, which powers both NVIDIA Blackwell and NVIDIA Blackwell Ultra GPUs, delivered the highest performance on every benchmark at maximum scale and at each submitted scale.
| Benchmark | Time to train | Maximum submission scale |
| --- | --- | --- |
| Llama 3.1 405B pretraining | 10 minutes | 5,120 Blackwell GPUs |
| Llama 3.1 8B pretraining | 5.2 minutes | 512 Blackwell Ultra GPUs |
| Llama 2 70B LoRA fine-tuning | 0.40 minutes | 512 Blackwell Ultra GPUs |
| FLUX.1 | 12.5 minutes | 1,152 Blackwell GPUs |
| DLRM-DCNv2 | 0.71 minutes | 64 Blackwell GPUs |
| R-GAT | 0.84 minutes | 256 Blackwell GPUs |
| RetinaNet | 1.4 minutes | 512 Blackwell GPUs |
MLPerf™ Training v5.0 and v5.1 results retrieved from www.mlcommons.org on November 12, 2025, from the following entries: 5.0-0082, 5.1-0002, 5.1-0004, 5.1-0060, 5.1-0070, 5.1-0072. The MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
The NVIDIA platform was also the only one to submit results on all benchmarks. In this post, we take a closer look at these results and the technology innovations that powered them.
NVIDIA makes the industry’s first FP4 training submissions with NVFP4
Innovation in low-precision AI data formats is a key enabler of the performance gains delivered by the Blackwell architecture, which powers both Blackwell and Blackwell Ultra GPUs. The architecture incorporates hardware acceleration for FP4 data formats, including the NVIDIA-designed NVFP4 format. Blackwell GPUs offer peak FP4 throughput per clock that is twice their FP8 throughput, and Blackwell Ultra GPUs build on that innovation, increasing FP4 throughput per clock to 3x that of FP8.
As shown in the paper, Pretraining Large Language Models with NVFP4, NVFP4 provides better accuracy for the same number of tokens used during training, or achieves the same accuracy using significantly fewer tokens, compared to the industry MXFP4 data format. This means faster time to train to a specified accuracy and faster time to deployment with lower training costs.
This round, NVIDIA adopted NVFP4 in every large language model (LLM) in MLPerf Training by incorporating many of the techniques recommended in the paper. NVIDIA submissions also carefully applied “healing”—a process by which higher precisions are used during certain parts of the training process—to improve accuracy. Specifically, NVIDIA submissions kept the last few training iterations in FP8 precision.
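As a rough sketch of how such a precision schedule can be expressed, the snippet below runs the bulk of training in NVFP4 and switches the final stretch of iterations to FP8. The 5% heal fraction and the `precision_for_step` helper are illustrative placeholders, not the recipe used in the NVIDIA submissions.

```python
def precision_for_step(step: int, total_steps: int, heal_fraction: float = 0.05) -> str:
    """Simple healing schedule: NVFP4 for the bulk of training, FP8 for the final
    stretch. The 5% heal fraction is a placeholder, not the submissions' value."""
    heal_start = int(total_steps * (1.0 - heal_fraction))
    return "fp8" if step >= heal_start else "nvfp4"

# Example: a 1,000-step run heals the last 50 iterations in FP8.
schedule = [precision_for_step(s, 1_000) for s in range(1_000)]
assert schedule[0] == "nvfp4" and schedule[-1] == "fp8"
```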
These submissions required innovation at every layer of the technology stack, including hardware acceleration of NVFP4 directly in Blackwell and Blackwell Ultra silicon, acceleration libraries including NVIDIA cuBLAS, NVIDIA Transformer Engine, and NVIDIA Megatron-Core, and new numerical techniques.
Blackwell Ultra delivers a large leap for LLM training
NVIDIA submitted the first MLPerf Training results on Blackwell Ultra using an NVIDIA AI cluster codenamed “Theia,” after the Greek goddess of sight and vision. It features a total of 512 Blackwell Ultra GPUs, built from multiple NVIDIA GB300 NVL72 rack-scale systems connected using NVIDIA Quantum-X800 InfiniBand.
Blackwell Ultra GPUs incorporate several important enhancements compared to Blackwell GPUs, including:
- 1.5x peak NVFP4 throughput. Blackwell Ultra GPUs feature updated Tensor Cores that increase peak FP4 throughput per clock by 1.5x compared to Blackwell GPUs. This helps accelerate math-bound GEMM operations.
- 2x Softmax for attention. Blackwell Ultra GPUs feature an upgraded special function unit (SFU), providing 2x accelerated throughput for key softmax operations, which can be critical for the attention layer. In MLPerf benchmarks, this results in up to 1.3x speedup in the attention block.
- 1.5x larger HBM3e capacity. Blackwell Ultra GPUs incorporate higher-capacity HBM3e stacks, which are 12-Hi compared to 8-Hi in Blackwell GPUs. On the Llama 2 70B LoRA benchmark, this enabled the entire model to fit in the memory of a single GPU, with no CPU offloading required, eliminating model-parallel communication overheads and improving GEMM efficiency.
Blackwell Ultra GPU innovations, adoption of NVFP4 format, and software optimizations delivered large increases in pretraining and LLM fine-tuning performance with the same number of GPUs compared to the most recent NVIDIA submissions using the Hopper architecture.

MLPerf Training v4.1, v5.0, and v5.1, closed division. Results from entries: 4.1-0050, 5.0-0076, 5.0-0067, 5.1-0058, 5.1-0060. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.
Additionally, the latest NVIDIA Quantum-X800 networking platform—composed of NVIDIA ConnectX-8 SuperNICs, NVIDIA Quantum-X800 InfiniBand switches, and NVIDIA LinkX cables—was used to connect the multiple GB300 NVL72 racks that form the Theia cluster. This marks the industry’s first and only 800 Gb/s networking submitted to MLPerf Training.
NVIDIA Blackwell sets new Llama 3.1 405B training record
On Llama 3.1 405B, the largest and most challenging benchmark in MLPerf Training v5.1, NVIDIA set a new time-to-train record of 10 minutes, powered by 5,120 Blackwell GPUs. This is a 2.7x speedup compared to the fastest submission using Blackwell GPUs last round.*
Two major factors contributed to this large speedup. With the use of NVFP4 training recipes and general software enhancements, the submission using 2,560 Blackwell GPUs achieved a score of 18.79 minutes, 3x faster than the previous NVIDIA submission using the same number of NVIDIA Hopper architecture GPUs.* Effective performance per Blackwell GPU also increased by 42% when comparing the 2,496 Blackwell GPU submission last round to the 2,560 Blackwell GPU submission this round.*
* MLPerf™ Training v5.0 and v5.1 results retrieved from www.mlcommons.org on November 12, 2025, from the following entries: 5.0-0067, 5.0-0002, 5.0-0003, 5.0-0004, 5.1-0003, 5.1-0004, 5.1-0071. Performance-per-GPU is not an official MLPerf metric, and is derived by dividing the ratios of delivered performance and scales submitted. The MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
MLPerf™ Training v5.0 and v5.1 results retrieved from www.mlcommons.org on November 12, 2025, from the following entries: 5.0-0001, 5.0-0002, 5.0-0003, 5.0-0004, 5.0-0005, 5.0-0013, 5.0-0014, 5.1-0003, 5.1-0004, 5.1-0071. Performance-per-GPU is not an official MLPerf metric, and is derived by dividing the ratios of delivered performance and scales submitted. The MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
This submission used a total of 5,120 Blackwell GPUs, more than double the largest submitted scale of 2,496 Blackwell GPUs in the prior round, connected using NVLink for scale-up within a rack and NVIDIA Quantum-2 InfiniBand for scale-out across racks. Because performance increased by 2.7x while the GPU count roughly doubled, the gains came from both the larger scale and increased effective performance per GPU.
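As noted in the footnotes, performance per GPU is not an official MLPerf metric; it can be derived by dividing the ratio of delivered performance (the inverse of time to train) by the ratio of submitted scales. With times T and GPU counts N for the two rounds:

$$
\frac{\text{perf/GPU}_{\text{new}}}{\text{perf/GPU}_{\text{prev}}}
= \frac{1/(T_{\text{new}}\,N_{\text{new}})}{1/(T_{\text{prev}}\,N_{\text{prev}})}
= \frac{T_{\text{prev}}}{T_{\text{new}}}\cdot\frac{N_{\text{prev}}}{N_{\text{new}}}
$$

The 42% per-GPU gain quoted above is this ratio evaluated with N_prev = 2,496, N_new = 2,560, and the corresponding submitted times.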
Scaling efficiency, meaning how much of the ideal linear speedup is retained as GPUs are added, was 85% when scaling 10x from 512 Blackwell GPUs to 5,120 Blackwell GPUs.
This is critical as it enables model builders to scale training runs, accelerating time to train and time to revenue, while ensuring that each of those incremental GPUs achieves high utilization.
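As a simple sketch of how that efficiency figure is computed (the baseline time below is a hypothetical placeholder, not a published result):

```python
def scaling_efficiency(t_base: float, n_base: int, t_scaled: float, n_scaled: int) -> float:
    """Ratio of the achieved speedup to the ideal (linear) speedup when growing the GPU count."""
    achieved_speedup = t_base / t_scaled
    ideal_speedup = n_scaled / n_base
    return achieved_speedup / ideal_speedup

# Hypothetical numbers for illustration: a 100-minute run at 512 GPUs that
# finishes in 11.8 minutes at 5,120 GPUs retains ~85% of the ideal 10x speedup.
print(scaling_efficiency(t_base=100.0, n_base=512, t_scaled=11.8, n_scaled=5_120))
```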
Blackwell Ultra sets the bar for Llama 3.1 8B training performance
To ensure that MLPerf Training results represent modern AI use cases, the benchmark is regularly updated. This round, BERT-large was replaced by Llama 3.1 8B, which provides a substantial increase in capability and training complexity while maintaining a simple, accessible LLM for a broader range of platforms.
The NVIDIA platform delivered the highest performance on the Llama 3.1 8B training benchmark, both in terms of performance at a given number of GPUs and performance at scale.
Llama 3.1 8B submissions also benefited from several full-stack optimizations.
One was the use of NVFP4 training recipes, which enabled performance increases while maintaining accuracy, even with a much smaller model.
Next, with increased context lengths, attention becomes a critical component of end-to-end LLM pretraining performance. Previous NVIDIA LLM pretraining submissions used BF16 precision for the inputs of the batched-matrix-multiply (BMM) computations in the attention block. This round, NVIDIA submissions used FP8 precision for the attention BMM inputs on the Llama 3.1 8B pretraining benchmark, in both the forward and backward pass computations.
Our FP8 recipe achieved up to 1.3x better performance in the attention kernel of MLPerf benchmarks compared to the BF16 counterpart while still meeting the accuracy requirements of the benchmark.
The FP8 attention recipe used for the pretraining benchmarks this round applies per-tensor current scaling FP8 to the query (Q), key (K), and value (V) tensors, as well as to the gradient of the output (dO) used in backward propagation. FP8 attention resulted in a 5% end-to-end speedup on the Llama 3.1 8B benchmark. The FP8 attention implementation, covering both delayed-scaling and current-scaling recipes, is available in the NVIDIA cuDNN library, which is used in NVIDIA MLPerf submissions through the NVIDIA Transformer Engine library.
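To illustrate what per-tensor current scaling means, the sketch below derives a fresh scale from a tensor's live amax each step and casts to FP8. This is a conceptual illustration only (assuming PyTorch 2.1+ for the `float8_e4m3fn` dtype), not the cuDNN or Transformer Engine implementation, and the tensor shapes are assumptions.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest magnitude representable in torch.float8_e4m3fn

def quantize_fp8_current_scaling(x: torch.Tensor):
    """Per-tensor current scaling: derive the scale from the tensor's live amax at
    each step (instead of a history of past amaxes, as in delayed scaling)."""
    amax = x.abs().amax().clamp(min=1e-12)
    scale = FP8_E4M3_MAX / amax               # map the observed amax to the FP8 max
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)
    return x_fp8, scale                       # the scale is folded back into the BMM

# Assumed attention-input shapes, for illustration only.
q = torch.randn(2, 16, 128, 64)
q_fp8, q_scale = quantize_fp8_current_scaling(q)
q_dequant = q_fp8.to(torch.float32) / q_scale  # reference dequantization
```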
Other software optimizations implemented for the pretraining models focused on eliminating device-to-device memory copies and tensor concatenations:
- Implementing a fused RoPE kernel in Transformer Engine that takes the combined Q/K/V input and outputs the Q, K, and V tensors. This avoids splitting the Q, K, and V tensors in the forward pass and concatenating the dQ, dK, and dV tensors in the backward pass (see the sketch after this list).
- Avoiding converting the attention input to the BSHD layout by using the SBHD attention layout directly. This change was implemented in Megatron-LM. In this notation, B is the batch size, S the sequence length, H the number of attention heads, and D the head dimension, consistent with Transformer Engine notation.
- Fusing amax computation into the producer operations.
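The first item can be pictured with the sketch below: rotary embeddings are applied to the Q and K slices of a packed QKV buffer through views, so no separate Q/K/V tensors are materialized. This is a conceptual PyTorch illustration of the idea, not the Transformer Engine kernel, and the sequence-first tensor shapes are assumptions.

```python
import torch

def rope_on_packed_qkv(qkv: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Apply rotary position embeddings to the Q and K slices of a packed
    [S, B, 3, H, D] tensor through views, so no separate Q/K/V tensors are
    split out in the forward pass or re-concatenated in the backward pass."""
    q, k = qkv[:, :, 0], qkv[:, :, 1]  # views into the packed buffer, not copies
    for t in (q, k):
        half = t.shape[-1] // 2
        t1, t2 = t[..., :half], t[..., half:]
        rotated = torch.cat((-t2, t1), dim=-1)  # "rotate half" form of RoPE
        t.copy_(t * cos + rotated * sin)
    return qkv

# Usage with assumed shapes: sequence-first (SBHD-style) packed QKV.
S, B, H, D = 128, 2, 16, 64
qkv = torch.randn(S, B, 3, H, D)
inv_freq = 1.0 / (10000 ** (torch.arange(0, D, 2, dtype=torch.float32) / D))
freqs = torch.outer(torch.arange(S, dtype=torch.float32), inv_freq)  # [S, D/2]
emb = torch.cat((freqs, freqs), dim=-1)                               # [S, D]
cos, sin = emb.cos()[:, None, None, :], emb.sin()[:, None, None, :]   # broadcast over B, H
qkv = rope_on_packed_qkv(qkv, cos, sin)
```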
Highest performance on new FLUX.1 benchmark
Another benchmark update was the addition of the FLUX.1 image generation model, replacing Stable Diffusion v2. On this test, NVIDIA once again set the bar, delivering the fastest time to train at scale of 12.5 minutes using 1,152 Blackwell GPUs. NVIDIA was also the only platform to submit to this benchmark, highlighting both the performance and versatility of the NVIDIA training stack.
Llama 2 70B LoRA software optimizations
This round, several fusion optimizations were implemented that significantly benefited the Llama 2 70B LoRA fine-tuning benchmark. The core idea is a LoRALinearLayer abstraction that combines the LoRA adapters and the frozen GEMM within the same module. Building this abstraction enables fusing the cast operations, the scaling operations, and the addition of the adapter output to the frozen GEMM output.
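A minimal sketch of what such a module can look like follows; the rank, scaling, and initialization details are illustrative assumptions, and this is not the implementation used in the NVIDIA submissions.

```python
import torch
import torch.nn as nn

class LoRALinearLayer(nn.Module):
    """Conceptual sketch: keep the frozen base weight and the LoRA adapters in one
    module so the base GEMM, the low-rank update, its scaling, and the final
    addition can be scheduled and fused together rather than run as separate modules."""

    def __init__(self, in_features: int, out_features: int, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)  # frozen pretrained weight
        self.lora_a = nn.Linear(in_features, rank, bias=False)   # trainable down-projection
        self.lora_b = nn.Linear(rank, out_features, bias=False)  # trainable up-projection
        nn.init.zeros_(self.lora_b.weight)  # adapter starts as a no-op
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Single expression over one module: a natural place to fuse the casts,
        # the scaling, and the addition onto the frozen GEMM.
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

layer = LoRALinearLayer(4096, 4096)
out = layer(torch.randn(8, 4096))
```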
Key takeaways
NVIDIA is innovating on a one-year rhythm across GPUs, CPUs, scale-up networking, scale-out networking, system architecture, and software to drive up performance, drive down the cost of intelligence, and pave the way for new AI breakthroughs.
See more NVIDIA performance data on the Data Center Deep Learning Product Performance Hub and Performance Explorer pages.