The NVIDIA Blackwell architecture powered the fastest time to train across every MLPerf Training v5.1 benchmark, marking a clean sweep in the latest round of results. As developers experiment with new architectures, and models continue to grow in size, more training compute is essential. Meeting this need for delivered compute requires innovation across every layer of the AI stack—from chips and systems to software—advancing performance at an unprecedented pace.
MLPerf Training v5.1 is the latest in the long-running series of industry benchmarks designed to measure AI training performance. This version measures the time to train seven models, representing a wide range of use cases, each to a specified target accuracy. The Blackwell architecture, which powers both NVIDIA Blackwell and NVIDIA Blackwell Ultra GPUs, delivered the highest performance on every benchmark at maximum scale and at each submitted scale.
| Benchmark | Time to train | Maximum submission scale |
| --- | --- | --- |
| Llama 3.1 405B pretraining | 10 minutes | 5,120 Blackwell GPUs |
| Llama 3.1 8B pretraining | 5.2 minutes | 512 Blackwell Ultra GPUs |
| Llama 2 70B LoRA fine-tuning | 0.40 minutes | 512 Blackwell Ultra GPUs |
| FLUX.1 | 12.5 minutes | 1,152 Blackwell GPUs |
| DLRM-DCNv2 | 0.71 minutes | 64 Blackwell GPUs |
| R-GAT | 0.84 minutes | 256 Blackwell GPUs |
| RetinaNet | 1.4 minutes | 512 Blackwell GPUs |
MLPerf™ Training v5.0 and v5.1 results retrieved from www.mlcommons.org on November 12, 2025, from the following entries: 5.0-0082, 5.1-0002, 5.1-0004, 5.1-0060, 5.1-0070, 5.1-0072. The MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
The NVIDIA platform was also the only one to submit results on all benchmarks. In this post, we take a closer look at these results and the technology innovations that powered them.
NVIDIA makes the industry’s first FP4 training submissions with NVFP4
Innovation in low-precision AI data formats is a key enabler of the performance gains delivered by the Blackwell architecture, which powers both Blackwell and Blackwell Ultra GPUs. The architecture incorporates hardware acceleration for FP4 data formats, including the NVIDIA-designed NVFP4 format. Blackwell GPUs offer peak FP4 throughput per clock that is twice their FP8 throughput, and Blackwell Ultra GPUs build on that innovation, increasing FP4 throughput per clock to 3x that of FP8.
As shown in the paper, Pretraining Large Language Models with NVFP4, NVFP4 provides better accuracy for the same number of tokens used during training, or achieves the same accuracy using significantly fewer tokens, compared to the industry MXFP4 data format. This means faster time to train to a specified accuracy and faster time to deployment with lower training costs.
This round, NVIDIA adopted NVFP4 in every large language model (LLM) in MLPerf Training by incorporating many of the techniques recommended in the paper. NVIDIA submissions also carefully applied “healing”—a process by which higher precisions are used during certain parts of the training process—to improve accuracy. Specifically, NVIDIA submissions kept the last few training iterations in FP8 precision.
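As a rough sketch of how such a precision schedule can be expressed, the snippet below runs the bulk of training in NVFP4 and switches the final stretch of iterations to FP8. The 5% heal fraction and the `precision_for_step` helper are illustrative placeholders, not the recipe used in the NVIDIA submissions.

```python
def precision_for_step(step: int, total_steps: int, heal_fraction: float = 0.05) -> str:
    """Simple healing schedule: NVFP4 for the bulk of training, FP8 for the final
    stretch. The 5% heal fraction is a placeholder, not the submissions' value."""
    heal_start = int(total_steps * (1.0 - heal_fraction))
    return "fp8" if step >= heal_start else "nvfp4"

# Example: a 1,000-step run heals the last 50 iterations in FP8.
schedule = [precision_for_step(s, 1_000) for s in range(1_000)]
assert schedule[0] == "nvfp4" and schedule[-1] == "fp8"
```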
These submissions required innovation at every layer of the technology stack, including hardware acceleration of NVFP4 directly in Blackwell and Blackwell Ultra silicon, acceleration libraries including NVIDIA cuBLAS, NVIDIA Transformer Engine, and NVIDIA Megatron-Core, and new numerical techniques.
Blackwell Ultra delivers a large leap for LLM training
NVIDIA submitted the first MLPerf Training results on Blackwell Ultra using an NVIDIA AI cluster codenamed “Theia,” after the Greek goddess of sight and vision. It features a total of 512 Blackwell Ultra GPUs, built from multiple NVIDIA GB300 NVL72 rack-scale systems connected using NVIDIA Quantum-X800 InfiniBand.
Blackwell Ultra GPUs incorporate several important enhancements compared to Blackwell GPUs, including:
- 1.5x peak NVFP4 throughput. Blackwell Ultra GPUs feature updated Tensor Cores that increase peak FP4 throughput per clock by 1.5x compared to Blackwell GPUs. This helps accelerate math-bound GEMM operations.
- 2x Softmax for attention. Blackwell Ultra GPUs feature an upgraded special function unit (SFU), providing 2x accelerated throughput for key softmax operations, which can be critical for the attention layer. In MLPerf benchmarks, this results in up to 1.3x speedup in the attention block.
- 1.5x larger HBM3e capacity. Blackwell Ultra GPUs incorporate higher-capacity HBM3e stacks, which are 12-Hi compared to 8-Hi in Blackwell GPUs. On the Llama 2 70B LoRA benchmark, this enabled the entire model to fit in the memory of a single GPU, with no CPU offloading required, eliminating model-parallel communication overheads and improving GEMM efficiency.
Blackwell Ultra GPU innovations, adoption of NVFP4 format, and software optimizations delivered large increases in pretraining and LLM fine-tuning performance with the same number of GPUs compared to the most recent NVIDIA submissions using the Hopper architecture.

MLPerf Training v4.1, v5.0, and v5.1, closed division. Results from entries: 4.1-0050, 5.0-0076, 5.0-0067, 5.1-0058, 5.1-0060. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.
Additionally, the latest NVIDIA Quantum-X800 networking platform—composed of NVIDIA ConnectX-8 SuperNICs, NVIDIA Quantum-X800 InfiniBand switches, and NVIDIA LinkX cables—was used to connect the multiple GB300 NVL72 racks that form the Theia cluster. This marks the industry’s first and only 800 Gb/s networking submitted to MLPerf Training.
NVIDIA Blackwell sets new Llama 3.1 405B training record
On Llama 3.1 405B, the largest and most challenging benchmark in MLPerf Training v5.1, NVIDIA set a new time-to-train record of 10 minutes, powered by 5,120 Blackwell GPUs. This is a 2.7x speedup compared to the fastest submission using Blackwell GPUs last round.*
Two major factors contributed to this large speedup. With the use of NVFP4 training recipes and general software enhancements, the submission using 2,560 Blackwell GPUs achieved a score of 18.79 minutes, 3x faster than the previous NVIDIA submission using the same number of NVIDIA Hopper architecture GPUs.* Effective performance per Blackwell GPU also increased by 42% when comparing the 2,496 Blackwell GPU submission last round to the 2,560 Blackwell GPU submission this round.*
* MLPerf™ Training v5.0 and v5.1 results retrieved from www.mlcommons.org on November 12, 2025, from the following entries: 5.0-0067, 5.0-0002, 5.0-0003, 5.0-0004, 5.1-0003, 5.1-0004, 5.1-0071. Performance-per-GPU is not an official MLPerf metric, and is derived by dividing the ratios of delivered performance and scales submitted. The MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
MLPerf™ Training v5.0 and v5.1 results retrieved from www.mlcommons.org on November 12, 2025, from the following entries: 5.0-0001, 5.0-0002, 5.0-0003, 5.0-0004, 5.0-0005, 5.0-0013, 5.0-0014, 5.1-0003, 5.1-0004, 5.1-0071. Performance-per-GPU is not an official MLPerf metric, and is derived by dividing the ratios of delivered performance and scales submitted. The MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
This submission used a total of 5,120 Blackwell GPUs, more than double the largest submitted scale of 2,496 Blackwell GPUs in the prior round, connected using NVLink for scale-up within a rack and NVIDIA Quantum-2 InfiniBand for scale-out across racks. Because performance increased by 2.7x while the GPU count roughly doubled, the gains came from both the larger scale and increased effective performance per GPU.
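As noted in the footnotes, performance per GPU is not an official MLPerf metric; it can be derived by dividing the ratio of delivered performance (the inverse of time to train) by the ratio of submitted scales. With times T and GPU counts N for the two rounds:

$$
\frac{\text{perf/GPU}_{\text{new}}}{\text{perf/GPU}_{\text{prev}}}
= \frac{1/(T_{\text{new}}\,N_{\text{new}})}{1/(T_{\text{prev}}\,N_{\text{prev}})}
= \frac{T_{\text{prev}}}{T_{\text{new}}}\cdot\frac{N_{\text{prev}}}{N_{\text{new}}}
$$

The 42% per-GPU gain quoted above is this ratio evaluated with N_prev = 2,496, N_new = 2,560, and the corresponding submitted times.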
Scaling efficiency, meaning how much of the ideal linear speedup is retained as GPUs are added, was 85% when scaling 10x from 512 Blackwell GPUs to 5,120 Blackwell GPUs.
This is critical as it enables model builders to scale training runs, accelerating time to train and time to revenue, while ensuring that each of those incremental GPUs achieves high utilization.
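As a simple sketch of how that efficiency figure is computed (the baseline time below is a hypothetical placeholder, not a published result):

```python
def scaling_efficiency(t_base: float, n_base: int, t_scaled: float, n_scaled: int) -> float:
    """Ratio of the achieved speedup to the ideal (linear) speedup when growing the GPU count."""
    achieved_speedup = t_base / t_scaled
    ideal_speedup = n_scaled / n_base
    return achieved_speedup / ideal_speedup

# Hypothetical numbers for illustration: a 100-minute run at 512 GPUs that
# finishes in 11.8 minutes at 5,120 GPUs retains ~85% of the ideal 10x speedup.
print(scaling_efficiency(t_base=100.0, n_base=512, t_scaled=11.8, n_scaled=5_120))
```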
Blackwell Ultra sets the bar for Llama 3.1 8B training performance
To ensure that MLPerf Training results represent modern AI use cases, the benchmark is regularly updated. This round, BERT-large was replaced by Llama 3.1 8B, which provides a substantial increase in capability and training complexity while maintaining a simple, accessible LLM for a broader range of platforms.
The NVIDIA platform delivered the highest performance on the Llama 3.1 8B training benchmark, both in terms of performance at a given number of GPUs and performance at scale.
Llama 3.1 8B submissions also benefited from several full-stack optimizations.
One was the use of NVFP4 training recipes, which enabled performance increases while maintaining accuracy, even with a much smaller model.
Next, with increased context lengths, attention becomes a critical component of end-to-end LLM pretraining performance. Previous NVIDIA LLM pretraining submissions used BF16 precision for the inputs of the batched-matrix-multiply (BMM) computations in the attention block. This round, NVIDIA submissions used FP8 precision for the attention BMM inputs on the Llama 3.1 8B pretraining benchmark, in both the forward and backward pass computations.
Our FP8 recipe achieved up to 1.3x better performance in the attention kernel of MLPerf benchmarks compared to the BF16 counterpart while still meeting the accuracy requirements of the benchmark.
The FP8 attention recipe used for the pretraining benchmarks this round applies per-tensor current scaling FP8 to the query (Q), key (K), and value (V) tensors, as well as to the gradient of the output (dO) used in backward propagation. FP8 attention resulted in a 5% end-to-end speedup on the Llama 3.1 8B benchmark. The FP8 attention implementation, covering both delayed-scaling and current-scaling recipes, is available in the NVIDIA cuDNN library, which is used in NVIDIA MLPerf submissions through the NVIDIA Transformer Engine library.
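To illustrate what per-tensor current scaling means, the sketch below derives a fresh scale from a tensor's live amax each step and casts to FP8. This is a conceptual illustration only (assuming PyTorch 2.1+ for the `float8_e4m3fn` dtype), not the cuDNN or Transformer Engine implementation, and the tensor shapes are assumptions.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest magnitude representable in torch.float8_e4m3fn

def quantize_fp8_current_scaling(x: torch.Tensor):
    """Per-tensor current scaling: derive the scale from the tensor's live amax at
    each step (instead of a history of past amaxes, as in delayed scaling)."""
    amax = x.abs().amax().clamp(min=1e-12)
    scale = FP8_E4M3_MAX / amax               # map the observed amax to the FP8 max
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)
    return x_fp8, scale                       # the scale is folded back into the BMM

# Assumed attention-input shapes, for illustration only.
q = torch.randn(2, 16, 128, 64)
q_fp8, q_scale = quantize_fp8_current_scaling(q)
q_dequant = q_fp8.to(torch.float32) / q_scale  # reference dequantization
```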
Other software optimizations implemented for the pretraining models focused on eliminating device-to-device memory copies and tensor concatenations:
- Implementing a fused RoPE kernel in Transformer Engine that takes the combined Q/K/V input and outputs the Q, K, and V tensors. This avoids splitting the Q, K, and V tensors in the forward pass and concatenating the dQ, dK, and dV tensors in the backward pass (see the sketch after this list).
- Avoiding converting the attention input to the BSHD layout by using the SBHD attention layout directly. This change was implemented in Megatron-LM. In this notation, B is the batch size, S the sequence length, H the number of attention heads, and D the head dimension, consistent with Transformer Engine notation.
- Fusing amax computation into the producer operations.
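The first item can be pictured with the sketch below: rotary embeddings are applied to the Q and K slices of a packed QKV buffer through views, so no separate Q/K/V tensors are materialized. This is a conceptual PyTorch illustration of the idea, not the Transformer Engine kernel, and the sequence-first tensor shapes are assumptions.

```python
import torch

def rope_on_packed_qkv(qkv: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Apply rotary position embeddings to the Q and K slices of a packed
    [S, B, 3, H, D] tensor through views, so no separate Q/K/V tensors are
    split out in the forward pass or re-concatenated in the backward pass."""
    q, k = qkv[:, :, 0], qkv[:, :, 1]  # views into the packed buffer, not copies
    for t in (q, k):
        half = t.shape[-1] // 2
        t1, t2 = t[..., :half], t[..., half:]
        rotated = torch.cat((-t2, t1), dim=-1)  # "rotate half" form of RoPE
        t.copy_(t * cos + rotated * sin)
    return qkv

# Usage with assumed shapes: sequence-first (SBHD-style) packed QKV.
S, B, H, D = 128, 2, 16, 64
qkv = torch.randn(S, B, 3, H, D)
inv_freq = 1.0 / (10000 ** (torch.arange(0, D, 2, dtype=torch.float32) / D))
freqs = torch.outer(torch.arange(S, dtype=torch.float32), inv_freq)  # [S, D/2]
emb = torch.cat((freqs, freqs), dim=-1)                               # [S, D]
cos, sin = emb.cos()[:, None, None, :], emb.sin()[:, None, None, :]   # broadcast over B, H
qkv = rope_on_packed_qkv(qkv, cos, sin)
```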
Highest performance on new FLUX.1 benchmark
Another benchmark update was the addition of the FLUX.1 image generation model, replacing Stable Diffusion v2. On this test, NVIDIA once again set the bar, delivering the fastest time to train at scale of 12.5 minutes using 1,152 Blackwell GPUs. NVIDIA was also the only platform to submit to this benchmark, highlighting both the performance and versatility of the NVIDIA training stack.
Llama 2 70B LoRA software optimizations
This round, several fusion optimizations were implemented that significantly benefited the Llama 2 70B LoRA fine-tuning benchmark. The core idea is a LoRALinearLayer abstraction that combines the LoRA adapters and the frozen GEMM within the same module. Building this abstraction enables fusing the cast operations, the scaling operations, and the addition of the adapter output to the frozen GEMM output.
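A minimal sketch of what such a module can look like follows; the rank, scaling, and initialization details are illustrative assumptions, and this is not the implementation used in the NVIDIA submissions.

```python
import torch
import torch.nn as nn

class LoRALinearLayer(nn.Module):
    """Conceptual sketch: keep the frozen base weight and the LoRA adapters in one
    module so the base GEMM, the low-rank update, its scaling, and the final
    addition can be scheduled and fused together rather than run as separate modules."""

    def __init__(self, in_features: int, out_features: int, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)  # frozen pretrained weight
        self.lora_a = nn.Linear(in_features, rank, bias=False)   # trainable down-projection
        self.lora_b = nn.Linear(rank, out_features, bias=False)  # trainable up-projection
        nn.init.zeros_(self.lora_b.weight)  # adapter starts as a no-op
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Single expression over one module: a natural place to fuse the casts,
        # the scaling, and the addition onto the frozen GEMM.
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

layer = LoRALinearLayer(4096, 4096)
out = layer(torch.randn(8, 4096))
```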
Key takeaways
NVIDIA is innovating on a one-year rhythm across GPUs, CPUs, scale-up networking, scale-out networking, system architecture, and software to drive up performance, drive down the cost of intelligence, and pave the way for new AI breakthroughs.
See more NVIDIA performance data on the Data Center Deep Learning Product Performance Hub and Performance Explorer pages.