Data Center / Cloud

NVIDIA Blackwell Enables 3x Faster Training and Nearly 2x the Training Performance Per Dollar of the Previous-Gen Architecture

AI innovation continues to be driven by three scaling laws: pre-training, post-training, and test-time scaling. Pre-training is foundational to building smarter models, and post-training—which can include fine-tuning, reinforcement learning, and other techniques—further increases accuracy on specific tasks and provides models with new capabilities, such as the ability to reason. 

As foundation models grow larger and more complex, they require more compute performance to train. For a fixed amount of delivered compute, that means longer training times and higher training costs. And, as AI researchers continue to experiment to identify their next set of model architecture breakthroughs, still more training compute is required before the final pre-training run. Innovation is needed to dramatically increase delivered compute for training ever more complex models while reducing the cost per unit of compute. 

That’s where NVIDIA extreme codesign comes in. NVIDIA innovates across GPUs, CPUs, NVIDIA NVLink Switches, network interface cards (NICs), data processing units (DPUs), the NVIDIA Quantum InfiniBand platform, the NVIDIA Spectrum-X Ethernet platform, system architecture, and a mountain of software to deliver large increases in training performance—far beyond what Moore’s Law is able to deliver. And, these increases in performance not only mean shorter training times—allowing model builders to more quickly deploy their models to begin generating revenue—but also lower model training costs, increasing return on investment. 

That’s why the world’s leading AI models today are trained on NVIDIA.

In this post, we take a closer look at how new chips, as well as continued software stack innovations on the same architecture, can dramatically shorten time-to-train while significantly reducing cost-to-train. 

NVIDIA GB200 NVL72 delivers a large leap over Hopper

In the latest rounds of MLPerf Training, NVIDIA made the industry’s first and only submissions using FP4 precision—and did so across every large language model (LLM) in the benchmarking suite. These breakthroughs—the result of the NVFP4 precision accelerated in hardware by the NVIDIA Blackwell architecture, new training recipes, and overall software stack enhancements—mean that GB200 NVL72 delivered up to 3.2x faster training performance on the Llama 3.1 405B benchmark at the same GPU count compared to optimized submissions using NVIDIA Hopper running FP8. 

Figure 1. Llama 3.1 405B training performance on 512 Hopper GPUs (1x baseline) and 512 Blackwell GPUs (3.2x). 

This faster time-to-train means that model developers can bring their models to market sooner, accelerating their ability to generate revenue from their latest AI innovations. 

The increased performance of the NVIDIA Blackwell platform not only accelerates model training; because the performance gains significantly outpace the increase in hourly instance pricing, Blackwell also delivers significant performance-per-dollar gains. 

Figure 2. GB200 NVL72 demonstrates nearly 2x (1.9x) the performance per dollar of Hopper in the latest MLPerf Training results on the Llama 3.1 405B benchmark.

MLPerf Training v5.0 and v5.1, closed division. Results from entries: 5.0-0014, 5.1-0072. Training performance per dollar is not a primary metric of MLPerf Training, and performance per dollar is not verified by MLCommons. Performance per dollar is derived from MLPerf Training performance and published on-demand instance pricing. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.

Based on publicly available GPU rental prices and the most recent MLPerf Training submissions on Llama 3.1 405B with NVIDIA H100 and GB200 NVL72, GB200 NVL72 delivers almost 2x the performance per dollar of H100. 
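As a rough illustration of how such a ratio can be derived, the sketch below computes relative training performance per dollar from time-to-train and hourly instance pricing. The function name and every number in it are hypothetical placeholders for illustration only, not the actual MLPerf submission times or the published prices behind Figure 2.

```python
# Hypothetical sketch of a performance-per-dollar comparison between two
# systems running the same benchmark. All values below are illustrative
# placeholders, NOT actual MLPerf results or published instance prices.

def perf_per_dollar_ratio(time_a_min: float, price_a_gpu_hr: float,
                          time_b_min: float, price_b_gpu_hr: float,
                          num_gpus: int = 512) -> float:
    """How many times more training work per dollar system A delivers vs. B."""
    # Total cost to finish the benchmark = GPUs x hours x price per GPU-hour.
    cost_a = num_gpus * (time_a_min / 60.0) * price_a_gpu_hr
    cost_b = num_gpus * (time_b_min / 60.0) * price_b_gpu_hr
    # Both systems complete the same amount of work, so the performance-per-
    # dollar ratio is simply the inverse ratio of total benchmark costs.
    return cost_b / cost_a

# Placeholder inputs purely for illustration:
ratio = perf_per_dollar_ratio(time_a_min=40.0, price_a_gpu_hr=6.0,
                              time_b_min=120.0, price_b_gpu_hr=3.5)
print(f"System A delivers ~{ratio:.1f}x the training performance per dollar of B")
```

Because both systems complete the same benchmark workload, the faster time-to-train only has to outpace the higher hourly price for the newer system to come out ahead on cost.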

How NVFP4 training unlocked even more performance and performance per dollar on the same Blackwell GPUs

In addition to the large performance leaps that new architectures and platforms deliver as part of the NVIDIA annual roadmap rhythm, NVIDIA engineers are constantly looking for ways to deliver more performance from existing architectures through ongoing algorithmic and software innovations. 

The Blackwell architecture adds support for FP4 acceleration directly in hardware, including both industry-standard FP4 formats and the NVIDIA-designed NVFP4 format, which helps improve performance compared to other FP4 formats. The use of NVFP4 training recipes in the most recent MLPerf Training v5.1 round enabled significant training performance improvements on the same GB200 NVL72 rack-scale architecture compared to FP8 submissions in the prior round—up to 1.4x higher performance at a similar scale.
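For context, block-scaled FP4 formats such as NVFP4 store each value as a 4-bit floating-point number (E2M1) and attach a higher-precision scale factor to each small block of elements, which preserves dynamic range despite the narrow 4-bit grid. The sketch below shows the general idea of block-scaled FP4 quantization only; it is not NVIDIA’s training recipe or kernel implementation, and the block size, scale handling, and rounding here are simplified assumptions.

```python
# Minimal sketch of block-scaled FP4 (E2M1) quantization, to illustrate the
# general idea behind formats like NVFP4. This is NOT NVIDIA's implementation:
# real NVFP4 training uses hardware-accelerated kernels, compact scale factors,
# and additional recipe techniques that are not modeled here.

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # non-negative FP4 values

def quantize_block(block: list[float]) -> tuple[float, list[float]]:
    """Quantize one block of floats to signed E2M1 values plus a shared scale."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / max(E2M1_GRID)          # map the block's max onto FP4's max (6.0)
    quantized = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_GRID, key=lambda v: abs(v - target))  # round to the grid
        quantized.append(nearest if x >= 0 else -nearest)
    return scale, quantized

def dequantize_block(scale: float, quantized: list[float]) -> list[float]:
    return [scale * q for q in quantized]

# Example: quantize a block of 16 values (a commonly used FP4 block size),
# then reconstruct it to inspect the quantization error.
block = [0.01 * i - 0.08 for i in range(16)]
scale, q4 = quantize_block(block)
print(dequantize_block(scale, q4))
```

Keeping the scale per small block rather than per tensor is what lets a 4-bit grid track the local magnitude of weights, activations, and gradients closely enough for training-quality arithmetic.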

Figure 3. GB200 NVL72 performance on Llama 3.1 405B in MLPerf Training v5.0 (baseline, 1x) and v5.1 (1.4x): with continued software improvements on Blackwell, including NVFP4 training recipes, performance improved by up to 1.4x compared to the prior round.

MLPerf Training v5.0 and v5.1, closed division. Results from entries: 5.0-0067, 5.1-0072. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.

This performance improvement not only yields significantly faster training but, because it comes on the same GPUs, also translates directly into higher performance per dollar. 

Blackwell Ultra delivers another big performance boost

The NVIDIA GB300 NVL72, which features the upgraded NVIDIA Blackwell Ultra GPU, demonstrated further training speedups in MLPerf Training, fueled by significantly higher FP4 compute as well as larger high-bandwidth memory (HBM) capacity. Comparing submissions at the 512-GPU scale, GB300 NVL72 completed the Llama 3.1 405B benchmark 1.9x faster than the GB200 NVL72 submission at the same scale in the prior round, bringing the cumulative performance gain over Hopper to 4.2x. 
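To see how those figures relate, note that speedups are ratios of time-to-train, so per-round gains multiply. The quick arithmetic check below uses the rounded ratios reported in Figure 4 below, so the product is approximate.

```python
# Speedups are ratios of time-to-train, so successive gains compose by
# multiplication. The ratios below are the rounded values reported in Figure 4.
gb200_v50_vs_h100 = 2.2        # GB200 NVL72 (v5.0) vs. H100 (v5.0)
gb300_v51_vs_gb200_v50 = 1.9   # GB300 NVL72 (v5.1) vs. GB200 NVL72 (v5.0)

cumulative_vs_hopper = gb200_v50_vs_h100 * gb300_v51_vs_gb200_v50
print(f"Cumulative gain over Hopper: ~{cumulative_vs_hopper:.1f}x")  # ~4.2x
```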

Figure 4. Llama 3.1 405B training performance at 512 GPUs in MLPerf Training: H100 (v5.0) at 1x, GB200 NVL72 (v5.0) at 2.2x, GB200 NVL72 (v5.1) at 3.2x, and GB300 NVL72 (v5.1) at 4.2x, a cumulative improvement of more than 4x for GB300 NVL72.

MLPerf Training v5.0 and v5.1, closed division. Results from entries: 5.0-0014, 5.1-0060, 5.0-0067, 5.1-0072. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.

GB300 NVL72 has now demonstrated large performance gains across both MLPerf Training and MLPerf Inference compared to the prior-generation GB200 NVL72, accelerated by the broadening application of the NVFP4 data format. This means that with GB300 NVL72, model makers can train their next-generation models faster and bring them to market sooner, as well as serve them with higher throughput and increase the revenue potential from serving models. 

NVIDIA extreme codesign at the speed of light

By innovating relentlessly across GPU, CPU, scale-up fabric, scale-out and scale-across networking, system architecture, and software, NVIDIA extreme codesign is delivering massive performance leaps each year. These gains are set to enable training of larger and smarter next-generation AI models, as well as to enable fast and cost-efficient serving of those models, bringing even more value to the broader AI ecosystem. 

To learn more about our latest MLPerf Training and Inference results, visit our MLPerf benchmark webpage, and check out our MLPerf Training v5.1 and MLPerf Inference v5.1 technical blogs.
