
3 Ways NVFP4 Accelerates AI Training and Inference

The latest AI models continue to grow in size and complexity, demanding increasing amounts of compute performance for training and inference, far beyond what Moore’s Law alone can deliver. That’s why NVIDIA engages in extreme codesign: designing chips, systems, and software together enables large generational leaps in AI factory performance and efficiency.

Lower-precision AI formats are key to improving compute performance and energy efficiency. Bringing the benefits of ultra-low-precision numerics to AI training and inference while maintaining high accuracy requires extensive engineering across every layer of the technology stack: creating the formats, implementing them in silicon, enabling them across many libraries, and working closely with the ecosystem to deploy new training recipes and inference optimization techniques. NVFP4, developed and implemented for NVIDIA GPUs starting with NVIDIA Blackwell, delivers the performance and energy-efficiency benefits of 4-bit floating-point precision while maintaining accuracy on par with higher-precision formats.
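To make the format concrete: NVFP4 stores values as 4-bit floating point (E2M1) and recovers accuracy through scaling, with an FP8 scale factor for each small micro-block of values on top of a per-tensor scale. The sketch below is a simplified NumPy illustration of the block-scaling idea, not NVIDIA's hardware implementation; it omits the per-tensor scale, and the helper names and rounding details are illustrative assumptions.

```python
import numpy as np

# Representable magnitudes of an E2M1 (4-bit) float: 0, 0.5, 1, 1.5, 2, 3, 4, 6.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 16  # NVFP4 micro-block size

def quantize_nvfp4_block(x):
    """Illustrative block scaling: one scale per micro-block plus E2M1 values."""
    # The per-block scale maps the largest magnitude onto the top of the E2M1 range.
    max_abs = np.abs(x).max()
    scale = max_abs / E2M1_GRID[-1] if max_abs > 0 else 1.0
    # Round each scaled element to the nearest representable E2M1 magnitude.
    scaled = x / scale
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    return q, scale  # in hardware, q is stored in 4 bits and the scale in FP8

def dequantize_nvfp4_block(q, scale):
    return q * scale

x = np.random.randn(BLOCK).astype(np.float32)
q, s = quantize_nvfp4_block(x)
print("max abs error:", np.abs(x - dequantize_nvfp4_block(q, s)).max())
```

Because each 16-element block carries its own scale, outliers in one block do not force the rest of the tensor into a coarse range, which is what lets a 4-bit format track higher-precision accuracy.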

For those looking to maximize AI training and inference performance, here are three things to know about NVFP4.

1. NVFP4 enables large performance leaps for training and inference on the Blackwell architecture—and beyond

NVIDIA Blackwell Ultra GPUs provide peak dense NVFP4 throughput up to 15 petaFLOPS—3x that of FP8 on the same GPUs. The gains aren’t just about peak specs; they’re visible in measured performance of training and inference workloads. 

For inference, as shown in a recent technical blog post, moving from FP8 to NVFP4 leads to dramatic improvements in delivered token throughput at a given level of interactivity on DeepSeek-R1, a popular 671B-parameter mixture-of-experts (MoE) model. Throughput per GPU increases at a given per-user token rate, and higher per-user token rates become achievable, enabling better user experiences.

Figure 1. Throughput versus interactivity (token throughput per GPU against per-user interactivity) for FP8 without MTP, FP8 with MTP, and NVFP4 with MTP on HGX B200, with 8K/1K sequence lengths and aggregated serving. Each step shifts the curve toward more throughput at a given interactivity level and enables higher peak interactivity.

NVIDIA also recently published an NVFP4 training recipe, bringing the significant performance benefits of NVFP4 to model training, enabling model makers to train AI faster and at lower cost. 

Figure 2. Relative Llama 3.1 405B pretraining and Llama 2 70B LoRA fine-tuning performance at 512-GPU and 8-GPU scales, respectively, across MLPerf Training rounds: Hopper submissions in prior rounds, Blackwell GB200 NVL72 submissions in v5.0, and Blackwell Ultra GB300 NVL72 submissions in v5.1. Relative speedups are 1x, ~2x, and 4x+ for Llama 3.1 405B, and 1x, ~3x, and ~5x for Llama 2 70B LoRA.

In the latest version of the MLPerf Training benchmark suite, multiple NVIDIA GB300 NVL72 systems—totaling 512 Blackwell Ultra GPUs—worked together using NVFP4 precision to complete the Llama 3.1 405B pretraining benchmark in 64.6 minutes. This is 1.9x faster than the 512-GPU submission using multiple NVIDIA GB200 NVL72 Blackwell systems, which completed the benchmark using FP8 in the prior round.

Looking ahead, the NVIDIA Rubin platform delivers large leaps in NVFP4 capability for training and inference, offering 35 petaFLOPS of NVFP4 training compute and 50 petaFLOPS of NVFP4 Transformer Engine inference compute. These are 3.5x and 5x leaps compared to Blackwell, respectively.

2. NVFP4 delivers great accuracy, proven on industry benchmarks

For MLPerf Training and Inference submissions in the closed division to be valid, they must meet accuracy requirements specified by the benchmarks. For inference, responses must meet certain accuracy thresholds, and for training, the models must be trained to specific quality targets (i.e., the model training process must converge).

NVIDIA successfully submitted results in the closed division on every large language model (LLM) test using NVFP4 on Blackwell and Blackwell Ultra GPUs in the latest version of MLPerf Training. NVIDIA has also submitted results across many models and scenarios using NVFP4 in MLPerf Inference, including DeepSeek-R1, Llama 3.1 8B and 405B, and Llama 2 70B. NVIDIA used NVFP4-quantized versions of these models, all while meeting the strict benchmark requirements.

Figure 3. DeepSeek-R1 0528 evaluation scores showing NVFP4 closely matches the accuracy of the FP8 baseline: NVFP4 is within 1% on MMLU-Pro, GPQA Diamond, HLE, and LiveCodeBench, identical on SciCode and MATH-500, and 2% lower on AIME 2024.

3. NVFP4 enjoys broad and growing ecosystem support

Libraries like NVIDIA Model Optimizer, LLM Compressor, and torch.ao enable developers to quantize models trained at higher precision to NVFP4 and to use an NVFP4 KV cache, supporting long contexts and large batch sizes while preserving accuracy. Popular inference frameworks, including NVIDIA TensorRT-LLM, vLLM, and SGLang, already support running models in NVFP4 format, and many models are available in NVFP4 variants. For example, on Hugging Face, developers can find ready-to-deploy NVFP4 versions such as Llama 3.3 70B, FLUX.2, DeepSeek-R1-0528, Kimi-K2-Thinking, Qwen3-235B-A22B, and NVIDIA Nemotron Nano.
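As a rough illustration of the quantization path, the sketch below follows the post-training quantization pattern from NVIDIA Model Optimizer (the nvidia-modelopt package), assuming a recent release that exposes an NVFP4 configuration such as mtq.NVFP4_DEFAULT_CFG; the model name and calibration data are placeholders, and the exact config names and export flow should be checked against the Model Optimizer documentation.

```python
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a higher-precision checkpoint to be quantized (model name is illustrative).
model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)

def forward_loop(m):
    # Run a small calibration set through the model so Model Optimizer can
    # collect the statistics needed to choose NVFP4 scale factors.
    with torch.no_grad():
        for text in ["The quick brown fox jumps over the lazy dog."] * 8:
            inputs = tokenizer(text, return_tensors="pt").to(m.device)
            m(**inputs)

# NVFP4_DEFAULT_CFG is assumed to be available in recent Model Optimizer releases.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
```

The quantized model can then be exported to a checkpoint format consumed by inference frameworks such as TensorRT-LLM, vLLM, or SGLang.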

The ecosystem is also adopting NVFP4 to increase inference throughput in production across a variety of models; adopters include Black Forest Labs, Radical Numerics, and Cognition.

Black Forest Labs worked with NVIDIA to scale NVFP4 inference for FLUX.2 on Blackwell. “By layering optimizations like CUDA Graphs, torch.compile, NVFP4 precision, and TeaCache, we achieve up to a 6.3x speedup on a single B200—dramatically reducing latency and enabling more efficient production deployment,” said Robin Rombach, co-founder and CEO of Black Forest Labs.

Radical Numerics has leveraged NVFP4 to accelerate scientific world model scaling. “Unlike language, scientific data pushes us beyond the classical single-modality autoregressive recipe, demanding extremely long-context methods and robust multimodal fusion,” said Michael Poli, co-founder and chief AI scientist at Radical Numerics. He added the company is “highly optimistic” about using low-precision recipes to pretrain and post-train its new architecture.

And Cognition is seeing “significant latency and throughput gains” by using NVFP4 in large-scale reinforcement learning, said Steven Cao, a member of Cognition’s research team.

The NVIDIA Transformer Engine library incorporates an implementation of the NVFP4 training recipe, and training frameworks such as Megatron-Bridge provide implementations for developers to get started. NVIDIA also continues to innovate and collaborate with partners to bring the performance and efficiency benefits of NVFP4 training to the entire ecosystem, paving the way to smarter, more complex models trained faster and more efficiently.
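In practice, Transformer Engine training wraps the forward pass in an autocast context driven by a low-precision recipe, while master weights and the optimizer step remain in higher precision. The sketch below follows that pattern under stated assumptions: the NVFP4 recipe class name is an assumption (consult the Transformer Engine documentation for the recipe your version exposes), and on hardware or versions without NVFP4 support an FP8 recipe such as DelayedScaling would be substituted.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Assumed name of the NVFP4 training recipe; verify against your TE release.
nvfp4_recipe = recipe.NVFP4BlockScaling()

layer = te.Linear(4096, 4096, bias=True).cuda()
optimizer = torch.optim.AdamW(layer.parameters(), lr=1e-4)

x = torch.randn(8, 4096, device="cuda", requires_grad=True)

# Low-precision GEMMs run inside the autocast region; the loss, gradients in
# master precision, and the optimizer step stay outside it.
with te.fp8_autocast(enabled=True, fp8_recipe=nvfp4_recipe):
    y = layer(x)
    loss = y.float().pow(2).mean()

loss.backward()
optimizer.step()
```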

Learn more

Using NVFP4 can deliver large performance gains on both the NVIDIA Blackwell and NVIDIA Rubin platforms. Through extreme codesign, these gains come with excellent accuracy for both model training and inference. NVFP4 versions of popular open LLMs are widely available, enabling services to run these models with higher throughput and at a lower cost per million tokens.

Learn more about how the significant architectural leaps in the Rubin platform, including enhanced NVFP4, enable new levels of AI training and inference performance.
