In recent years, AI workloads have grown exponentially—not only in the deployment of large language models (LLMs) but also in the demand to process ever more tokens during pretraining and post-training. As organizations scale up compute infrastructure to train and deploy multi-billion-parameter foundation models, the ability to sustain higher token throughput has become mission critical. Progress is increasingly defined not just by efficiency, but by how many tokens an AI factory can push through to unlock the next wave of model capabilities.
AI-optimized data formats have emerged as a key innovation in this effort. Narrow-precision computation has already transformed inference with NVIDIA’s introduction of NVFP4, a 4-bit format purpose-built to deliver exceptional inference latency, throughput, and efficiency—all while maintaining production-grade accuracy.
Now, NVIDIA is extending this innovation to the pretraining phase, marking a major leap forward in LLM development. Using NVFP4 for pretraining delivers substantial gains in large-scale LLM training throughput and overall infrastructure efficiency. This isn’t just an incremental optimization—it’s a foundational shift in how large models can be trained at scale.
In the era of AI factories, where compute is the engine of progress, precision is no longer a backend detail—it’s a strategic advantage. NVFP4 4-bit pretraining redefines the boundaries of efficiency and scalability, setting a new standard for high-performance AI model development.
NVFP4 training is still in the research phase, where the potential of 4-bit precision for large-scale model pretraining is being explored and validated. NVIDIA is actively collaborating on NVFP4 with leading organizations such as Amazon Web Services, Cohere, Google Cloud, Kimi AI, Microsoft AI, OpenAI, Perplexity, Reflection, and Runway.
What is 4-bit quantization?
4-bit quantization refers to the process of reducing the precision of model weights and activations to just 4 bits—a dramatic drop from the typical 16-bit or 32-bit floating-point formats.
Pretraining with 4 bits is challenging because gradients and updates must be handled very carefully to preserve accuracy while improving the overall training speed. Specialized techniques and recipes are required to maintain effectiveness while mapping high-precision tensors to a much smaller set of quantized values.
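To make the idea concrete, here is a minimal NumPy sketch of one possible 4-bit mapping: each value is snapped to the nearest of the 16 representable FP4 (E2M1) values after dividing by a single per-tensor scale. This is an illustrative simplification, not NVIDIA’s implementation; real recipes use much finer-grained scaling, as described later in this post.

```python
import numpy as np

# The 8 non-negative magnitudes representable in FP4 (E2M1); the sign bit
# mirrors them for negative values, giving 16 codes in total.
FP4_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(x: np.ndarray):
    """Map a tensor to FP4-like values using a single per-tensor scale."""
    scale = np.abs(x).max() / FP4_VALUES[-1]   # map the largest |x| onto 6.0
    scale = scale if scale > 0 else 1.0        # avoid dividing by zero for all-zero input
    scaled = np.abs(x) / scale
    # Snap each element to the nearest representable FP4 magnitude.
    idx = np.abs(scaled[..., None] - FP4_VALUES).argmin(axis=-1)
    return np.sign(x) * FP4_VALUES[idx], scale

def dequantize_fp4(q: np.ndarray, scale: float) -> np.ndarray:
    return q * scale

x = np.random.randn(8, 16).astype(np.float32)
q, s = quantize_fp4(x)
print("max abs reconstruction error:", np.abs(x - dequantize_fp4(q, s)).max())
```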
How fewer bits unlock more capability for AI factories
As noted above, AI workloads have grown exponentially, not just in LLM deployment but also in the scale of foundation model pretraining and post-training. As organizations expand compute infrastructure to handle training and deployment of multi-billion-parameter models, progress is increasingly defined by how much token throughput an AI factory can sustain to unlock new capabilities.
Inference has already undergone multiple waves of innovation, moving from FP32 and FP16 down to FP8 and, most recently, to NVFP4, which NVIDIA released for AI inference. While methods like post-training quantization (PTQ) have shown NVFP4 to be a force multiplier for inference throughput while maintaining accuracy, a remaining challenge lies upstream in pretraining, where foundation models still rely on BF16 or FP8 for stability and convergence.
Training is where AI factories can spend the bulk of their compute, power, and time. Power budgets are fixed and GPU cycles are scarce, so developers must account for every bit, token, and epoch. Throughput isn’t an abstract metric here—it directly determines what scale of models can be built, how many experiments can be run, and how quickly breakthroughs arrive.
This is where 4-bit precision becomes transformative. By cutting memory needs, boosting arithmetic throughput, and optimizing communication, 4-bit pretraining allows factories to push significantly more tokens through the same hardware. With the right quantization recipe, it can deliver accuracy on par with FP8/BF16 while dramatically raising throughput—unlocking faster convergence cycles, more experiments per unit of compute, and scaling to unprecedented frontier models. In other words, fewer bits don’t just save money—they expand the frontier of what AI factories can achieve.
The NVFP4 quantization recipe for pretraining
To enable pretraining at 4-bit precision, we’ve developed a purpose-built NVFP4 pretraining recipe that addresses the core challenges of dynamic range, gradient volatility, and numerical stability in large-scale training.
Blackwell was the first architecture from NVIDIA to natively support FP4 formats. The massive FP4 FLOPs throughput on GB200 and GB300 enables efficient 4-bit training by accelerating narrow-precision matrix operations while maintaining the scale and parallelism needed for large model convergence—making them ideal for next-generation AI factories deploying FP4-based pretraining.
Figure 1 below shows measured GEMM performance on Blackwell Ultra, revealing a 7x speedup over the Hopper generation. Modern LLMs rely fundamentally on matrix multiplication, particularly within their fully connected (linear) layers, which makes the efficiency of these operations crucial. Because FP4 precision executes these operations faster and more efficiently, the observed GEMM acceleration means the entire pretraining process, from forward propagation to gradient updates, runs significantly faster, reducing time-to-train and enabling faster development of larger-scale models.
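To make concrete why GEMM throughput dominates the training step, the short NumPy sketch below traces a single linear layer: the forward pass and both gradient computations are each a plain GEMM, so an FP4 speedup on these operations accelerates every stage. The shapes and random data are illustrative only.

```python
import numpy as np

# One linear layer y = x @ W.T: the forward pass and both gradient
# computations are plain GEMMs, so FP4 GEMM speedups apply to every
# stage of a training step.
batch, d_in, d_out = 32, 1024, 4096
x = np.random.randn(batch, d_in)
W = np.random.randn(d_out, d_in)

y = x @ W.T                              # forward GEMM
grad_y = np.random.randn(batch, d_out)   # gradient arriving from the next layer
grad_x = grad_y @ W                      # backward GEMM: gradient w.r.t. activations
grad_W = grad_y.T @ x                    # backward GEMM: gradient w.r.t. weights

print(y.shape, grad_x.shape, grad_W.shape)
```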

To enable efficient narrow-precision training, NVIDIA’s NVFP4 pretraining recipe leverages several key techniques, chosen for their performance and accuracy. These include:
- Enhanced value representation with NVFP4’s micro-block scaling: Blackwell introduces native Tensor Core support for NVFP4, a 4-bit numerical format for both weights and activations that uses micro-block scaling, where each group of sixteen 4-bit elements shares a common scaling factor. By reducing the block size from 32 elements (as in MXFP4) to 16, NVFP4 limits the influence of outliers and enables more precise scaling. This finer granularity reduces quantization error and improves overall model accuracy (see the micro-block scaling sketch after this list).
- NVFP4 high-precision block encoding with E4M3 scale factors: Scale factor precision plays a critical role in quantization quality and accuracy. Unlike MXFP4, which is limited to power-of-two scale factors (E8M0) and is prone to high rounding errors, NVFP4 uses higher-precision E4M3 scale factors with additional mantissa bits. This allows finer-grained scaling, better utilization of the limited quantization bins, and more accurate representation of values within a block (also illustrated in the micro-block scaling sketch after this list).
- Reshaping tensor distributions to fit narrow formats: Gradients and activations during LLM pretraining tend to have large outliers that can hurt narrow-precision quantization. Applying Hadamard transforms to GEMM inputs reshapes their distribution to be more Gaussian-like, which smooths outliers and makes tensors easier to represent accurately. These transformations are transparent to the model architecture and can be applied to linear layers in both the forward and backward passes (see the Hadamard transform sketch after this list).
- Maintaining fidelity with quantization techniques: To ensure stable and efficient training, we employ quantization methods that preserve consistency between the forward and backward passes. Techniques such as selective 2D block-based quantization help maintain alignment in tensor representations throughout the training cycle. This consistency is key to minimizing signal distortion, improving convergence behavior, and enhancing overall robustness—especially when operating under narrow-precision formats like NVFP4.
- Reducing bias with stochastic rounding: Unlike traditional (deterministic) rounding, where gradients are always rounded to the nearest representable number, stochastic rounding rounds them up or down randomly, with probabilities proportional to how close the value lies to the two neighboring representable values. This is essential for reducing rounding bias, maintaining gradient flow during training, and ultimately improving model accuracy (see the stochastic rounding sketch after this list).
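The sketch below illustrates the first two points in simplified NumPy: each group of 16 values shares one scale factor, and that scale is itself rounded to a reduced-precision value with a few mantissa bits as a stand-in for E4M3 encoding. It is a conceptual approximation of the format, not the Blackwell Tensor Core implementation; for example, it ignores E4M3’s exponent range and saturation behavior.

```python
import numpy as np

FP4_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes
FP4_MAX = FP4_VALUES[-1]

def round_scale_like_e4m3(s: np.ndarray) -> np.ndarray:
    """Keep ~3 mantissa bits per scale (a simplified stand-in for E4M3;
    real E4M3 also has a limited exponent range and saturation rules)."""
    mant, exp = np.frexp(s)              # s = mant * 2**exp, with mant in [0.5, 1)
    mant = np.round(mant * 16) / 16      # quantize the mantissa
    return mant * 2.0 ** exp

def quantize_nvfp4_style(x: np.ndarray, block: int = 16) -> np.ndarray:
    """Quantize a 1-D tensor with one reduced-precision scale per 16-element block,
    returning the dequantized result for comparison with the input."""
    x = x.reshape(-1, block)
    scales = round_scale_like_e4m3(np.abs(x).max(axis=1, keepdims=True) / FP4_MAX)
    scales = np.where(scales == 0, 1.0, scales)          # guard all-zero blocks
    scaled = x / scales
    idx = np.abs(np.abs(scaled)[..., None] - FP4_VALUES).argmin(axis=-1)
    return (np.sign(scaled) * FP4_VALUES[idx] * scales).reshape(-1)

x = np.random.randn(64 * 16)
print("mean abs error:", np.abs(x - quantize_nvfp4_style(x)).mean())
```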
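Next, an illustrative sketch of the Hadamard transform idea: because a normalized Hadamard matrix H satisfies H Hᵀ = I, it can be folded into both GEMM operands without changing the product, while spreading outliers across many elements so each operand is easier to quantize. The dimensions and injected outlier are made up for demonstration; this is not NVIDIA’s kernel-level implementation.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Build an n x n Hadamard matrix by Sylvester's construction (n a power of two),
    normalized so that H @ H.T is the identity."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

k = 64
H = hadamard(k)
x = np.random.randn(128, k)
x[:, 0] *= 50.0                          # inject an outlier channel, as seen in activations/gradients
W = np.random.randn(k, 256)

# Fold the rotation into both GEMM operands: (x H)(H.T W) equals x W up to float error,
# but the rotated operands have a much smaller dynamic range to quantize.
x_rot, W_rot = x @ H, H.T @ W
print("product unchanged:", np.allclose(x @ W, x_rot @ W_rot))
print("max |x| before vs. after rotation:", np.abs(x).max(), np.abs(x_rot).max())
```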
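Finally, a small sketch of stochastic rounding onto an arbitrary grid of representable values: each number is rounded up or down at random with probability proportional to its proximity to the two neighboring grid points, so the result is unbiased in expectation. The grid below mimics signed FP4 values purely for illustration.

```python
import numpy as np

def stochastic_round(x: np.ndarray, grid: np.ndarray, rng=None) -> np.ndarray:
    """Round each element of x to one of its two neighboring grid values, chosen at
    random with probability proportional to proximity (unbiased in expectation)."""
    rng = rng or np.random.default_rng(0)
    x = np.clip(x, grid[0], grid[-1])
    hi = np.clip(np.searchsorted(grid, x), 1, len(grid) - 1)   # grid point above (or equal)
    lo = hi - 1                                                # grid point below
    p_up = (x - grid[lo]) / (grid[hi] - grid[lo])              # closer to the top -> more likely up
    return np.where(rng.random(x.shape) < p_up, grid[hi], grid[lo])

# A grid mimicking signed FP4 values, for illustration only.
grid = np.array([-6, -4, -3, -2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, 3, 4, 6], dtype=float)
x = np.full(100_000, 0.8)                 # every value sits between 0.5 and 1.0
print("mean after stochastic rounding:", stochastic_round(x, grid).mean())
# Nearest-value rounding would always give 1.0, biasing the mean upward by 0.2.
```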

NVFP4 makes 4-bit pretraining real: accuracy and stability at trillion-token scale
For narrow-precision formats to be practical in large-scale pretraining, they must ensure both model accuracy and stable convergence. To assess the viability of 4-bit precision in large-scale model training, experiments were conducted with FP8 and NVFP4 on a 12-billion-parameter model based on a combined Mamba-Transformer architecture (the 12B Hybrid Mamba-Transformer model), similar to NVIDIA Nemotron Nano 2. This model was trained on a massive dataset of 10 trillion tokens using a phased data-blending approach, switching to a different dataset mix for the second phase at 70% of pretraining and again for the third phase at 90%.
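For illustration only, the phase boundaries above can be expressed as a simple blend schedule; the dataset names and mixture weights in the sketch below are placeholders, not the actual data blends used in the experiment.

```python
# Illustrative phase-switching blend schedule: the dataset mixture changes at 70%
# and 90% of the total token budget. Dataset names and mixture weights here are
# placeholders, not the blends used in the actual experiment.
PHASES = [
    (0.70, {"web": 0.7, "code": 0.2, "math": 0.1}),   # phase 1: first 70% of tokens
    (0.90, {"web": 0.5, "code": 0.3, "math": 0.2}),   # phase 2: 70% to 90%
    (1.00, {"web": 0.3, "code": 0.4, "math": 0.3}),   # phase 3: final 10%
]

def blend_for(tokens_seen: int, total_tokens: int) -> dict:
    progress = tokens_seen / total_tokens
    for boundary, mixture in PHASES:
        if progress < boundary or boundary == 1.00:
            return mixture

print(blend_for(8_000_000_000_000, 10_000_000_000_000))   # 80% through -> phase 2 mixture
```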
A version of the 12B Hybrid Mamba-Transformer model was initially trained with 8-bit precision (FP8), which previous studies have shown to closely match 16-bit precision; it therefore served as our baseline for comparison. We then successfully trained the same 12B model from scratch using NVFP4, demonstrating that this new low-precision format can support full pretraining at trillion-token scale. The NVFP4 run exhibited stable convergence without the training instabilities or divergence issues that typically plague ultra-low-precision training.
Figure 3 below shows that NVFP4’s validation loss curve closely matches the loss curves from the higher-precision baseline (i.e., FP8) throughout the entire duration of training. The quantization techniques outlined above ensure that even with aggressive bit-width reduction, the 4-bit pretraining dynamics closely resemble those of higher-precision runs.

We then took the 12B Hybrid Mamba-Transformer model pretrained using NVFP4 and compared it to the higher-precision FP8 baseline across a range of downstream tasks and intelligence domains. Figure 4 illustrates that across all domains, NVFP4 matches the performance of FP8, highlighting its effectiveness. This finding strengthens the initial hypothesis: NVFP4 is a robust choice for pretraining LLMs even at trillion-token scale, underscoring its potential for efficient large-scale frontier model training.

Train smarter, not just harder
NVIDIA’s NVFP4 format is redefining the landscape of AI training—setting a new benchmark for speed, efficiency, and purposeful innovation. By enabling 4-bit pretraining, NVFP4 empowers AI factories to scale more rapidly and sustainably, paving the way for the next era of generative AI. As a dynamic and evolving technology, NVFP4 continues to unlock new opportunities for teams building frontier models, driving progress in energy-efficient, high-performance AI. With its breakthrough in compute efficiency, 4-bit pretraining opens the door to more advanced architectures, larger training runs, and significantly more tokens—fueling the future of intelligent systems.