Floating-Point 8: An Introduction to Efficient, Lower-Precision AI Training

With the growth of large language models (LLMs), deep learning is advancing both model architecture design and computational efficiency. Mixed precision training, which strategically employs lower-precision formats like brain floating point 16 (BF16) for computationally intensive operations while retaining the stability of 32-bit floating point (FP32) where needed, has been a key technique for accelerating training. Adopting lower-precision numerical formats promises faster computation and reduced memory consumption without sacrificing model accuracy. Now, the exploration of even finer-grained numerical formats, such as floating-point 8 (FP8), holds the promise of still greater efficiency without significant accuracy loss.

But how does FP8 work, and what makes it so effective? In this blog post, we’ll explore the fundamentals of FP8 training—its benefits, challenges, and common implementation approaches. We’ll also touch on the hardware architectures that support FP8 and share real-world success stories along with helpful resources.

FP8 format explanation

Modern LLMs need precision formats that balance computational efficiency with numerical stability. Although BF16 has long been the standard for efficient neural network training, the introduction of FP8 brings new, highly specialized formats that are finely tuned for the unique demands of different stages in a deep learning workflow. 

A key enabler of FP8 training’s speed and efficiency is the inclusion of dedicated FP8 Tensor Cores within the NVIDIA H100 architecture. 

Figure 1. Structure of the floating-point data types (top to bottom): FP16, BF16, FP8 E4M3, and FP8 E5M2

FP8 splits into two variants:

  • E4M3, a format with 4 exponent bits and 3 mantissa bits, prioritizes precision for forward passes, where weights and activations benefit from finer-grained values. Its approximate range of ±448, along with the ability to represent NaN, accommodates most layer outputs without overflow.
  • E5M2, short for 5 exponent bits and 2 mantissa bits, trades mantissa bits for a wider dynamic range (±57,344, plus ±inf and NaN). This broader range is crucial for backward passes, where gradients can vary significantly in magnitude.

BF16’s 8 exponent and 7 mantissa bits offer a vast dynamic range (roughly 1e-38 to 3e38), which enables it to represent the distributions of weights, activations, and gradients without scaling factors. FP8’s pair of formats (E4M3, with a range up to approximately ±448, and E5M2, up to approximately ±57,344), coupled with scaling factors, enables more efficient hardware utilization than BF16 without sacrificing convergence.
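
To make these layouts concrete, here is a minimal decoder for the two FP8 bit patterns, written as an illustrative sketch of the format definitions rather than code from any particular library; the helper name decode_fp8 is ours.

```python
def decode_fp8(bits: int, exp_bits: int, man_bits: int) -> float:
    """Decode an 8-bit pattern as FP8 E4M3 (exp_bits=4) or E5M2 (exp_bits=5)."""
    sign = -1.0 if (bits >> 7) & 1 else 1.0
    exp_field = (bits >> man_bits) & ((1 << exp_bits) - 1)
    man_field = bits & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1                      # 7 for E4M3, 15 for E5M2

    if exp_bits == 5 and exp_field == 0b11111:            # E5M2 keeps IEEE-style inf/NaN
        return sign * float("inf") if man_field == 0 else float("nan")
    if exp_bits == 4 and exp_field == 0b1111 and man_field == 0b111:
        return float("nan")                               # E4M3 has NaN but no infinities
    if exp_field == 0:                                    # subnormal numbers
        return sign * (man_field / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man_field / (1 << man_bits)) * 2.0 ** (exp_field - bias)

# Largest finite magnitudes match the ranges quoted above:
print(decode_fp8(0b0_1111_110, exp_bits=4, man_bits=3))   # 448.0   (E4M3)
print(decode_fp8(0b0_11110_11, exp_bits=5, man_bits=2))   # 57344.0 (E5M2)
```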

Why FP8 outshines integer formats for LLM training 

While an 8-bit integer (INT8) also saves memory, its fixed-point nature requires predefined scaling factors that struggle to accommodate the unpredictable and often extreme dynamic ranges of activations and gradients within transformer architectures. This can result in clipping or significant quantization noise. 

Floating-point formats like FP8 overcome this by enabling each number to have its own implicit “scale” through the exponent. For instance, the wide dynamic range of exponentiated scores in attention mechanisms (from near-zero to thousands) often leads to significant errors with INT8’s inherently fixed scaling. Additionally, gradient propagation in deep neural networks, regardless of whether the model has millions or hundreds of billions of parameters, can produce extreme values that FP8’s floating-point exponents are well-suited to represent, whereas INT8 formats often struggle with such a wide dynamic range. 
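
As a toy illustration of that failure mode (the values below are invented for the example), consider a tensor whose largest attention score is 1,000 while another entry is 0.003:

```python
# One fixed per-tensor INT8 scale, chosen to cover the largest value
amax = 1000.0
int8_scale = 127.0 / amax

small = 0.003
print(round(small * int8_scale))  # 0 -> the small value vanishes entirely in INT8

# FP8 E5M2 can still represent a nearby value, because each number carries
# its own exponent: 1.5 * 2**-9 = 0.0029296875
print(1.5 * 2**-9)
```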

NVIDIA Blackwell introduces microscaling formats

Figure 2. FP8 with a single per-tensor scaling factor stored in FP32, versus MXFP8 with multiple per-block scaling factors stored in E8M0

The latest NVIDIA Blackwell GPU architecture expands hardware support for low-precision numerical formats beyond FP8, adding narrower sub-FP8 formats such as FP6 and FP4 alongside its enhanced FP8 Blackwell Tensor Cores.

The fundamental distinction between standard FP8 and MXFP8 lies in the granularity of scaling. Traditional FP8 applies a single FP32 scaling factor across an entire tensor, which can limit representational accuracy for tensors with wide dynamic ranges and often necessitates the lower-precision E5M2 format, particularly for gradients. In contrast, NVIDIA Blackwell MXFP8 implements a block-level scaling strategy: each contiguous block of 32 values within a tensor (simplified to 4 values per block in Figure 2 for clarity) is assigned its own scaling factor, a process executed natively by the GPU’s Tensor Cores.

This fine-grained scaling mitigates quantization errors by enabling different segments of the tensor to align more effectively with the FP8 dynamic range, preserving both high and low magnitude components. This enhanced control often enables the more widespread adoption of the higher-precision E4M3 format across a broader spectrum of tensor types.

Convergence and speedup

Quantization techniques like FP8 training can drastically accelerate both the training and inference of LLMs, because fewer bits are needed to represent the tensor values. Fewer bits yield compute, memory, and bandwidth savings: matrix multiplications require fewer instructions to execute, values occupy less high-bandwidth memory and cache, and moving data between memory and compute cores is faster.
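
A rough back-of-the-envelope calculation illustrates the memory side of this for the weights of an 8B-parameter model alone (optimizer states, gradients, and activations are not counted here):

```python
params = 8e9  # 8B parameters
for fmt, bytes_per_value in [("FP32", 4), ("BF16", 2), ("FP8", 1)]:
    print(f"{fmt}: {params * bytes_per_value / 1e9:.0f} GB")
# FP32: 32 GB, BF16: 16 GB, FP8: 8 GB
```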

However, as quantization removes more and more bits, there’s a risk of degrading the convergence of LLM training. The reduced number of bits shrinks the mantissa and exponent of the floating-point values, reducing precision and dynamic range. As a result, outliers in the tensors may not be represented correctly, leading to degradation or even divergence during LLM training. Consequently, it is paramount to strike a balance between acceleration and convergence.

NVIDIA Transformer Engine handles FP8 on the NVIDIA Ada Lovelace and NVIDIA Hopper GPU architectures, and MXFP8 on Blackwell, to minimize degradation from quantization while achieving significant speedups. For example, Figure 3 compares the pretraining convergence of MXFP8 and the higher-precision BF16 for an 8B LLM.

Figure 3. Validation perplexity versus training tokens (in trillions) for BF16 and MXFP8 pretraining of the 8B-parameter Nemotron model

The plot in Figure 3 shows that the perplexity of MXFP8 closely follows that of BF16, indicating that MXFP8 converges as well as BF16.

Recipes overview

Figure 4. Validation loss versus training steps for BF16 and FP8 tensor-wise scaling, Nemotron 8B

When training large neural networks with FP8, how we handle the scaling between higher-precision formats (like FP32) and the lower-precision FP8 format is crucial for both performance and accuracy. These scaling strategies are often bundled into what we call “FP8 recipes.” Let’s explore two main categories: tensor scaling and block scaling.

Tensor scaling

In tensor scaling, a single scaling factor is determined and applied to all the elements within a given tensor. This approach is simpler but must be carefully managed due to the potential for wide dynamic ranges within a tensor.

Delayed scaling 

Delayed scaling is a strategy where the scaling factor used in the current training iteration isn’t calculated based on the immediate values of the tensor. Instead, it uses a history of the maximum absolute values (often called “amax”) observed for that tensor over several previous training steps. The system keeps track of the amax values seen in the last few iterations (the “amax history”). To determine the scaling factor for the current step, an algorithm (which could be as simple as taking the maximum amax from the history or something more sophisticated) is applied to this history. The calculated scaling factor is then used to convert the higher-precision tensor to FP8 for computation. After the FP8 operation, the current amax of the resulting tensor is recorded and added to the history for future scaling decisions.
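
A minimal sketch of that bookkeeping is shown below; it is illustrative only, with a clamp standing in for the actual FP8 cast, and Transformer Engine maintains the equivalent state internally rather than exposing a class like this:

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite E4M3 magnitude

class DelayedScaler:
    """The scale for the current step comes from the amax history of previous steps."""

    def __init__(self, history_len: int = 16, margin: int = 0):
        self.amax_history = torch.zeros(history_len)
        self.margin = margin

    def compute_scale(self) -> torch.Tensor:
        amax = self.amax_history.max().clamp(min=1e-12)     # reduce the history (here: max)
        return FP8_E4M3_MAX / (amax * 2.0 ** self.margin)   # map amax onto the FP8 range

    def quantize(self, x: torch.Tensor):
        scale = self.compute_scale()
        x_fp8 = (x * scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)  # stand-in for the FP8 cast
        # Record the current amax so it can influence *future* scaling decisions
        self.amax_history = torch.roll(self.amax_history, 1)
        self.amax_history[0] = x.abs().max()
        return x_fp8, scale                                  # dequantize later as x_fp8 / scale
```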

Per-tensor current scaling

This method determines the scaling factor for each tensor from the statistical properties of that tensor in the current forward or backward pass. In contrast to delayed scaling, the scaling factor is computed in the current training iteration, making this method more reactive to the immediate dynamic range of the tensor and potentially leading to more accurate quantization at each step.

Unlike delayed scaling, where transient spikes in training can contaminate the amax history and potentially lead to divergence, per-tensor current scaling dynamically adjusts the scale based on the present data range. This immediate responsiveness helps optimize the FP8 representation and has been observed to improve model convergence during training.
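
In sketch form, the only difference from delayed scaling is that the scale is derived from the tensor being quantized right now (again assuming the E4M3 maximum of 448):

```python
import torch

def current_scale(x: torch.Tensor, fp8_max: float = 448.0) -> torch.Tensor:
    # No history: use the amax of this tensor, in this forward or backward pass
    return fp8_max / x.abs().max().clamp(min=1e-12)
```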

Block scaling

Block scaling takes a more granular approach than tensor scaling. Instead of applying a single scaling factor to the entire tensor, it divides the tensor into smaller, contiguous blocks and assigns a separate scaling factor to each block.

MXFP8 

With the NVIDIA Blackwell architecture, MXFP8 exemplifies block scaling. Here, a tensor is divided into blocks of 32 consecutive values. Each of these 32-element blocks gets its own dedicated, power-of-2 scaling factor (E8M0 format). This scaling is handled directly at the hardware level by Blackwell Tensor Cores. 

By having a scaling factor for each small block, MXFP8 can better accommodate variations in magnitude within a single tensor. Regions with large values and regions with small values can be scaled more appropriately, leading to a more accurate representation in FP8. 
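
The sketch below mimics this idea on a tensor whose element count is a multiple of 32; the real conversion, including how the scales are rounded, is performed by Blackwell Tensor Cores, so treat this purely as an illustration:

```python
import torch

def mxfp8_style_scales(x: torch.Tensor, block: int = 32, fp8_max: float = 448.0):
    """One power-of-2 (E8M0-like) scale per block of 32 consecutive values."""
    blocks = x.reshape(-1, block)                        # assumes numel is a multiple of 32
    amax = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-38)
    exp = torch.floor(torch.log2(fp8_max / amax))        # E8M0 stores only an exponent,
    scales = torch.pow(2.0, exp)                         # so each scale is a power of two
    scaled = (blocks * scales).clamp(-fp8_max, fp8_max)  # stand-in for the FP8 (E4M3) cast
    return scaled.reshape_as(x), scales
```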

General FP8 

While MXFP8 has a fixed block size (32), the more general FP8 block scaling enables configurable block dimensions. A tensor can be divided into blocks of various sizes (such as 1×128 or 128×128), where all values within a block share the same scaling factor. These scaling factors are typically stored in FP32.

In block scaling, quantizing a tensor and quantizing its transpose don’t produce numerically identical results, so when both are required, the transpose must be re-computed from higher precision.

Memory implications

In NVIDIA Transformer Engine, scaling factors for FP8 training are stored internally within each module that utilizes FP8 precision, such as linear layers in transformer architectures. For standard FP8, each tensor is associated with a single scaling factor, which is stored as a 32-bit floating point (FP32) value. This scaling factor ensures that the dynamic range of the tensor’s values fits within the representable range of the FP8 format. 

In contrast, the MXFP8 variant employs block-wise scaling, where a separate scaling factor represented in an 8-bit exponent-only format (E8M0) is assigned to each block of 32 consecutive values, further optimizing memory usage. 
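
Counting scale storage alone, the overhead of these factors is small compared to the FP8 values themselves; for example, for a 4,096×4,096 tensor:

```python
N = 4096 * 4096                 # FP8 values in the tensor (1 byte each)
per_tensor_fp32_scale = 4       # standard FP8: one FP32 scale for the whole tensor
mxfp8_e8m0_scales = N // 32     # MXFP8: one E8M0 byte per block of 32 values

print(per_tensor_fp32_scale / N)  # ~2.4e-07 -> negligible
print(mxfp8_e8m0_scales / N)      # 0.03125  -> ~3% overhead for finer-grained scaling
```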

These scaling factors, along with amax histories used for dynamic adjustment, are managed automatically by the Transformer Engine’s internal buffers and are updated during training iterations. When saving checkpoints, the FP8 metadata, including scaling factors and amax histories, is stored under a dedicated key to ensure reproducibility and continuity across training sessions.
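
As a minimal usage sketch, an FP8 delayed-scaling recipe with Transformer Engine’s PyTorch API looks roughly like the following; argument names and defaults can differ between Transformer Engine versions, so check the documentation rather than treating this as definitive:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Hybrid format: E4M3 for forward-pass tensors, E5M2 for gradients
fp8_recipe = recipe.DelayedScaling(
    margin=0,
    fp8_format=recipe.Format.HYBRID,
    amax_history_len=16,
    amax_compute_algo="max",
)

layer = te.Linear(1024, 1024, bias=True).cuda()
x = torch.randn(16, 1024, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
y.sum().backward()  # gradients flow through the FP8 path as well
```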

Conclusion

In this post, we explored the fundamentals of FP8 precision and the recipes that make efficient large-scale training possible, from tensor and block scaling to the latest hardware innovations. By understanding the memory implications and practical considerations of each approach, practitioners can unlock significant speed and efficiency gains without compromising model quality.

Ready to optimize your LLM training? Explore the latest FP8 recipes in NVIDIA Transformer Engine.

Learn more 

At NVIDIA GTC 2025, several organizations showcased how they’re using FP8 precision to accelerate continual pre-training of LLMs while maintaining high accuracy. Explore the following resources for in-depth case studies, technical details, and practical insights:

  • Case Study: iGenius and NVIDIA DGX Cloud
    Discover how iGenius leveraged FP8 precision for continual pretraining of the Colosseum 355B LLM, achieving 82.04% accuracy on the MMLU benchmark with significant speed and memory improvements. The team details strategies for FP8 stability, such as early learning rate reduction and selective use of BF16 for sensitive layers.
  • GTC 2025 session: From FP8 LLM Training to Inference
    Learn how DeepL and others use FP8 for next-generation translation models, covering production deployment and lessons learned from large-scale FP8 training. 
  • GTC 2025 session: Building LLMs: Accelerating Pretraining of Foundational Models With FP8 Precision
    Explore how Zoho’s shift from the Fairseq framework to NVIDIA NeMo, along with the adoption of FP8 precision and NVIDIA H100 GPUs, improved LLM training speed, efficiency, and model quality. These innovations reduced pre-training costs and time while setting a new benchmark for operational excellence in LLM development.