Floating-Point 8: An Introduction to Efficient, Lower-Precision AI Training

With the growth of large language models (LLMs), deep learning is advancing both model architecture design and computational efficiency. Mixed precision training, which strategically employs lower-precision formats like brain floating point 16 (BF16) for computationally intensive operations while retaining the stability of 32-bit floating point (FP32) where needed, has been a key technique for accelerating training. Adopting lower-precision numerical formats promises faster computation and reduced memory consumption without sacrificing model accuracy. Now, the exploration of even finer-grained numerical formats, such as floating-point 8 (FP8), holds the promise of still greater efficiency without significant accuracy loss.

But how does FP8 work, and what makes it so effective? In this blog post, we’ll explore the fundamentals of FP8 training—its benefits, challenges, and common implementation approaches. We’ll also touch on the hardware architectures that support FP8 and share real-world success stories along with helpful resources.

FP8 format explanation

Modern LLMs need precision formats that balance computational efficiency with numerical stability. Although BF16 has long been the standard for efficient neural network training, the introduction of FP8 brings new, highly specialized formats that are finely tuned for the unique demands of different stages in a deep learning workflow. 

A key enabler of FP8 training’s speed and efficiency is the inclusion of dedicated FP8 Tensor Cores within the NVIDIA H100 architecture. 

Figure 1. Structure of the floating-point data types (top to bottom): FP16, BF16, FP8 E4M3, and FP8 E5M2

FP8 splits into two variants:

  • E4M3, a format with 4 exponent bits and 3 mantissa bits, prioritizes precision for forward passes, where weights and activations benefit from finer-grained values. Its approximate range of ±448, along with the ability to represent NaN, accommodates most layer outputs without overflow.
  • E5M2, short for 5 exponent bits and 2 mantissa bits, trades mantissa bits for a wider dynamic range (±57,344, plus ±inf and NaN). This broader range is crucial for backward passes, where gradients can vary significantly in magnitude.

BF16’s 8 exponent and 7 mantissa bits offer a vast dynamic range (roughly 1e-38 to 3e38), which enables it to represent the distributions of weights, activations, and gradients without scaling factors. FP8’s pair of formats (E4M3, with a range up to approximately ±448, and E5M2, up to approximately ±57,344), coupled with scaling factors, enables more efficient hardware utilization than BF16 without sacrificing convergence.
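
To make these layouts concrete, here is a minimal decoder for the two FP8 bit patterns, written as an illustrative sketch of the format definitions rather than code from any particular library; the helper name decode_fp8 is ours.

```python
def decode_fp8(bits: int, exp_bits: int, man_bits: int) -> float:
    """Decode an 8-bit pattern as FP8 E4M3 (exp_bits=4) or E5M2 (exp_bits=5)."""
    sign = -1.0 if (bits >> 7) & 1 else 1.0
    exp_field = (bits >> man_bits) & ((1 << exp_bits) - 1)
    man_field = bits & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1                      # 7 for E4M3, 15 for E5M2

    if exp_bits == 5 and exp_field == 0b11111:            # E5M2 keeps IEEE-style inf/NaN
        return sign * float("inf") if man_field == 0 else float("nan")
    if exp_bits == 4 and exp_field == 0b1111 and man_field == 0b111:
        return float("nan")                               # E4M3 has NaN but no infinities
    if exp_field == 0:                                    # subnormal numbers
        return sign * (man_field / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man_field / (1 << man_bits)) * 2.0 ** (exp_field - bias)

# Largest finite magnitudes match the ranges quoted above:
print(decode_fp8(0b0_1111_110, exp_bits=4, man_bits=3))   # 448.0   (E4M3)
print(decode_fp8(0b0_11110_11, exp_bits=5, man_bits=2))   # 57344.0 (E5M2)
```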

Why FP8 outshines integer formats for LLM training 

While an 8-bit integer (INT8) also saves memory, its fixed-point nature requires predefined scaling factors that struggle to accommodate the unpredictable and often extreme dynamic ranges of activations and gradients within transformer architectures. This can result in clipping or significant quantization noise. 

Floating-point formats like FP8 overcome this by enabling each number to have its own implicit “scale” through the exponent. For instance, the wide dynamic range of exponentiated scores in attention mechanisms (from near-zero to thousands) often leads to significant errors with INT8’s inherently fixed scaling. Additionally, gradient propagation in deep neural networks, regardless of whether the model has millions or hundreds of billions of parameters, can produce extreme values that FP8’s floating-point exponents are well-suited to represent, whereas INT8 formats often struggle with such a wide dynamic range. 
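
As a toy illustration of that failure mode (the values below are invented for the example), consider a tensor whose largest attention score is 1,000 while another entry is 0.003:

```python
# One fixed per-tensor INT8 scale, chosen to cover the largest value
amax = 1000.0
int8_scale = 127.0 / amax

small = 0.003
print(round(small * int8_scale))  # 0 -> the small value vanishes entirely in INT8

# FP8 E5M2 can still represent a nearby value, because each number carries
# its own exponent: 1.5 * 2**-9 = 0.0029296875
print(1.5 * 2**-9)
```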

NVIDIA Blackwell introduces microscaling formats

Figure 2. FP8 with a single per-tensor scaling factor stored in FP32, versus MXFP8 with multiple per-block scaling factors stored in E8M0

The latest NVIDIA Blackwell GPU architecture expands hardware support for low-precision numerical formats beyond FP8, adding narrower sub-FP8 formats such as FP6 and FP4 alongside its enhanced FP8 Blackwell Tensor Cores.

The fundamental distinction between standard FP8 and MXFP8 lies in the granularity of scaling. Traditional FP8 applies a single FP32 scaling factor across an entire tensor, which can limit representational accuracy for tensors with wide dynamic ranges and often necessitates the lower-precision E5M2 format, particularly for gradients. In contrast, NVIDIA Blackwell MXFP8 implements a block-level scaling strategy: each contiguous block of 32 values within a tensor (simplified to 4 values per block in Figure 2 for clarity) is assigned its own scaling factor, a process executed natively by the GPU’s Tensor Cores.

This fine-grained scaling mitigates quantization errors by enabling different segments of the tensor to align more effectively with the FP8 dynamic range, preserving both high and low magnitude components. This enhanced control often enables the more widespread adoption of the higher-precision E4M3 format across a broader spectrum of tensor types.

Convergence and speedup

Quantization techniques like FP8 training can drastically accelerate both the training and inference of LLMs, because fewer bits are needed to represent the tensor values. Fewer bits yield compute, memory, and bandwidth savings: matrix multiplications require fewer instructions to execute, values occupy less high-bandwidth memory and cache, and moving data between memory and compute cores is faster.
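
A rough back-of-the-envelope calculation illustrates the memory side of this for the weights of an 8B-parameter model alone (optimizer states, gradients, and activations are not counted here):

```python
params = 8e9  # 8B parameters
for fmt, bytes_per_value in [("FP32", 4), ("BF16", 2), ("FP8", 1)]:
    print(f"{fmt}: {params * bytes_per_value / 1e9:.0f} GB")
# FP32: 32 GB, BF16: 16 GB, FP8: 8 GB
```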

However, as quantization removes more and more bits, there’s a risk of degrading the convergence of LLM training. The reduced number of bits shrinks the mantissa and exponent of the floating-point values, reducing precision and dynamic range. As a result, outliers in the tensors may not be represented correctly, leading to degradation or even divergence during LLM training. Consequently, it is paramount to strike a balance between acceleration and convergence.

NVIDIA Transformer Engine handles FP8 on the NVIDIA Ada Lovelace and NVIDIA Hopper GPU architectures, and MXFP8 on Blackwell, to minimize degradation from quantization while achieving significant speedups. For example, Figure 3 compares the pretraining convergence of MXFP8 and the higher-precision BF16 for an 8B LLM.

Figure 3. Validation perplexity versus training tokens (in trillions) for BF16 and MXFP8 pretraining of the 8B-parameter Nemotron model

The plot in Figure 3 shows that the perplexity of MXFP8 closely follows that of BF16, indicating that MXFP8 converges as well as BF16.

Recipes overview

Figure 4. Validation loss versus training steps for BF16 and FP8 tensor-wise scaling, Nemotron 8B

When training large neural networks with FP8, how we handle the scaling between higher-precision formats (like FP32) and the lower-precision FP8 format is crucial for both performance and accuracy. These scaling strategies are often bundled into what we call “FP8 recipes.” Let’s explore two main categories: tensor scaling and block scaling.

Tensor scaling

In tensor scaling, a single scaling factor is determined and applied to all the elements within a given tensor. This approach is simpler but must be carefully managed due to the potential for wide dynamic ranges within a tensor.

Delayed scaling 

Delayed scaling is a strategy where the scaling factor used in the current training iteration isn’t calculated based on the immediate values of the tensor. Instead, it uses a history of the maximum absolute values (often called “amax”) observed for that tensor over several previous training steps. The system keeps track of the amax values seen in the last few iterations (the “amax history”). To determine the scaling factor for the current step, an algorithm (which could be as simple as taking the maximum amax from the history or something more sophisticated) is applied to this history. The calculated scaling factor is then used to convert the higher-precision tensor to FP8 for computation. After the FP8 operation, the current amax of the resulting tensor is recorded and added to the history for future scaling decisions.
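
A minimal sketch of that bookkeeping is shown below; it is illustrative only, with a clamp standing in for the actual FP8 cast, and Transformer Engine maintains the equivalent state internally rather than exposing a class like this:

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite E4M3 magnitude

class DelayedScaler:
    """The scale for the current step comes from the amax history of previous steps."""

    def __init__(self, history_len: int = 16, margin: int = 0):
        self.amax_history = torch.zeros(history_len)
        self.margin = margin

    def compute_scale(self) -> torch.Tensor:
        amax = self.amax_history.max().clamp(min=1e-12)     # reduce the history (here: max)
        return FP8_E4M3_MAX / (amax * 2.0 ** self.margin)   # map amax onto the FP8 range

    def quantize(self, x: torch.Tensor):
        scale = self.compute_scale()
        x_fp8 = (x * scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)  # stand-in for the FP8 cast
        # Record the current amax so it can influence *future* scaling decisions
        self.amax_history = torch.roll(self.amax_history, 1)
        self.amax_history[0] = x.abs().max()
        return x_fp8, scale                                  # dequantize later as x_fp8 / scale
```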

Per-tensor current scaling

This method determines the scaling factor for each tensor from the statistical properties of that tensor in the current forward or backward pass. In contrast to delayed scaling, the scaling factor is computed in the current training iteration, making this method more reactive to the immediate dynamic range of the tensor and potentially leading to more accurate quantization at each step.

Unlike delayed scaling, where transient spikes in training can contaminate the amax history and potentially lead to divergence, per-tensor current scaling dynamically adjusts the scale based on the present data range. This immediate responsiveness helps optimize the FP8 representation and has been observed to improve model convergence during training.
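
In sketch form, the only difference from delayed scaling is that the scale is derived from the tensor being quantized right now (again assuming the E4M3 maximum of 448):

```python
import torch

def current_scale(x: torch.Tensor, fp8_max: float = 448.0) -> torch.Tensor:
    # No history: use the amax of this tensor, in this forward or backward pass
    return fp8_max / x.abs().max().clamp(min=1e-12)
```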

Block scaling

Block scaling takes a more granular approach than tensor scaling. Instead of applying a single scaling factor to the entire tensor, it divides the tensor into smaller, contiguous blocks and assigns a separate scaling factor to each block.

MXFP8 

With the NVIDIA Blackwell architecture, MXFP8 exemplifies block scaling. Here, a tensor is divided into blocks of 32 consecutive values. Each of these 32-element blocks gets its own dedicated, power-of-2 scaling factor (E8M0 format). This scaling is handled directly at the hardware level by Blackwell Tensor Cores. 

By having a scaling factor for each small block, MXFP8 can better accommodate variations in magnitude within a single tensor. Regions with large values and regions with small values can be scaled more appropriately, leading to a more accurate representation in FP8. 
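
The sketch below mimics this idea on a tensor whose element count is a multiple of 32; the real conversion, including how the scales are rounded, is performed by Blackwell Tensor Cores, so treat this purely as an illustration:

```python
import torch

def mxfp8_style_scales(x: torch.Tensor, block: int = 32, fp8_max: float = 448.0):
    """One power-of-2 (E8M0-like) scale per block of 32 consecutive values."""
    blocks = x.reshape(-1, block)                        # assumes numel is a multiple of 32
    amax = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-38)
    exp = torch.floor(torch.log2(fp8_max / amax))        # E8M0 stores only an exponent,
    scales = torch.pow(2.0, exp)                         # so each scale is a power of two
    scaled = (blocks * scales).clamp(-fp8_max, fp8_max)  # stand-in for the FP8 (E4M3) cast
    return scaled.reshape_as(x), scales
```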

General FP8 

While MXFP8 has a fixed block size (32), the more general FP8 block scaling enables configurable block dimensions. A tensor can be divided into blocks of various sizes (such as 1×128 or 128×128), where all values within a block share the same scaling factor. These scaling factors are typically stored in FP32.

In block scaling, quantizing a tensor and quantizing its transpose don’t produce numerically identical results, so when both are required, the transpose must be re-computed from higher precision.

Memory implications

In NVIDIA Transformer Engine, scaling factors for FP8 training are stored internally within each module that utilizes FP8 precision, such as linear layers in transformer architectures. For standard FP8, each tensor is associated with a single scaling factor, which is stored as a 32-bit floating point (FP32) value. This scaling factor ensures that the dynamic range of the tensor’s values fits within the representable range of the FP8 format. 

In contrast, the MXFP8 variant employs block-wise scaling, where a separate scaling factor represented in an 8-bit exponent-only format (E8M0) is assigned to each block of 32 consecutive values, further optimizing memory usage. 
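
Counting scale storage alone, the overhead of these factors is small compared to the FP8 values themselves; for example, for a 4,096×4,096 tensor:

```python
N = 4096 * 4096                 # FP8 values in the tensor (1 byte each)
per_tensor_fp32_scale = 4       # standard FP8: one FP32 scale for the whole tensor
mxfp8_e8m0_scales = N // 32     # MXFP8: one E8M0 byte per block of 32 values

print(per_tensor_fp32_scale / N)  # ~2.4e-07 -> negligible
print(mxfp8_e8m0_scales / N)      # 0.03125  -> ~3% overhead for finer-grained scaling
```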

These scaling factors, along with amax histories used for dynamic adjustment, are managed automatically by the Transformer Engine’s internal buffers and are updated during training iterations. When saving checkpoints, the FP8 metadata, including scaling factors and amax histories, is stored under a dedicated key to ensure reproducibility and continuity across training sessions.
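
As a minimal usage sketch, an FP8 delayed-scaling recipe with Transformer Engine’s PyTorch API looks roughly like the following; argument names and defaults can differ between Transformer Engine versions, so check the documentation rather than treating this as definitive:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Hybrid format: E4M3 for forward-pass tensors, E5M2 for gradients
fp8_recipe = recipe.DelayedScaling(
    margin=0,
    fp8_format=recipe.Format.HYBRID,
    amax_history_len=16,
    amax_compute_algo="max",
)

layer = te.Linear(1024, 1024, bias=True).cuda()
x = torch.randn(16, 1024, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
y.sum().backward()  # gradients flow through the FP8 path as well
```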

Conclusion

In this post, we explored the fundamentals of FP8 precision and the recipes that make efficient large-scale training possible, from tensor and block scaling to the latest hardware innovations. By understanding the memory implications and practical considerations of each approach, practitioners can unlock significant speed and efficiency gains without compromising model quality.

Ready to optimize your LLM training? Explore the latest FP8 recipes in NVIDIA Transformer Engine.

Learn more 

At NVIDIA GTC 2025, several organizations showcased how they’re using FP8 precision to accelerate continual pre-training of LLMs while maintaining high accuracy. Explore the following resources for in-depth case studies, technical details, and practical insights:

  • Case Study: iGenius and NVIDIA DGX Cloud
    Discover how iGenius leveraged FP8 precision for continual pretraining of the Colosseum 355B LLM, achieving 82.04% accuracy on the MMLU benchmark with significant speed and memory improvements. The team details strategies for FP8 stability, such as early learning rate reduction and selective use of BF16 for sensitive layers.
  • GTC 2025 session: From FP8 LLM Training to Inference
    Learn how DeepL and others use FP8 for next-generation translation models, covering production deployment and lessons learned from large-scale FP8 training. 
  • GTC 2025 session: Building LLMs: Accelerating Pretraining of Foundational Models With FP8 Precision
    Explore how Zoho’s shift from the Fairseq framework to NVIDIA NeMo, along with the adoption of FP8 precision and NVIDIA H100 GPUs, improved LLM training speed, efficiency, and model quality. These innovations reduced pre-training costs and time while setting a new benchmark for operational excellence in LLM development.