Making Softmax More Efficient with NVIDIA Blackwell Ultra

LLM context lengths are exploding, and architectures are moving toward complex attention schemes like Multi-Head Latent Attention (MLA) and Grouped Query Attention (GQA). As a result, AI “speed of thought” is increasingly governed not by the massive throughput of matrix multiplications, but by the transcendental math of the softmax function.

Transcendentals refer to functions that cannot be expressed as the root of a polynomial equation with rational coefficients. Consequently, they “transcend” basic algebraic operations like addition and multiplication—the exact operations Tensor Cores excel at. In the specific context of softmax, the most computationally expensive of these transcendentals is the natural exponential function, which is executed on Special Function Units (SFUs). In NVIDIA assembly instructions (SASS), this function is invoked via the MUFU.EX2 instruction. This architectural split creates a softmax bottleneck within the attention block, where powerful matrix engines are forced to idle while waiting for the SFU datapaths to normalize attention scores.

NVIDIA Blackwell Ultra alleviates this bottleneck by doubling SFU throughput over the standard NVIDIA Blackwell architecture.

This blog dives into the mechanics of softmax within the attention loop, explores how Blackwell Ultra’s hardware optimizations eliminate pipeline stalls, and provides a benchmark for you to measure the raw MUFU.EX2 speedup for yourself.

How attention works

A foundational component of modern large language models is the attention mechanism, which allows a model to dynamically transform static token vectors into dynamic, context-aware representations. At its core, it is a process of re-weighting information by allowing tokens to adjust their importance to one another. To facilitate this interaction, every token in a sequence is projected into three functional roles:

  • Query: Represents what the current token is seeking in order to understand its own context. 
  • Key: Represents a token’s profile that other tokens use for matching. Tokens earlier in the sequence have keys that signal their specific relevance to the query. 
  • Value: Holds the actual informational content. Once a match is confirmed between a query and a key, the value is the specific data that is transferred to the original token.
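The three roles above combine in scaled dot-product attention. The following NumPy sketch is a simplified, single-head illustration (not NVIDIA kernel code) of how queries, keys, and values produce context-aware outputs:

```python
import numpy as np

def attention(Q, K, V):
    """Single-head scaled dot-product attention (illustrative sketch)."""
    d_k = Q.shape[-1]
    # Raw compatibility scores between every query and every key
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax: normalize each row of scores into probabilities summing to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Blend value vectors according to the attention weights
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d_model = 8, 16
Q = rng.standard_normal((seq_len, d_model))
K = rng.standard_normal((seq_len, d_model))
V = rng.standard_normal((seq_len, d_model))
out = attention(Q, K, V)
print(out.shape)  # (8, 16)
```

Each output row is a weighted mixture of value vectors, which is exactly the "pulling in" of neighboring content described below.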

Figure 1 below shows attention in action. We have two sentences that use the word “dog” with two different meanings. Initially, we can see that the embeddings (the numerical vectors that capture meaning and nuance in a multidimensional space) of both “dog” mentions are identical.

A GIF diagram showing how attention builds context by using previous tokens in the sequence to modify the embeddings on the current token.
Figure 1. Context building through attention

Attention operates with the model calculating a dot product between the query for “dog” and the keys of every other token in the sequence. 

If the query for “dog” aligns well with the key for “lazy,” it indicates a high degree of relevance. This interaction is what allows the word “dog” to pull in the specific value of its neighbor. By the end of this cycle, the original vector for “dog” has been numerically updated with the content of its neighbors, evolving from a generic dictionary definition into a contextualized embedding that “understands” whether it refers to a lethargic animal or the sweltering peak of a season.

How softmax relates to attention

Softmax serves as the critical decision-making phase that converts raw compatibility scores into actionable weights. Once the initial dot products are calculated between queries and keys, the resulting scores are passed through the softmax function to be normalized into probabilities that sum to exactly one. This step is what determines the “attention span” of the model, effectively deciding which tokens to prioritize and which to ignore. Without softmax, the model would have no way to objectively weigh the information it gathers, leading to an unmanageable and noisy blend of data.
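As a concrete sketch, softmax is typically computed in the max-subtracted form shown below (standard numerical practice, not specific to any NVIDIA library), so the exponentials cannot overflow:

```python
import numpy as np

def softmax(x):
    # Subtract the row max first: exp() of large logits would overflow,
    # and the subtraction leaves the final probabilities unchanged.
    shifted = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(shifted)          # one transcendental per score
    return e / np.sum(e, axis=-1, keepdims=True)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # ≈ [0.659 0.242 0.099]
print(probs.sum())  # sums to one
```

Note that every element of the attention matrix requires one exponential, which is the traffic that lands on the SFUs.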

However, the softmax operation is the primary source of the “performance cliff” seen in long-context AI. Because every token in a sequence must be compared against every other token, a sequence of 8,192 tokens creates a massive [8,192 x 8,192] attention matrix per head. Normalizing these matrices across all heads and layers requires billions of transcendental calculations, a cost that grows quadratically with the sequence length. This creates a bottleneck, where the sheer volume of transcendental math can stall the entire inference pipeline. 
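To put rough numbers on this scaling (the head and layer counts below are hypothetical, chosen only to illustrate the arithmetic):

```python
# Exponentials needed to normalize attention grow with seq_len**2.
# Hypothetical model shape for illustration: 32 heads, 60 layers.
n_heads, n_layers = 32, 60
for seq_len in (2_048, 8_192, 32_768):
    exps_per_head = seq_len ** 2
    total = exps_per_head * n_heads * n_layers
    print(f"{seq_len:>6} tokens: {exps_per_head:>13,} exp/head, "
          f"{total / 1e9:.1f}B exp per forward pass")
```

At 8,192 tokens this illustrative shape already demands over a hundred billion exponentials per pass, and quadrupling the context multiplies that by sixteen.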

Blackwell Ultra puts focus on accelerating these exponential calculations specifically to alleviate this mathematical bottleneck and ensure that the system can handle the massive normalization required for large context windows without sacrificing throughput.

Alleviating the softmax bottleneck in Blackwell Ultra

By doubling the throughput of the SFU for exponentials in the Blackwell Ultra architecture, NVIDIA is alleviating this bottleneck and is allowing for a more balanced and efficient processing pipeline. This results in faster overall performance, especially for tasks that are heavy on attention mechanisms.

Figure 2 below illustrates the sequential dependency inherent in the standard attention mechanism, often referred to as the attention loop, as run on the previous generation NVIDIA Blackwell (GB200). Note that the Streaming Multiprocessor (SM) loads two thread blocks running attention loops concurrently. These separate attention loops are denoted in the two different shades of green.

This pipeline consists of three distinct phases that must execute in order:

  • BMM1 (score calculation): The Tensor Cores perform a matrix multiplication to calculate the raw attention scores, or logits.
  • Softmax (normalization): The pipeline shifts to the SFUs to normalize these scores into probabilities using exponential functions.
  • BMM2 (context aggregation): The pipeline returns to the Tensor Cores to multiply the probabilities by the value vectors.
A GIF diagram showing how the extended duration of the softmax phase creates a timing mismatch in the pipeline. That forces the high-speed Tensor Cores responsible for BMM1 and BMM2 to sit idle while waiting for the normalization step to complete.
Figure 2. The Blackwell attention loop

The timeline illustrates the latency constraints inherent in the Blackwell GPU during the execution of the attention kernel. Because the second matrix multiplication (BMM2) acts on the output of the softmax, it cannot begin until the normalization is complete. 

The lower throughput of the Blackwell GPU’s SFUs forces the Tensor Cores to idle between the score calculation (BMM1) and the context aggregation (BMM2). This dependency prevents the pipeline from fully saturating the compute resources and extends the duration of the softmax operation.
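The sequential dependency can be sketched in NumPy using the standard tiled, online-softmax formulation (an illustrative model of the loop, not NVIDIA's kernel code): within each tile, BMM2 cannot begin until that tile's softmax has finished.

```python
import numpy as np

def attention_loop(Q, K, V, tile=4):
    """Tiled attention with online softmax; illustrative only."""
    n, d = Q.shape
    out = np.zeros_like(Q)
    m = np.full(n, -np.inf)   # running row max
    l = np.zeros(n)           # running softmax denominator
    for start in range(0, n, tile):
        Kt, Vt = K[start:start+tile], V[start:start+tile]
        s = Q @ Kt.T / np.sqrt(d)            # BMM1: partial scores
        m_new = np.maximum(m, s.max(axis=1))
        p = np.exp(s - m_new[:, None])       # softmax: the SFU-bound exp
        scale = np.exp(m - m_new)            # rescale earlier partial sums
        l = l * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ Vt  # BMM2: must wait on softmax
        m = m_new
    return out / l[:, None]

rng = np.random.default_rng(1)
Q = rng.standard_normal((8, 16))
K = rng.standard_normal((8, 16))
V = rng.standard_normal((8, 16))
tiled = attention_loop(Q, K, V)
```

In the sketch, the `p = np.exp(...)` line sits on the critical path between the two matrix multiplications, which is precisely where slow SFUs stall the Tensor Cores.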

The next timeline, shown in Figure 3, demonstrates the direct impact of the doubled SFU throughput of the Blackwell Ultra GPUs that power NVIDIA GB300 NVL72 and NVIDIA HGX B300 systems on the same instruction sequence.

Doubling the SFU throughput significantly shrinks the softmax execution time, closing the idle gaps between matrix operations and allowing the Tensor Cores to maintain near-peak utilization.
Figure 3. The Blackwell Ultra attention loop

Visually, the width of the softmax blocks is reduced by almost 50%, reflecting the hardware’s ability to process MUFU instructions at twice the rate.

This reduction in softmax latency tightens the entire pipeline. The gap between BMM1 and BMM2 is drastically minimized, allowing the Tensor Cores to switch between the query-key multiplication and the probability-value multiplication with minimal stalling. The result is a denser main loop where the high-performance matrix engines spend a larger percentage of the total execution time active, directly translating to higher overall inference throughput.

Benchmarking MUFU.EX2 performance

To empirically verify the theoretical throughput of the MUFU pipeline, we can construct a synthetic micro-benchmark. The following kernel code isolates the exponential instructions to measure the raw cycle count without interference from global memory latency or other arithmetic operations.

This test harness launches a grid of threads where each thread performs a dense loop of MUFU.EX2 instructions. By timing the execution and comparing it against the clock frequency, you can directly calculate the effective instruction throughput and validate the bandwidth saturation point mentioned earlier.
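The throughput arithmetic the harness performs can be sketched as follows; the thread count, loop length, and elapsed time below are made-up placeholders for illustration, not measured values:

```python
def mufu_throughput_gops(n_threads, exps_per_thread, elapsed_s, flops_per_op=1):
    """Effective MUFU.EX2 rate: total exponentials divided by wall time."""
    ops = n_threads * exps_per_thread
    gops = ops / elapsed_s / 1e9
    # Packed types such as BF16x2 evaluate two exponentials per
    # instruction, so flops_per_op=2 doubles the reported GFLOPS.
    return gops, gops * flops_per_op

# Placeholder numbers: 1,048,576 threads, 4,096 EX2 each, 0.8 ms elapsed
gops, gflops = mufu_throughput_gops(1_048_576, 4_096, 0.8e-3)
print(f"{gops:,.0f} Gop/s")  # → 5,369 Gop/s
```

This also explains the pairing in the sample results below: a BF16x2 rate in Gop/s corresponds to twice that figure in GFLOPS.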

Step 1: Clone the following repository to pull the exp2-bg300.cu benchmark.

git clone https://github.com/jamieliNVIDIA/mufu_ex2_bench.git
cd mufu_ex2_bench

Step 2: Compile with nvcc (using sm_103a for GB300 or sm_100f for GB200).

nvcc -O3 -gencode=arch=compute_103a,code=sm_103a --extended-lambda -o /tmp/exp2-gb300.out exp2-gb300.cu

Sample results

We see that GB300 delivers about 2x the FLOPS of GB200 for all tested data types, in line with the doubled SFU throughput.

Blackwell (GB200)

exp2 BF16x2 2454 Gop/s (4908 GFLOPS)
exp2 BF16 4938 Gop/s
exp2 FP32 4943 Gop/s

Blackwell Ultra (GB300)

exp2 BF16x2 4996 Gop/s (9992 GFLOPS)
exp2 BF16 9738 Gop/s
exp2 FP32 10024 Gop/s
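Taking the sample numbers above, the generation-over-generation ratios confirm the ~2x claim:

```python
# Sample Gop/s figures from the runs above
gb200 = {"BF16x2": 2454, "BF16": 4938, "FP32": 4943}
gb300 = {"BF16x2": 4996, "BF16": 9738, "FP32": 10024}

for dtype in gb200:
    ratio = gb300[dtype] / gb200[dtype]
    print(f"exp2 {dtype}: {ratio:.2f}x")
# → 2.04x (BF16x2), 1.97x (BF16), 2.03x (FP32)
```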

Attention forward propagation performance in Blackwell vs Blackwell Ultra

The transition from Blackwell to Blackwell Ultra delivers a targeted increase in compute throughput driven by a 2x increase in SFU performance. This hardware upgrade directly accelerates the forward propagation (FPROP) pipeline for models like DeepSeek-V3.

FPROP is the process where input data travels “forward” through the neural network—from the input layer, through the hidden layers, to the output layer—to generate a prediction. Every time the model produces a single new word, it must run one complete FPROP pass.

Figure 4 below shows that by doubling the throughput of the SFUs, the GB300 drastically reduces the execution time of the softmax layers within the attention blocks. This faster normalization means the GPU spends less time processing the attention scores and more time utilizing the high-speed matrix engines for the next layer’s computation, directly increasing the overall speed of the forward pass.

A bar chart showing how GB300 demonstrates 1.35x end-to-end FPROP performance over GB200 in FP8.
Figure 4. GB300 vs GB200 FLOPS in forward propagation in a grouped query attention (GQA) model.

The benchmark results highlight a ~35% increase in FPROP throughput for FP8 operations. This gain is particularly pronounced in FP8 because the matrix math is already extremely fast; in this low-precision regime, the time spent on softmax becomes a larger percentage of the total step.
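This follows from Amdahl's law. The softmax fraction used below is an assumed figure chosen to illustrate the arithmetic, not a measured value: if softmax were roughly half of an FP8 attention step, doubling SFU throughput would yield about the observed 1.35x.

```python
def amdahl_speedup(fraction_accelerated, factor):
    """Overall speedup when a fraction of runtime gets `factor`x faster."""
    return 1.0 / ((1.0 - fraction_accelerated) + fraction_accelerated / factor)

# Assumed: softmax is ~52% of the FP8 step; the SFU path gets 2x faster.
print(f"{amdahl_speedup(0.52, 2.0):.2f}x")  # → 1.35x
```

The same formula shows why higher-precision runs benefit less: when the matrix math dominates, the accelerated fraction shrinks and the end-to-end gain shrinks with it.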

Getting started

The performance dynamics of DeepSeek-V3 on the Blackwell Ultra highlight a critical, but often overlooked bottleneck in inference: the computational cost of non-linear operations.

By optimizing and compressing the attention mechanism, state-of-the-art models effectively increase the density of softmax operations relative to standard linear computations, exposing the SFUs as a governor of total throughput.

Blackwell Ultra directly addresses this bottleneck. By doubling the throughput of these specialized units, Blackwell Ultra unblocks the transcendental traffic jam that previously forced the powerful Tensor Cores to idle. The benchmark results confirm the impact, demonstrating a 35% gain in FP8 forward propagation. 

For modern, highly optimized architectures, the path to faster inference isn’t just about faster Tensor Cores; it’s also about ensuring the non-linear math units are fast enough to keep up.

Visit NVIDIA’s trtllm-gen repository for more benchmarks and information on utilizing this SFU speedup in workloads. Doubling the throughput of the SFUs for MUFU.EX2 is just one of many features that enable Blackwell Ultra’s fast attention speed. NVIDIA’s extreme hardware-software codesign accelerates the full attention loop through technologies such as: 

  • Offloading critical “find-max” reductions to the Tensor Memory controller via LDTM.STAT.
  • Optimizing performance using CUDNN.
  • Optimizing KVCache data movements using NVFP4.

Stay tuned to the NVIDIA technical blog for future posts.

Acknowledgements

Special thanks to the cuDNN engineering team for creating the benchmarks and building the software optimizations that make this cutting-edge performance possible.
