The transformer architecture is the foundational breakthrough driving the generative AI revolution, powering large language models (LLMs) like GPT, DeepSeek, and Llama. The key to the transformer architecture is the self-attention mechanism, which enables models to process an entire input sequence in parallel rather than word by word. This parallelism makes it possible to capture long-range dependencies.
While the self-attention mechanism is powerful, its computational and memory complexity is quadratic in sequence length. This creates a memory bottleneck when dealing with the long context windows of modern LLMs.
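To see where the quadratic cost comes from, here is a minimal sketch of standard (unfused) attention in NumPy. The function name and shapes are illustrative, not taken from any particular library.

```python
import numpy as np

def standard_attention(Q, K, V):
    """Naive single-head attention: Q, K, V are (N, d) arrays.

    Both `scores` and `weights` are full N x N matrices, so memory grows
    quadratically with the sequence length N.
    """
    d = Q.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d)                 # (N, N) logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (N, d) output
```

Materializing those N x N intermediates, and moving them between separate kernels, is the memory traffic FlashAttention is designed to avoid.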
In this post, we’ll discuss FlashAttention, an algorithmic breakthrough that mitigates this bottleneck by reducing memory traffic and memory complexity.
What is FlashAttention?
FlashAttention is an input/output-aware (IO-aware) algorithm that computes the same mathematical result as standard attention, but more efficiently. FlashAttention achieves this with:
- Reduced memory access: It minimizes the slow transfer of data between a GPU’s main high-bandwidth memory (HBM) and the faster but much smaller on-chip static random access memory (SRAM) by combining computational steps (like matrix multiplication and softmax) into a single optimized GPU kernel, a technique called kernel fusion.
- Near-linear memory usage: Using techniques such as tiling (breaking the computation into smaller blocks) and online softmax (normalizing the distribution incrementally), FlashAttention reduces the memory complexity from O(N²) to O(N) with respect to sequence length N (see the sketch below).
These optimizations lead to faster training and inference. They also enable models to handle longer token sequences, for applications such as maintaining long-running conversations or processing high-resolution images.
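As a rough illustration of tiling and online softmax, the sketch below processes the keys and values in blocks and keeps only per-row running statistics, so the full N x N score matrix is never formed. It is a simplified NumPy rendering of the idea, not the fused GPU kernel, and it omits masking, multiple heads, and query-block tiling.

```python
import numpy as np

def flash_attention_forward(Q, K, V, block_size=128):
    """Tiled attention forward pass (single head, no masking).

    K and V are processed block by block while a running max `m` and a
    running softmax denominator `l` are maintained per query row, so only
    an (N, block_size) slice of scores exists at any time.
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    m = np.full(N, -np.inf)   # running row-wise max of the logits
    l = np.zeros(N)           # running softmax denominators

    for start in range(0, N, block_size):
        k_blk = K[start:start + block_size]
        v_blk = V[start:start + block_size]
        s = (Q @ k_blk.T) * scale                  # (N, block) logits

        m_new = np.maximum(m, s.max(axis=1))       # updated running max
        p = np.exp(s - m_new[:, None])             # block probabilities
        correction = np.exp(m - m_new)             # rescale old partial sums
        l = l * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ v_blk
        m = m_new

    return out / l[:, None]
```

Because each block of scores can stay in fast on-chip memory, only the O(N) running statistics and the final output need to touch HBM.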


FlashAttention-4
FlashAttention-4 (FA4) is the latest iteration of the FlashAttention optimized CUDA kernels, delivering a leap in efficiency. It’s hardware-software co-designed and tailored to maximize performance on the NVIDIA Blackwell architecture, found in systems like the NVIDIA HGX B200.
FA4 achieves a peak performance of 1,605 TFLOPS, harnessing 71% of the hardware’s theoretical maximum. By redesigning the attention mechanism to address Blackwell’s asymmetric scaling (where compute power scales much faster than memory bandwidth), FA4 outperforms standard baselines, delivering up to a 1.3x speedup over NVIDIA cuDNN and 2.4x over NVIDIA Triton Inference Server implementations.
These gains extend to the backward pass, where FA4 uses tensor memory (TMEM), dedicated memory located close to the Tensor Cores that provides capacity beyond the register file, to bypass register accumulation and relieve register pressure. This enables larger tiles (up to 128×128) and deeper pipelines, while reducing shared memory (SMEM) traffic and maximizing operation overlap. As a result, training speed keeps pace with the doubled throughput of the new Tensor Cores rather than being bottlenecked by memory logistics.
FA4 co-designs the algorithm and kernel implementation around the following new features and mitigation strategies for Blackwell:
| Blackwell hardware feature | Bottleneck | FA4 technique |
| --- | --- | --- |
| TMEM – 256 KB on-chip memory per SM; 5th-gen tensor cores asynchronously write outputs directly to TMEM | Standard backward passes overuse shared memory (SMEM) for intermediates, creating a bandwidth bottleneck relative to tensor cores | TMEM-based backward pass: FA4 stores backward intermediates (S, P, dP, dS, dQ) directly in TMEM, drastically reducing SMEM traffic |
| SMEM | SMEM bandwidth becomes limiting as tensor core performance scales faster than memory movement | Reduced SMEM pressure by relocating intermediates to TMEM |
| Asymmetric scaling | Tensor Core throughput roughly doubles (~2.25 PFLOPS), while MUFU throughput remains unchanged from the prior generation (16 ops/clock) | Compute rebalancing to reduce reliance on MUFU-heavy paths |
| Exponential units (MUFU) | Softmax exponentials dominate runtime, exceeding matmul time by ~25–60% | Software-emulated exponentials using FMA-based polynomial approximations alongside MUFU |
| Expanded MMA tile size (128×128) | Larger tiles increase register pressure and impose stricter scheduling constraints | New CTA scheduling and register allocation, including LPT scheduling for causal masking |
| Fully asynchronous tensor cores | Sequential MMA–softmax dependencies can leave compute units idle if not overlapped | Redesigned asynchronous pipelines to maximize overlap across MMA, softmax, and memory operations |
| Finite non-matmul resources | Non-matmul ALUs scale more slowly than tensor cores | Algorithmic minimization of non-matmul work |
| Online softmax | Redundant rescaling wastes non-matmul cycles | Conditional softmax rescaling, updating only when the running max crosses a threshold (sketched below the table) |
| CUDA 13 and CUDA-X tooling | Kernel complexity slows tuning and optimization | Kernel-level graphs and performance tools used to optimize FA4 kernels |
| Developer productivity | Complex C++ templates slow compile times and hinder iteration | CuTe DSL in Python, achieving 20–30× faster compile times than FA3 while preserving kernel expressivity |
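To make the conditional softmax rescaling row concrete, here is a hypothetical NumPy sketch for a single query row: the running accumulator is rescaled only when a new block raises the running max past a threshold, rather than on every block. The function and parameter names are illustrative and are not taken from the FA4 kernels.

```python
import numpy as np

def attention_row_conditional_rescale(q, K_blocks, V_blocks, threshold=0.0):
    """Online softmax for one query vector with conditional rescaling.

    The partial output `acc` and denominator `l` are rescaled only when a
    block's max logit exceeds the running max `m` by more than `threshold`.
    Skipping the rescale leaves the result mathematically unchanged
    (softmax is shift-invariant); it only relaxes the numerical reference
    point, which is safe for small thresholds, and saves non-matmul work.
    """
    d = q.shape[-1]
    m, l = -np.inf, 0.0
    acc = np.zeros(V_blocks[0].shape[-1])

    for K_blk, V_blk in zip(K_blocks, V_blocks):
        s = (q @ K_blk.T) / np.sqrt(d)           # logits for this block
        block_max = s.max()
        if block_max > m + threshold:
            # Running max increased enough: rescale previous partial results.
            rescale = np.exp(m - block_max) if np.isfinite(m) else 0.0
            acc *= rescale
            l *= rescale
            m = block_max
        p = np.exp(s - m)                        # unnormalized probabilities
        l += p.sum()
        acc += p @ V_blk

    return acc / l
```

Blocks that don’t move the running max skip the redundant rescaling work, relieving pressure on the scarce non-matmul units identified in the table.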
The forward and backward pass performance gains on a Blackwell GPU for different sequence sizes are shown in Figures 1 and 2, respectively.


Learn more
The FlashAttention-4 algorithm was developed through hardware-software co-design and a kernel pipeline that mitigates the bottlenecks of modern accelerators. FA4 uses the NVIDIA Blackwell Tensor Core and Tensor Memory architecture to increase performance and power efficiency, especially in multi-GPU, multi-node (MGMN) distributed configurations. The forward and backward pass kernel design incorporates optimizations that achieve speedups over previous versions of the FlashAttention algorithm.
Inference frameworks such as SGLang and vLLM are compatible with FlashAttention-4 prefill, and NVIDIA has incorporated FA4 techniques into NVIDIA cuDNN 9.14.
Learn more about cuDNN and unlocking deep learning performance on Blackwell with cuDNN.