NVIDIA has achieved a world-record large language model (LLM) inference speed. A single NVIDIA DGX B200 node with eight NVIDIA Blackwell GPUs can achieve over 1,000 tokens per second (TPS) per user on the 400-billion-parameter Llama 4 Maverick model, the largest and most powerful model available in the Llama 4 collection. This speed was independently measured by the AI benchmarking service Artificial Analysis.
This record reinforces that Blackwell is the optimal hardware for Llama 4 in any deployment scenario, whether the task is maximizing throughput or minimizing latency. NVIDIA Blackwell is the first platform to break the 1,000 TPS/user milestone on this model, and it reaches 72,000 TPS/server in its highest-throughput configuration.
NVIDIA made extensive software optimizations using TensorRT-LLM to get the most from Blackwell GPUs, and trained a speculative decoding draft model using EAGLE-3 techniques. Combining these approaches, NVIDIA has achieved a 4x speed-up relative to the best prior Blackwell baseline. If you have access to B200 hardware, you can set up this accelerated Llama 4 Maverick endpoint yourself using this deployment guide.
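As a rough starting point, the sketch below shows how such an endpoint could be stood up with the TensorRT-LLM Python LLM API. The checkpoint ID, parallelism, and sampling settings are illustrative assumptions; the deployment guide documents the exact FP8 and EAGLE-3 configuration behind the record.

```python
# Minimal sketch (not the record configuration): serving Llama 4 Maverick with the
# TensorRT-LLM Python LLM API on a single 8-GPU node. The checkpoint ID and settings
# below are assumptions for illustration; follow the deployment guide for the tested recipe.
from tensorrt_llm import LLM, SamplingParams

def main():
    llm = LLM(
        model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",  # assumed Hugging Face checkpoint ID
        tensor_parallel_size=8,                                  # one DGX B200 node: 8 Blackwell GPUs
    )
    params = SamplingParams(max_tokens=128, temperature=0.0)
    outputs = llm.generate(["Summarize speculative decoding in one sentence."], params)
    print(outputs[0].outputs[0].text)

if __name__ == "__main__":
    main()
```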
Model accuracy results
The optimizations described below significantly increase performance while preserving response accuracy. We leveraged FP8 data types for GEMMs, Mixture of Experts (MoE), and Attention operations to reduce the model size and make use of the high FP8 throughput possible with Blackwell Tensor Core technology. Accuracy when using the FP8 data format matches that of Artificial Analysis BF16 across many metrics, as shown in the table below:
| | LiveCodeBench | AIME 2024 | GPQA Diamond | MATH-500 |
|---|---|---|---|---|
| AA Reference Llama 4 Maverick (BF16) | 0.397 | 0.39 | 0.671 | 0.889 |
| Optimized Llama 4 Maverick (FP8) | 0.383 | 0.40 | 0.686 | 0.876 |
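For intuition about the data format, here is a minimal, self-contained PyTorch sketch of per-tensor FP8 (E4M3) quantization of GEMM operands. It is an illustration only; the record configuration applies FP8 through TensorRT-LLM's Blackwell Tensor Core kernels rather than this Python reference.

```python
# Illustrative per-tensor FP8 (E4M3) quantize/dequantize of GEMM operands in PyTorch.
# This is a numerical sketch only, not the production quantization path.
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in E4M3

def quantize_fp8(x: torch.Tensor):
    """Return the FP8 tensor plus the per-tensor scale needed to dequantize it."""
    scale = x.abs().amax().clamp(min=1e-12) / FP8_E4M3_MAX
    return (x / scale).to(torch.float8_e4m3fn), scale

def dequantize(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return x_fp8.to(torch.bfloat16) * scale

a = torch.randn(4, 8, dtype=torch.bfloat16)    # activation tile
w = torch.randn(8, 16, dtype=torch.bfloat16)   # weight tile
a8, sa = quantize_fp8(a)
w8, sw = quantize_fp8(w)
error = (dequantize(a8, sa) @ dequantize(w8, sw) - a @ w).abs().max()
print(error)  # small: the FP8 GEMM closely tracks the BF16 result
```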
Why does minimizing latency matter?
Most generative AI application contexts require a balance of throughput and latency, ensuring that many customers can simultaneously enjoy a “good enough” experience. However, for critical applications that must make important decisions at speed, minimizing latency for a single client becomes paramount. As the TPS/user record shows, Blackwell hardware is the best choice for any task—whether you need to maximize throughput, balance throughput and latency, or minimize latency for a single user (the focus of this post).
Optimizations for minimum latency
Below is an overview of the kernel optimizations and fusions (denoted by red dashed squares in the figure) that NVIDIA applied during inference. NVIDIA implemented several low-latency GEMM kernels and applied various kernel fusions (such as FC13 + SwiGLU, FC_QKV + attn_scaling, and AllReduce + RMSnorm) to ensure that Blackwell excels in the minimum-latency scenario.

CUDA kernel optimizations and fusions
NVIDIA optimized the CUDA kernels for GEMMs, MoE, and Attention operations to achieve the best performance on the Blackwell GPUs.
- Utilized spatial partitioning (also known as warp specialization) and designed the GEMM kernels to load data from memory in an efficient manner to maximize utilization of the enormous memory bandwidth that the NVIDIA DGX system offers—64TB/s HBM3e bandwidth in total.
- Shuffled the GEMM weights into a swizzled format to achieve a better layout when loading the computation results from Tensor Memory after the matrix multiplications performed on Blackwell’s fifth-generation Tensor Cores.
- Optimized the performance of the attention kernels by dividing the computations along the sequence length dimension of the K and V tensors, allowing computations to run in parallel across multiple CUDA thread blocks. In addition, NVIDIA utilized distributed shared memory to efficiently reduce results across the thread blocks in the same thread block cluster without the need to access the global memory.
- Enabled fusions between operations to reduce the overhead between kernel executions and to cut memory loads/stores. For example, NVIDIA fused the AllReduce operation with the following RMSNorm operation and the Quantize operation into one CUDA kernel, and fused the SwiGLU operation with the GEMM preceding it (a reference sketch of that fusion follows this list).
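To make the fusion target concrete, the PyTorch reference below spells out what the fused FC13 + SwiGLU operation computes. The packed gate/up layout and the name `w13` are common conventions used here for illustration, not TensorRT-LLM identifiers; the optimized CUDA kernel produces the same result without ever writing the intermediate GEMM output to global memory.

```python
# Reference semantics of the fused FC13 + SwiGLU operation, written as separate PyTorch ops.
# The optimized CUDA kernel computes the same thing in one pass, avoiding a round trip of the
# intermediate GEMM output through global memory. The packed gate/up layout is an assumption.
import torch
import torch.nn.functional as F

def fc13_swiglu_reference(x: torch.Tensor, w13: torch.Tensor) -> torch.Tensor:
    """x: [tokens, hidden]; w13: [hidden, 2 * intermediate] with gate and up projections packed."""
    h = x @ w13                    # FC13 GEMM produces both halves in one matmul
    gate, up = h.chunk(2, dim=-1)  # split the packed projection
    return F.silu(gate) * up       # SwiGLU activation, fused into the GEMM epilogue on Blackwell

x = torch.randn(4, 64)
w13 = torch.randn(64, 2 * 128)
print(fc13_swiglu_reference(x, w13).shape)  # torch.Size([4, 128])
```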
Programmatic Dependent Launch (PDL)
Programmatic Dependent Launch (PDL) is a CUDA feature that reduces the GPU idle time between two consecutive CUDA kernels on the same stream, and even allows two CUDA kernels to overlap.
By default, when kernels are launched on the same CUDA stream, the second kernel won’t start its execution until the first kernel has completed. This results in two performance issues: First, there are tiny gaps between two consecutive kernel executions, as illustrated in the figure below, where the GPU sits idle. Second, when the first kernel execution is near its end, the kernel may still occupy some of the Streaming Multiprocessors (SMs) to execute the remaining CUDA blocks, leaving the rest of the SMs on the GPU idle. This leads to an underutilization of the GPU’s computational power.

Using the Programmatic Dependent Launch APIs in CUDA, NVIDIA allows the secondary kernel to start executing while the primary kernel is still running. During this preamble period, the secondary kernel can perform computations and load data that don’t depend on the primary kernel’s results. This not only eliminates the gaps between two consecutive kernels but also results in better GPU utilization: while the first kernel occupies only part of the SMs on the GPU, the other SMs can start running the second kernel.

Speculative decoding
Speculative decoding is a popular technique used to accelerate the inference speed of LLMs without compromising the quality of the generated text. It achieves this goal by having a smaller, faster “draft” model predict a sequence of speculative tokens, which are then verified in parallel by the larger “target” LLM. The speed-up comes from generating potentially multiple tokens in one target model iteration at the cost of extra draft model overhead.

The end-to-end workflow is illustrated in the diagram above. Initially, after a context phase by the target model (which also generates token t1), the draft model rapidly generates a sequence of potential tokens (e.g., d2-d4). The target model then enters a generation phase that verifies the entire draft sequence in parallel, producing the next token for every draft position at once. As shown, it might “accept” several tokens (like d2 and d3) if they match what it would have generated, but “reject” others (like d4).
This cycle repeats: the accepted tokens are kept, the target model provides the correct next token when a rejection occurs (like t4 after rejecting d4), and the draft model generates a new speculative sequence (d5-d7). By verifying multiple tokens in parallel instead of generating them one by one with the slower target model, and by leveraging the quick guesses of the draft model, significant speed improvements can be achieved, especially when the draft model’s predictions are often correct. Acceptance Length (AL) is defined as how many tokens you can generate, on average, with a single verification step. The higher the AL, the larger the speed-up.
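For greedy decoding, the accept/reject rule amounts to keeping the longest matching prefix of the draft; the short sketch below (with arbitrary token IDs purely for illustration) shows how the acceptance length falls out of comparing the draft tokens against the target model’s parallel predictions.

```python
# Greedy speculative-decoding verification: keep the longest prefix of draft tokens that
# matches the target model's parallel predictions, then append the target's own token
# (either the bonus token after a fully accepted draft or its correction at the first mismatch).
from typing import List, Tuple

def verify_draft(draft_tokens: List[int], target_tokens: List[int]) -> Tuple[List[int], int]:
    """target_tokens[i] is the target's prediction at the position of draft_tokens[i];
    it has one extra trailing entry, the bonus token after the last draft position."""
    accepted: List[int] = []
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        accepted.append(d)
    accepted.append(target_tokens[len(accepted)])  # the target model always contributes one token
    return accepted, len(accepted)                 # acceptance length for this verification step

# Draft length 3, arbitrary token IDs: the first two drafts match, the third is rejected.
tokens, al = verify_draft(draft_tokens=[11, 12, 99], target_tokens=[11, 12, 13, 14])
print(tokens, al)  # [11, 12, 13] 3
```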
NVIDIA uses an EAGLE-3-based architecture as its speculative decoding method, modifying only the FFN size of the speculative layer for a better AL. During inference, NVIDIA records low-, middle-, and high-level features (hidden states after the first, middle, and last decoding layers) in the forward pass of the target model. NVIDIA then combines those hidden states with the token embeddings and feeds them to the speculative layer, which generates a sequence of draft tokens autoregressively for parallel verification by the target model.
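The sketch below shows what such a draft head can look like in PyTorch. The layer sizes, the single self-attention layer, and the module structure are illustrative assumptions, not the exact speculative layer NVIDIA trained.

```python
# Schematic EAGLE-3-style draft head: fuse low/middle/high-level hidden states from the
# target model with the token embeddings, then predict draft-token logits.
# All dimensions and the single self-attention layer are illustrative assumptions.
import torch
import torch.nn as nn

class DraftHead(nn.Module):
    def __init__(self, hidden: int, vocab: int):
        super().__init__()
        # Project the concatenated [low, middle, high] target features back to the hidden size.
        self.fuse = nn.Linear(3 * hidden, hidden)
        # One small self-attention layer over (fused features + token embedding).
        self.layer = nn.TransformerEncoderLayer(d_model=2 * hidden, nhead=8, batch_first=True)
        self.lm_head = nn.Linear(2 * hidden, vocab)

    def forward(self, feats_low, feats_mid, feats_high, tok_emb):
        fused = self.fuse(torch.cat([feats_low, feats_mid, feats_high], dim=-1))
        x = torch.cat([fused, tok_emb], dim=-1)  # [batch, seq, 2 * hidden]
        x = self.layer(x)                        # speculative layer (causal masking omitted here)
        return self.lm_head(x)                   # logits used to draft the next tokens

head = DraftHead(hidden=64, vocab=1000)
b, s, h = 2, 4, 64
feats = [torch.randn(b, s, h) for _ in range(3)]
print(head(*feats, torch.randn(b, s, h)).shape)  # torch.Size([2, 4, 1000])
```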
The overhead of the speculative layer is small but not negligible, so one challenge is to find a good balance between draft length and end-to-end speed-up. The longer the draft length, the higher the AL, but so too is the cost of the additional draft model runs. Based on NVIDIA’s experiments below, a draft length of 3 provides the best speed-up.
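A back-of-envelope way to reason about that balance is sketched below. The cost model is a deliberate simplification and the example inputs are hypothetical, not NVIDIA’s measurements.

```python
# Back-of-envelope model (deliberate simplification): one verification step costs one
# target-model step plus `draft_len` draft-layer steps, and emits `al` tokens on average.
def spec_decode_speedup(al: float, draft_len: int, draft_cost_ratio: float) -> float:
    """draft_cost_ratio: cost of one draft step relative to one target step (assumed)."""
    return al / (1.0 + draft_len * draft_cost_ratio)

# Hypothetical inputs only: a longer draft raises AL but also adds draft-step cost,
# which is why the end-to-end speed-up saturates around a moderate draft length.
print(spec_decode_speedup(al=2.6, draft_len=3, draft_cost_ratio=0.06))  # ~2.2x
```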

Host overhead reduction with CUDA Graph and overlap scheduler
Another challenge of speculative decoding is reducing the communication/synchronization overhead between the target model and the draft model. If the sampling/verification logic runs on the host side, it creates extra synchronization points between host and device and breaks the CUDA Graph. Instead, NVIDIA kept the verification logic on the device side so that the target model forward pass, verification logic, and draft model forward passes are captured in one CUDA Graph. NVIDIA also enabled the TensorRT-LLM overlap scheduler to further overlap the current iteration’s model forward pass with the next iteration’s input preparation and CUDA Graph launch.
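In PyTorch terms, keeping the target forward pass, device-side verification, and draft forward passes inside one CUDA Graph follows the standard capture pattern sketched below; the step functions are placeholders, and TensorRT-LLM manages its own capture internally.

```python
# Schematic capture pattern: record the target forward pass, device-side verification,
# and draft forward passes into one CUDA Graph so each decoding iteration is replayed
# with a single launch and no host synchronization in between.
# `target_step`, `verify_on_device`, and `draft_steps` are placeholder callables that
# must read from and write to preallocated (static) tensors.
import torch

def capture_decode_step(target_step, verify_on_device, draft_steps, static_inputs):
    # Warm up on a side stream first, as required before CUDA Graph capture.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        draft_steps(verify_on_device(target_step(static_inputs)))
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_out = draft_steps(verify_on_device(target_step(static_inputs)))
    return graph, static_out

# Each decode iteration: copy fresh data into static_inputs, then call graph.replay().
```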
torch.compile() to optimize draft model layers
Since the verification logic is implemented on the device side with native torch operations, NVIDIA ended up with many small torch-native kernels. Manually fusing them can be complicated and error-prone, so NVIDIA uses torch.compile() to let OpenAI Triton automatically fuse and generate the best kernels for that part. This helped bring down the overhead of the draft model from 25% to 18% (draft length = 3).
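As a rough illustration, compiling the device-side verification logic is a one-line change; `verify_draft_tokens` below is a placeholder standing in for the real routine, not TensorRT-LLM code.

```python
# torch.compile() lets Triton fuse the many small elementwise/comparison kernels that
# make up the device-side verification into a handful of generated kernels.
# `verify_draft_tokens` is a placeholder for the real device-side routine.
import torch

def verify_draft_tokens(draft_ids: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """Return the per-sequence length of the matching prefix (accepted draft tokens)."""
    matches = (draft_ids == target_ids).to(torch.int32)      # [batch, draft_len]
    return torch.cummin(matches, dim=-1).values.sum(dim=-1)  # prefix-match length

compiled_verify = torch.compile(verify_draft_tokens)  # Triton fuses these small ops
```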
Summary
NVIDIA has once again demonstrated its leadership in data center and AI infrastructure, achieving a landmark performance of over 1,000 tokens per second per user on the 400-billion-parameter Llama 4 Maverick. This world-record speed, driven by the powerful Blackwell architecture, deep software optimization from the CUDA level up, and the significant speed-ups from NVIDIA’s tailored speculative decoding implementation, directly addresses the need for low latency in next-generation AI interactions. As NVIDIA has shown, these advancements ensure that even massive models can deliver the speed and responsiveness required for seamless, real-time user experiences and complex AI agent deployment scenarios.