Co-designed hardware, software, and models are key to delivering the highest AI factory throughput and lowest token cost. Measuring this goes far beyond peak chip specifications. Rigorous AI inference performance benchmarks are critical to understanding real-world token output, which drives AI factory revenue.
MLPerf Inference v6.0 is the latest in a series of industry benchmarks that measure performance across a wide range of model architectures and use cases. In this latest round, systems powered by NVIDIA Blackwell Ultra GPUs delivered the highest throughput across the widest range of models and scenarios. This brings the cumulative NVIDIA MLPerf training and inference wins since 2018 to 291, which is 9x the combined total of all other submitters.
This round, the NVIDIA partner ecosystem participated broadly, with 14 partners submitting on the NVIDIA platform—the largest number for any platform. ASUS, Cisco, CoreWeave, Dell Technologies, GigaComputing, Google Cloud, HPE, Lenovo, Nebius, Netweb Technology, Quanta Cloud Technology (QCT), Red Hat, Supermicro, and Lambda delivered excellent performance on the NVIDIA platform.

This post takes a closer look at the latest benchmark updates, the industry-leading performance achieved on the NVIDIA platform, and the full-stack engineering that makes it possible.
New benchmarks, new performance records
The MLPerf Inference benchmark suite is routinely updated to ensure that it reflects models, modalities, use cases, and deployment scenarios that matter to the community. Only the NVIDIA platform submitted results on all newly added models and scenarios this round, and delivered the highest performance across all of them.
This round of MLPerf Inference added several new tests, including:
- DeepSeek-R1 Interactive: Following the addition of the DeepSeek-R1 reasoning LLM, based on a sparse mixture-of-experts (MoE) architecture, in MLPerf Inference v5.1, MLCommons added a new Interactive scenario with a 5x faster minimum token rate and a 1.3x shorter time to first token compared to the Server scenario, representing higher-interactivity deployments.
- Qwen3-VL-235B-A22B: Vision-language model with 235B total parameters, and the first multimodal model in the MLPerf Inference suite. Two scenarios are tested: Offline and Server.
- GPT-OSS-120B: 120B-parameter MoE reasoning LLM developed by OpenAI. This benchmark includes three scenarios: Offline, Server, and Interactive.
- WAN-2.2-T2V-A14B: Text-to-video generative AI model with 14B active parameters. Two scenarios are tested: Single Stream, which measures the latency to process a single video generation request, and Offline, which measures the number of samples processed per second in a batch-processing scenario.
- DLRMv3: Generative recommendation benchmark that replaces the DLRM-DCNv2 test. It uses a transformer-based architecture that increases model size and compute intensity compared to the prior benchmark. Two scenarios are tested: Offline and Server.
| Benchmark | DeepSeek-R1 | GPT-OSS-120B | Qwen3-VL | Wan 2.2 | DLRMv3 |
|---|---|---|---|---|---|
| Offline | 2,494,310 tokens/sec* | 1,046,150 tokens/sec | 79 samples/sec | 0.059 samples/sec | 104,637 samples/sec |
| Server | 1,555,110 tokens/sec* | 1,096,770 tokens/sec | 68 queries/sec | 21 secs** (Single Stream) | 99,997 queries/sec |
| Interactive | 250,634 tokens/sec | 677,199 tokens/sec | *** | *** | *** |
* Not a new scenario in MLPerf Inference v6.0
** Wan 2.2 features a single stream scenario, which measures end-to-end request latency, instead of a server scenario. Lower is better.
*** Not tested in MLPerf Inference v6.0
MLPerf Inference v6.0, Closed Division. Results retrieved from www.mlcommons.org on April 1, 2026. NVIDIA platform results from the following entries: 6.0-0039, 6.0-0073, 6.0-0075, 6.0-0076, 6.0-0078, 6.0-0081, 6.0-0094. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.

NVIDIA TensorRT-LLM software updates unlock up to 2.7x performance gains on the same Blackwell Ultra GPUs
NVIDIA continually optimizes the performance of its software stack to increase delivered token throughput from existing platforms. This delivers reductions in token production cost and enables AI factory operators to serve more users to generate more revenue with a given infrastructure footprint.
The additional performance also provides headroom to run future AI models and serve existing models in demanding scenarios, such as higher token rates and longer contexts. This continual improvement makes it possible for NVIDIA GPUs introduced years ago to remain productive, at high utilization rates, in the cloud.
This round, NVIDIA GB300 NVL72—launched last year—delivered up to 2.7x higher token throughput compared to its debut submissions just six months ago on the server scenario of the DeepSeek-R1 benchmark1. This means 2.7x more tokens from the same GB300 NVL72-based infrastructure and power footprint, reducing the cost to manufacture each token by more than 60%. This speedup, achieved by NVIDIA partner Nebius, showcases a core advantage of the NVIDIA platform: an open, expansive ecosystem where customers and partners can uniquely optimize and innovate on top of our software stack.
1MLPerf Inference v5.1 and v6.0, Closed Division. Results retrieved from www.mlcommons.org on April 1, 2026. NVIDIA platform results from the following entries: 5.1-0072, 6.0-0081. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
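As a back-of-the-envelope check on the cost claim, a fixed-cost model (an assumption for illustration, since real operating costs vary) maps the 2.7x speedup to a per-token cost reduction:

```python
# Sketch: translating a throughput speedup into a per-token cost reduction,
# assuming infrastructure and power spend is fixed per unit time.
speedup = 2.7                      # v6.0 vs. v5.1 DeepSeek-R1 server throughput
cost_reduction = 1 - 1 / speedup   # same spend spread over 2.7x more tokens
print(f"per-token cost down {cost_reduction:.0%}")  # → per-token cost down 63%
```

The result matches the "more than 60%" figure cited above.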
Several software enhancements powered the DeepSeek-R1 performance improvements in the Server and Offline scenarios, including:
- Faster kernels: A combination of higher-performance kernels and fewer kernel launches through kernel fusion.
- Optimized Attention Data Parallel: Better balancing of context requests across ranks, enabling significant speedups in end-to-end performance.
The latest features of the open source NVIDIA TensorRT-LLM inference serving software and the NVIDIA Dynamo open source distributed inference serving framework were used to support the newly added and more challenging DeepSeek-R1 Interactive scenario. This includes:
- Disaggregated serving: This capability in Dynamo separates the prefill and decode phases of inference and optimizes each phase's configuration independently, enabling optimal overall throughput.
- Wide Expert Parallel (WideEP): In higher-interactivity scenarios, execution time for MoE models is bound by expert weight load time. Sharding the experts across multiple GPUs in NVL72 systems reduces this bottleneck, improving end-to-end performance.
- Multi-Token Prediction (MTP): At higher interactivity levels, batch sizes are smaller and performance is dominated by how quickly weights can be loaded into memory, leaving compute underutilized. Applying that otherwise idle compute to predict and verify additional tokens in parallel (up to three in this implementation) increases throughput at high interactivity.
- KV-aware routing: This capability of Dynamo routes inference requests by evaluating their compute costs across different workers.
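The MTP idea can be sketched in a few lines. This is a toy illustration only, with hypothetical stand-in functions rather than the TensorRT-LLM or Dynamo API; the draft head and target model are replaced by trivial counters so the accept/reject logic is visible:

```python
def draft_tokens(prefix, k):
    """Toy draft head: counts upward mod 10, but guesses wrong at step 3."""
    cur, out = prefix[-1], []
    for i in range(k):
        cur = (cur + 1) % 10
        out.append(cur if i < 2 else (cur + 5) % 10)  # inject a mismatch
    return out

def verify_tokens(prefix, drafts):
    """Toy target model: the 'true' continuation counts upward mod 10.
    One call scores every draft position at once (the parallel verify step
    that soaks up compute left idle at small batch sizes)."""
    cur, out = prefix[-1], []
    for _ in drafts:
        cur = (cur + 1) % 10
        out.append(cur)
    return out

def mtp_step(prefix, k=3):
    """Accept the longest matching draft run, plus one corrected token."""
    drafts = draft_tokens(prefix, k)
    target = verify_tokens(prefix, drafts)
    accepted = []
    for d, t in zip(drafts, target):
        accepted.append(t)   # the target's token is always valid output
        if d != t:           # first mismatch ends the accepted run
            break
    return prefix + accepted

print(mtp_step([0]))  # one step emits 3 tokens instead of 1: [0, 1, 2, 3]
```

One weight-load-bound target pass here yields up to k+1 output tokens instead of one, which is why MTP pays off exactly when memory bandwidth, not compute, is the limiter.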
NVIDIA was the first and only platform to submit DeepSeek-R1 results on MLPerf Inference when the benchmark debuted last year. This round, NVIDIA not only increased performance on returning scenarios for DeepSeek-R1 but was once again the only platform to submit on the newly added Interactive scenario.
And even on Llama 3.1 405B—a very large, dense LLM launched almost two years ago—GB300 NVL72 performance increased by 1.5x in the Server scenario.
| Benchmark | GB300 NVL72 v5.1 | GB300 NVL72 v6.0 | Speedup |
|---|---|---|---|
| DeepSeek-R1 (Server) | 2,907 tokens/sec/GPU | 8,064 tokens/sec/GPU | 2.77x |
| DeepSeek-R1 (Offline) | 5,842 tokens/sec/GPU | 9,821 tokens/sec/GPU | 1.68x |
| Llama 3.1 405B (Server) | 170 tokens/sec/GPU | 259 tokens/sec/GPU | 1.52x |
| Llama 3.1 405B (Offline) | 224 tokens/sec/GPU | 271 tokens/sec/GPU | 1.21x |
MLPerf Inference v5.1 and v6.0, Closed Division. Results retrieved from www.mlcommons.org on April 1, 2026. NVIDIA platform results from the following entries: 5.1-0072, 6.0-0017, 6.0-0078, 6.0-0082. Per chip performance is derived by dividing total throughput by the number of reported chips. Per-chip performance is not a primary metric of MLPerf Inference v5.1 or v6.0. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
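The speedup column follows directly from the per-GPU figures (which, per the note above, are total throughput divided by the number of reported chips). A quick recomputation, with values copied from the table:

```python
# Recompute the table's speedups from the per-GPU throughput figures.
results = {                         # (v5.1, v6.0) tokens/sec/GPU
    "DeepSeek-R1 (Server)":     (2907, 8064),
    "DeepSeek-R1 (Offline)":    (5842, 9821),
    "Llama 3.1 405B (Server)":  (170, 259),
    "Llama 3.1 405B (Offline)": (224, 271),
}
speedups = {name: v60 / v51 for name, (v51, v60) in results.items()}
for name, s in speedups.items():
    print(f"{name}: {s:.2f}x")      # matches the 2.77x / 1.68x / 1.52x / 1.21x column
```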
Additionally, NVIDIA submissions on the newly added multimodal, video generation, and recommendation benchmarks were powered by open source software frameworks optimized for the NVIDIA platform. The Qwen3-VL vision-language submission used the vLLM open source framework, showing how the community is rapidly building advanced multimodal optimizations to accelerate image-heavy inference workloads on the latest GPUs like NVIDIA Blackwell Ultra. The WAN-2.2 text-to-video submission used TensorRT-LLM VisualGen, which accelerates diffusion-based video generation pipelines on NVIDIA GPUs.
For DLRMv3, the submission was built on two open-source projects: the NVIDIA recsys-example for high-performance transformer-based recommendation inference, and NV Embedding Cache for GPU-accelerated embedding table lookups. Both were critical to achieving record throughput on this more demanding generative recommendation benchmark.
Through extensive and ongoing engineering, NVIDIA continually increases the performance of existing models on existing hardware, as these results show. At the same time, NVIDIA collaborates closely with model builders and open source inference frameworks to ensure that the latest models run on the NVIDIA platform on the day of launch.
Scale-out inference with NVIDIA Quantum-X800 InfiniBand platform enables millions of tokens per second
NVIDIA also set new throughput records at scale on the DeepSeek-R1 model in the offline and server scenarios by submitting results using four GB300 NVL72 systems interconnected with NVIDIA Quantum-X800 InfiniBand scale-out networking.
| DeepSeek-R1 (4x GB300 NVL72) | Tokens/sec |
|---|---|
| Offline | 2,494,310 |
| Server | 1,555,110 |
MLPerf Inference v6.0, Closed Division. Results retrieved from www.mlcommons.org on April 1, 2026. NVIDIA platform results from the following entries: 6.0-0076. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
With 288 Blackwell Ultra GPUs—the largest scale ever submitted to any benchmark in MLPerf Inference—these submissions set new system-level throughput records, processing millions of tokens per second.
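Dividing the aggregate rates by the GPU count (72 GPUs per GB300 NVL72 rack, so 288 across four systems) gives a rough per-GPU view of the scale-out submission:

```python
# Back out approximate per-GPU rates at scale from the aggregate results.
gpus = 4 * 72               # four GB300 NVL72 systems = 288 Blackwell Ultra GPUs
offline_total = 2_494_310   # DeepSeek-R1 Offline tokens/sec
server_total  = 1_555_110   # DeepSeek-R1 Server tokens/sec
print(round(offline_total / gpus))  # → 8661 tokens/sec/GPU
print(round(server_total / gpus))   # → 5400 tokens/sec/GPU
```

Note this is an informal derivation; per-chip performance is not a primary MLPerf metric.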
Looking ahead to MLPerf Endpoints
Delivering inference throughput requires extreme co-design across chips, system architecture, data center design, and software. The latest MLPerf Inference v6.0 results show that the NVIDIA platform delivers unmatched inference throughput across the broadest range of workloads, from massive LLMs to advanced vision-language models to generative recommender systems, on industry-standard benchmarks.
AI inference workloads also continue to evolve rapidly, as model sizes grow and context lengths rise. As agentic AI becomes more prevalent, premium use cases that require ultra-fast token rates are emerging.
NVIDIA has been working, as part of the MLCommons consortium, to lead the definition of the MLPerf Endpoints benchmark. MLPerf Endpoints will give the community an auditable picture of how deployed services perform under real API traffic, capturing key performance metrics that chip-level benchmarks alone cannot reveal, with the rigor and result integrity that define MLPerf benchmarks.
To explore the latest performance on the NVIDIA platform across training, inference, and high-performance computing, please see our deep learning product performance page.
Acknowledgements
NVIDIA MLPerf Inference v6.0 results reflect the work of many talented engineers across the company. We’d like to acknowledge the contributions of the following individuals (sorted by last name):
Tomar Bar-on, Nitin Sai Bommi, Viraat Chandra, Alice Cheng, Jerry Chen, Xiaoming Chen, Jesus Corbal San Adrian, Ashutosh Dhar, Kefeng Duan, Wookje Han, Kyle Huang, Kris Hung, Rashid Kaleem, Khubaib Khubaib, Zihao Kong, Tin-Yin Lai, Tao Li, Forrest Lin, Wanqian Li, Alex Liu, Jintao Peng, Yuxian Qiu, Junyi Qiu, Xiaowei Shi, Olivia Stoner, Jacob Subag, Tong Tong, Harshil Vagadia, Shobhit Verma, June Yang, Tailing Yuan, Ben Zhang… and many others across NVIDIA whose efforts made these results possible.