Co-designed hardware, software, and models are key to delivering the highest AI factory throughput and lowest token cost. Measuring this goes far beyond peak chip specifications. Rigorous AI inference performance benchmarks are critical to understanding real-world token output, which drives AI factory revenue.
MLPerf Inference v6.0 is the latest in a series of industry benchmarks that measure performance across a wide range of model architectures and use cases. In this latest round, systems powered by NVIDIA Blackwell Ultra GPUs delivered the highest throughput across the widest range of models and scenarios. This brings the cumulative NVIDIA MLPerf training and inference wins since 2018 to 291, which is 9x the combined total of all other submitters.
This round, the NVIDIA partner ecosystem participated broadly, with 14 partners submitting on the NVIDIA platform—the largest number for any platform. ASUS, Cisco, CoreWeave, Dell Technologies, GigaComputing, Google Cloud, HPE, Lenovo, Nebius, Netweb Technology, Quanta Cloud Technology (QCT), Red Hat, Supermicro, and Lambda delivered excellent performance on the NVIDIA platform.

This post takes a closer look at the latest benchmark updates, the industry-leading performance achieved on the NVIDIA platform, and the full-stack engineering that makes it possible.
New benchmarks, new performance records
The MLPerf Inference benchmark suite is routinely updated to ensure that it reflects models, modalities, use cases, and deployment scenarios that matter to the community. Only the NVIDIA platform submitted results on all newly added models and scenarios this round, and delivered the highest performance across all of them.
This round of MLPerf Inference added several new tests, including:
- DeepSeek-R1 Interactive: Following the addition of the DeepSeek-R1 reasoning LLM, based on a sparse mixture-of-experts (MoE) architecture, in MLPerf Inference v5.1, MLCommons added a new Interactive scenario with a 5x faster minimum token rate and a 1.3x shorter time to first token compared to the Server scenario, representing higher-interactivity deployments.
- Qwen3-VL-235B-A22B: Vision-language model with 235B total parameters, and the first multimodal model in the MLPerf Inference suite. Two scenarios are tested: Offline and Server.
- GPT-OSS-120B: 120B-parameter MoE reasoning LLM developed by OpenAI. This benchmark includes three scenarios: Offline, Server, and Interactive.
- WAN-2.2-T2V-A14B: Text-to-video generative AI model with 14B active parameters. Two scenarios are tested: Single Stream, which measures the latency to process a single video generation request, and Offline, which measures the number of samples processed per second in a batch-processing scenario.
- DLRMv3: Generative recommendation benchmark that replaces the DLRM-DCNv2 test. It uses a transformer-based architecture that increases model size and compute intensity compared to the prior benchmark. Two scenarios are tested: Offline and Server.
| Benchmark | DeepSeek-R1 | GPT-OSS-120B | Qwen3-VL | Wan 2.2 | DLRMv3 |
|---|---|---|---|---|---|
| Offline | 2,494,310 tokens/sec* | 1,046,150 tokens/sec | 79 samples/sec | 0.059 samples/sec | 104,637 samples/sec |
| Server | 1,555,110 tokens/sec* | 1,096,770 tokens/sec | 68 queries/sec | 21 secs** (Single Stream) | 99,997 queries/sec |
| Interactive | 250,634 tokens/sec | 677,199 tokens/sec | *** | *** | *** |
* Not a new scenario in MLPerf Inference v6.0
** Wan 2.2 features a single stream scenario, which measures end-to-end request latency, instead of a server scenario. Lower is better.
*** Not tested in MLPerf Inference v6.0
MLPerf Inference v6.0, Closed Division. Results retrieved from www.mlcommons.org on April 1, 2026. NVIDIA platform results from the following entries: 6.0-0039, 6.0-0073, 6.0-0075, 6.0-0076, 6.0-0078, 6.0-0081, 6.0-0094. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.

NVIDIA TensorRT-LLM software updates unlock up to 2.7x performance gains on the same Blackwell Ultra GPUs
NVIDIA continually optimizes the performance of its software stack to increase delivered token throughput from existing platforms. This delivers reductions in token production cost and enables AI factory operators to serve more users to generate more revenue with a given infrastructure footprint.
The additional performance also provides headroom to run future AI models and serve existing models in demanding scenarios, such as higher token rates and longer contexts. This continual improvement makes it possible for NVIDIA GPUs introduced years ago to remain productive, at high utilization rates, in the cloud.
This round, NVIDIA GB300 NVL72—launched last year—delivered up to 2.7x higher token throughput compared to its debut submissions just six months ago on the server scenario of the DeepSeek-R1 benchmark1. This means 2.7x more tokens from the same GB300 NVL72-based infrastructure and power footprint, reducing the cost to manufacture each token by more than 60%. This speedup, achieved by NVIDIA partner Nebius, showcases a core advantage of the NVIDIA platform: an open, expansive ecosystem where customers and partners can uniquely optimize and innovate on top of our software stack.
1MLPerf Inference v5.1 and v6.0, Closed Division. Results retrieved from www.mlcommons.org on April 1, 2026. NVIDIA platform results from the following entries: 5.1-0072, 6.0-0081. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
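As a back-of-the-envelope check on the cost claim, a fixed-cost model (an assumption for illustration, since real operating costs vary) maps the 2.7x speedup to a per-token cost reduction:

```python
# Sketch: translating a throughput speedup into a per-token cost reduction,
# assuming infrastructure and power spend is fixed per unit time.
speedup = 2.7                      # v6.0 vs. v5.1 DeepSeek-R1 server throughput
cost_reduction = 1 - 1 / speedup   # same spend spread over 2.7x more tokens
print(f"per-token cost down {cost_reduction:.0%}")  # → per-token cost down 63%
```

The result matches the "more than 60%" figure cited above.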
Several software enhancements powered the DeepSeek-R1 performance improvements in the Server and Offline scenarios, including:
- Faster kernels: A combination of higher-performance kernels and fewer kernel launches through kernel fusion.
- Optimized Attention Data Parallel: Better balancing of context requests across ranks, enabling significant speedups in end-to-end performance.
The latest features of the open source NVIDIA TensorRT-LLM inference serving software and the NVIDIA Dynamo open source distributed inference serving framework were used to support the newly added and more challenging DeepSeek-R1 Interactive scenario. This includes:
- Disaggregated serving: This capability in Dynamo separates the prefill and decode phases of inference and optimizes each phase's configuration independently, enabling optimal overall throughput.
- Wide Expert Parallel (WideEP): In higher-interactivity scenarios, execution time for MoE models is bound by expert weight load time. Sharding the experts across multiple GPUs in NVL72 systems reduces this bottleneck, improving end-to-end performance.
- Multi-Token Prediction (MTP): At higher interactivity levels, batch sizes are smaller and performance is dominated by how quickly weights can be loaded into memory, leaving compute underutilized. Applying that otherwise idle compute to predict and verify additional tokens in parallel (up to three in this implementation) increases throughput at high interactivity.
- KV-aware routing: This capability of Dynamo routes inference requests by evaluating their compute costs across different workers.
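The MTP idea can be sketched in a few lines. This is a toy illustration only, with hypothetical stand-in functions rather than the TensorRT-LLM or Dynamo API; the draft head and target model are replaced by trivial counters so the accept/reject logic is visible:

```python
def draft_tokens(prefix, k):
    """Toy draft head: counts upward mod 10, but guesses wrong at step 3."""
    cur, out = prefix[-1], []
    for i in range(k):
        cur = (cur + 1) % 10
        out.append(cur if i < 2 else (cur + 5) % 10)  # inject a mismatch
    return out

def verify_tokens(prefix, drafts):
    """Toy target model: the 'true' continuation counts upward mod 10.
    One call scores every draft position at once (the parallel verify step
    that soaks up compute left idle at small batch sizes)."""
    cur, out = prefix[-1], []
    for _ in drafts:
        cur = (cur + 1) % 10
        out.append(cur)
    return out

def mtp_step(prefix, k=3):
    """Accept the longest matching draft run, plus one corrected token."""
    drafts = draft_tokens(prefix, k)
    target = verify_tokens(prefix, drafts)
    accepted = []
    for d, t in zip(drafts, target):
        accepted.append(t)   # the target's token is always valid output
        if d != t:           # first mismatch ends the accepted run
            break
    return prefix + accepted

print(mtp_step([0]))  # one step emits 3 tokens instead of 1: [0, 1, 2, 3]
```

One weight-load-bound target pass here yields up to k+1 output tokens instead of one, which is why MTP pays off exactly when memory bandwidth, not compute, is the limiter.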
NVIDIA was the first and only platform to submit DeepSeek-R1 results on MLPerf Inference when the benchmark debuted last year. This round, NVIDIA not only increased performance on returning scenarios for DeepSeek-R1 but was once again the only platform to submit on the newly added Interactive scenario.
And even on Llama 3.1 405B—a very large, dense LLM launched almost two years ago—GB300 NVL72 performance increased by 1.5x in the Server scenario.
| Benchmark | GB300 NVL72 v5.1 | GB300 NVL72 v6.0 | Speedup |
|---|---|---|---|
| DeepSeek-R1 (Server) | 2,907 tokens/sec/GPU | 8,064 tokens/sec/GPU | 2.77x |
| DeepSeek-R1 (Offline) | 5,842 tokens/sec/GPU | 9,821 tokens/sec/GPU | 1.68x |
| Llama 3.1 405B (Server) | 170 tokens/sec/GPU | 259 tokens/sec/GPU | 1.52x |
| Llama 3.1 405B (Offline) | 224 tokens/sec/GPU | 271 tokens/sec/GPU | 1.21x |
MLPerf Inference v5.1 and v6.0, Closed Division. Results retrieved from www.mlcommons.org on April 1, 2026. NVIDIA platform results from the following entries: 5.1-0072, 6.0-0017, 6.0-0078, 6.0-0082. Per chip performance is derived by dividing total throughput by the number of reported chips. Per-chip performance is not a primary metric of MLPerf Inference v5.1 or v6.0. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
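The speedup column follows directly from the per-GPU figures (which, per the note above, are total throughput divided by the number of reported chips). A quick recomputation, with values copied from the table:

```python
# Recompute the table's speedups from the per-GPU throughput figures.
results = {                         # (v5.1, v6.0) tokens/sec/GPU
    "DeepSeek-R1 (Server)":     (2907, 8064),
    "DeepSeek-R1 (Offline)":    (5842, 9821),
    "Llama 3.1 405B (Server)":  (170, 259),
    "Llama 3.1 405B (Offline)": (224, 271),
}
speedups = {name: v60 / v51 for name, (v51, v60) in results.items()}
for name, s in speedups.items():
    print(f"{name}: {s:.2f}x")      # matches the 2.77x / 1.68x / 1.52x / 1.21x column
```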
Additionally, NVIDIA submissions on the newly added multimodal, video generation, and recommendation benchmarks were powered by open source software frameworks optimized for the NVIDIA platform. The Qwen3-VL vision-language submission used the vLLM open source framework, showing how the community is rapidly building advanced multimodal optimizations to accelerate image-heavy inference workloads on the latest GPUs like NVIDIA Blackwell Ultra. The WAN-2.2 text-to-video submission used TensorRT-LLM VisualGen, which accelerates diffusion-based video generation pipelines on NVIDIA GPUs.
For DLRMv3, the submission was built on two open-source projects: the NVIDIA recsys-example for high-performance transformer-based recommendation inference, and NV Embedding Cache for GPU-accelerated embedding table lookups. Both were critical to achieving record throughput on this more demanding generative recommendation benchmark.
Through extensive and ongoing engineering, NVIDIA continually increases the performance of existing models on existing hardware, as these results show. At the same time, NVIDIA collaborates closely with model builders and open source inference frameworks to ensure that the latest models run on the NVIDIA platform on the day of launch.
Scale-out inference with NVIDIA Quantum-X800 InfiniBand platform enables millions of tokens per second
NVIDIA also set new throughput records at scale on the DeepSeek-R1 model in the offline and server scenarios by submitting results using four GB300 NVL72 systems interconnected with NVIDIA Quantum-X800 InfiniBand scale-out networking.
| DeepSeek-R1 (4x GB300 NVL72) | Tokens/sec |
|---|---|
| Offline | 2,494,310 |
| Server | 1,555,110 |
MLPerf Inference v6.0, Closed Division. Results retrieved from www.mlcommons.org on April 1, 2026. NVIDIA platform results from the following entries: 6.0-0076. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
With 288 Blackwell Ultra GPUs—the largest scale ever submitted to any benchmark in MLPerf Inference—these submissions set new system-level throughput records, processing millions of tokens per second.
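Dividing the aggregate rates by the GPU count (72 GPUs per GB300 NVL72 rack, so 288 across four systems) gives a rough per-GPU view of the scale-out submission:

```python
# Back out approximate per-GPU rates at scale from the aggregate results.
gpus = 4 * 72               # four GB300 NVL72 systems = 288 Blackwell Ultra GPUs
offline_total = 2_494_310   # DeepSeek-R1 Offline tokens/sec
server_total  = 1_555_110   # DeepSeek-R1 Server tokens/sec
print(round(offline_total / gpus))  # → 8661 tokens/sec/GPU
print(round(server_total / gpus))   # → 5400 tokens/sec/GPU
```

Note this is an informal derivation; per-chip performance is not a primary MLPerf metric.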
Looking ahead to MLPerf Endpoints
Delivering inference throughput requires extreme co-design across chips, system architecture, data center design, and software. The latest MLPerf Inference v6.0 results show that the NVIDIA platform delivers unmatched inference throughput across the broadest range of workloads, from massive LLMs to advanced vision-language models to generative recommender systems, on industry-standard benchmarks.
AI inference workloads also continue to evolve rapidly, as model sizes grow and context lengths rise. As agentic AI becomes more prevalent, premium use cases that require ultra-fast token rates are emerging.
NVIDIA has been working, as part of the MLCommons consortium, to lead the definition of the MLPerf Endpoints benchmark. MLPerf Endpoints will give the community an auditable picture of how deployed services perform under real API traffic, capturing key performance metrics that chip-level benchmarks alone cannot reveal, with the rigor and result integrity that define MLPerf benchmarks.
To explore the latest performance on the NVIDIA platform across training, inference, and high-performance computing, please see our deep learning product performance page.
Acknowledgements
NVIDIA MLPerf Inference v6.0 results reflect the work of many talented engineers across the company. We’d like to acknowledge the contributions of the following individuals (sorted by last name):
Tomar Bar-on, Nitin Sai Bommi, Viraat Chandra, Alice Cheng, Jerry Chen, Xiaoming Chen, Jesus Corbal San Adrian, Ashutosh Dhar, Kefeng Duan, Wookje Han, Kyle Huang, Kris Hung, Rashid Kaleem, Khubaib Khubaib, Zihao Kong, Tin-Yin Lai, Tao Li, Forrest Lin, Wanqian Li, Alex Liu, Jintao Peng, Yuxian Qiu, Junyi Qiu, Xiaowei Shi, Olivia Stoner, Jacob Subag, Tong Tong, Harshil Vagadia, Shobhit Verma, June Yang, Tailing Yuan, Ben Zhang… and many others across NVIDIA whose efforts made these results possible.