AI Inference

Raw speed is no longer the only metric that matters. Inference is now judged on throughput, efficiency, and economics at scale. As AI moves from one-shot answers to multi-step reasoning, each query generates far more tokens, which drives up compute demand and makes the economics of inference more important. Metrics such as tokens per watt, cost per million tokens, and tokens per second per user therefore matter alongside throughput. For power-limited AI factories, NVIDIA's continuous software improvements translate into higher token revenue over time.
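As a rough illustration of how the metrics named above relate to one another, the sketch below derives them from a handful of measurements. Every name and number in it (`InferenceRun`, `gpu_hour_cost_usd`, the example values) is a hypothetical placeholder chosen for the arithmetic, not an NVIDIA figure.

```python
from dataclasses import dataclass


@dataclass
class InferenceRun:
    """Hypothetical measurements from one serving run (illustrative only)."""
    output_tokens: int        # total tokens generated during the run
    duration_s: float         # wall-clock duration in seconds
    avg_power_w: float        # average system power draw in watts
    concurrent_users: int     # users served at the same time
    num_gpus: int             # GPUs used by the deployment
    gpu_hour_cost_usd: float  # assumed cost of one GPU for one hour


def tokens_per_second(run: InferenceRun) -> float:
    return run.output_tokens / run.duration_s


def tokens_per_second_per_user(run: InferenceRun) -> float:
    return tokens_per_second(run) / run.concurrent_users


def tokens_per_second_per_watt(run: InferenceRun) -> float:
    # tokens/sec divided by watts is the same as tokens per joule
    return tokens_per_second(run) / run.avg_power_w


def cost_per_million_tokens(run: InferenceRun) -> float:
    run_cost = run.num_gpus * run.gpu_hour_cost_usd * run.duration_s / 3600
    return run_cost / run.output_tokens * 1_000_000


# Made-up example: 36M tokens in one hour on 8 GPUs drawing 10 kW in total.
example_run = InferenceRun(
    output_tokens=36_000_000,
    duration_s=3600.0,
    avg_power_w=10_000.0,
    concurrent_users=100,
    num_gpus=8,
    gpu_hour_cost_usd=3.0,
)

print(f"throughput:   {tokens_per_second(example_run):.0f} tokens/sec")
print(f"per user:     {tokens_per_second_per_user(example_run):.0f} tokens/sec/user")
print(f"per watt:     {tokens_per_second_per_watt(example_run):.2f} tokens/sec/watt")
print(f"cost per 1M:  ${cost_per_million_tokens(example_run):.2f}")
```

Holding the hardware and power budget fixed, any software change that raises `output_tokens` for the same run improves all four numbers at once, which is the economic point the paragraph above makes.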

Pareto curves illustrate how NVIDIA Blackwell provides the best balance across the full spectrum of production priorities, including cost, energy efficiency, throughput, and responsiveness. Optimizing systems for a single scenario can limit deployment flexibility, leading to inefficiencies at other points on the curve. NVIDIA’s full-stack design approach ensures efficiency and value across multiple real-life production scenarios. Blackwell’s leadership stems from its extreme hardware-software co-design, embodying a full-stack architecture built for speed, efficiency, and scalability.
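For readers unfamiliar with the term, a Pareto curve here is the set of serving configurations for which no other configuration is better on both axes at once. The sketch below extracts that set from a list of (tokens/sec/user, tokens/sec/GPU) pairs. The data points and the function name `pareto_frontier` are invented for illustration and do not come from any NVIDIA measurement.

```python
def pareto_frontier(points):
    """Return the points not dominated by any other point.

    Each point is (interactivity, throughput); higher is better on both axes.
    """
    frontier = []
    best_throughput = float("-inf")
    # Walk from the most interactive configuration to the least. A point is
    # kept only if it beats the throughput of every more interactive point.
    for interactivity, throughput in sorted(points, key=lambda p: (-p[0], -p[1])):
        if throughput > best_throughput:
            frontier.append((interactivity, throughput))
            best_throughput = throughput
    return frontier


# Hypothetical configurations: (tokens/sec/user, tokens/sec/GPU)
configs = [(10, 900), (50, 600), (100, 300), (40, 500), (100, 250), (200, 50)]

for interactivity, throughput in pareto_frontier(configs):
    print(f"{interactivity:4d} tokens/sec/user -> {throughput:4d} tokens/sec/GPU")
```

In this made-up data, (40, 500) drops out because (50, 600) is better on both axes. A system tuned for only one end of the curve would contribute a single good point and fall below the frontier elsewhere.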

Learn how mixture of experts powers frontier models in the blog "Mixture of Experts Powers the Most Intelligent Frontier AI Models, Runs 10x Faster on NVIDIA Blackwell NVL72."

Explore the methodology used to obtain these results and learn how to replicate the tests by executing Benchmarking Recipes yourself.

MLPerf Inference v5.1 Performance Benchmarks

Offline Scenario, Closed Division

Network Throughput GPU Server GPU Version Target Accuracy Dataset
DeepSeek R1 420,659 tokens/sec 72x GB300 72x GB300-288GB_aarch64, TensorRT NVIDIA GB300 99% of FP16 (exact match 81.9132%) mlperf_deepseek_r1
289,712 tokens/sec 72x GB200 72x GB200-186GB_aarch64, TensorRT NVIDIA GB200 99% of FP16 (exact match 81.9132%) mlperf_deepseek_r1
33,379 tokens/sec 8x B200 NVIDIA DGX B200 NVIDIA B200 99% of FP16 (exact match 81.9132%) mlperf_deepseek_r1
Llama3.1 405B 16,104 tokens/sec 72x GB300 72x GB300-288GB_aarch64, TensorRT NVIDIA GB300 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335) Subset of LongBench, LongDataCollections, Ruler, GovReport
14,774 tokens/sec 72x GB200 72x GB200-186GB_aarch64, TensorRT NVIDIA GB200 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335) Subset of LongBench, LongDataCollections, Ruler, GovReport
1,660 tokens/sec 8x B200 Dell PowerEdge XE9685L NVIDIA B200 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335) Subset of LongBench, LongDataCollections, Ruler, GovReport
553 tokens/sec 8x H200 Nebius H200 NVIDIA H200 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335) Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama2 70B 51,737 tokens/sec 4x GB200 4x GB200-186GB_aarch64, TensorRT NVIDIA GB200 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) OpenOrca (max_seq_len=1024)
102,909 tokens/sec 8x B200 ThinkSystem SR680a V3 NVIDIA B200 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) OpenOrca (max_seq_len=1024)
35,317 tokens/sec 8x H200 Dell PowerEdge XE9680 NVIDIA H200 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) OpenOrca (max_seq_len=1024)
Llama3.1 8B 146,960 tokens/sec 8x B200 ThinkSystem SR780a V3 NVIDIA B200 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) CNN Dailymail (v3.0.0, max_seq_len=2048)
66,037 tokens/sec 8x H200 HPE Cray XD670 NVIDIA H200 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) CNN Dailymail (v3.0.0, max_seq_len=2048)
Whisper 22,273 samples/sec 4x GB200 BM.GPU.GB200.4 NVIDIA GB200 99% of FP32 and 99.9% of FP32 (WER=2.0671%) LibriSpeech
45,333 samples/sec 8x B200 NVIDIA DGX B200 NVIDIA B200 99% of FP32 and 99.9% of FP32 (WER=2.0671%) LibriSpeech
34,451 samples/sec 8x H200 HPE Cray XD670 NVIDIA H200 99% of FP32 and 99.9% of FP32 (WER=2.0671%) LibriSpeech
Stable Diffusion XL 33 samples/sec 8x B200 NVIDIA DGX B200 NVIDIA B200 FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] Subset of coco-2014 val
19 samples/sec 8x H200 QuantaGrid D74H-7U NVIDIA H200 FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] Subset of coco-2014 val
RGAT 651,230 samples/sec 8x B200 NVIDIA DGX B200 NVIDIA B200 99% of FP32 (72.86%) IGBH
RetinaNet 14,997 samples/sec 8x H200 HPE Cray XD670 NVIDIA H200 99% of FP32 (0.3755 mAP) OpenImages (800x800)
DLRMv2 647,861 samples/sec 8x H200 QuantaGrid D74H-7U NVIDIA H200 99% of FP32 and 99.9% of FP32 (AUC=80.31%) Synthetic Multihot Criteo Dataset
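
The systems in the table above range from 4 to 72 GPUs, so the raw throughput figures are not directly comparable. One simple way to read them, sketched below, is to divide each result by its GPU count. This is a naive normalization, not an official MLPerf metric, and it ignores differences in interconnect, memory capacity, and submitter software. The row values are copied from the table; the variable and function names are illustrative.

```python
# (network, GPU type, GPU count, tokens/sec) copied from the Offline table
OFFLINE_ROWS = [
    ("DeepSeek R1", "GB300", 72, 420_659),
    ("DeepSeek R1", "GB200", 72, 289_712),
    ("DeepSeek R1", "B200", 8, 33_379),
    ("Llama2 70B", "GB200", 4, 51_737),
    ("Llama2 70B", "B200", 8, 102_909),
    ("Llama2 70B", "H200", 8, 35_317),
]


def per_gpu_throughput(rows):
    """Map (network, GPU type) to system throughput divided by GPU count."""
    return {(net, gpu): total / count for net, gpu, count, total in rows}


per_gpu = per_gpu_throughput(OFFLINE_ROWS)
for (net, gpu), value in per_gpu.items():
    print(f"{net:12s} {gpu:6s} {value:10.1f} tokens/sec/GPU")
```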

Server Scenario, Closed Division

Network Throughput GPU Server GPU Version Target Accuracy MLPerf Server Latency Constraints (ms) Dataset
DeepSeek R1 209,328 tokens/sec 72x GB300 72x GB300-288GB_aarch64, TensorRT NVIDIA GB300 99% of FP16 (exact match 81.9132%) TTFT/TPOT: 2000 ms/80 ms mlperf_deepseek_r1
167,578 tokens/sec 72x GB200 72x GB200-186GB_aarch64, TensorRT NVIDIA GB200 99% of FP16 (exact match 81.9132%) TTFT/TPOT: 2000 ms/80 ms mlperf_deepseek_r1
18,592 tokens/sec 8x B200 NVIDIA DGX B200 NVIDIA B200 99% of FP16 (exact match 81.9132%) TTFT/TPOT: 2000 ms/80 ms mlperf_deepseek_r1
Llama3.1 405B 12,248 tokens/sec 72x GB300 72x GB300-288GB_aarch64, TensorRT NVIDIA GB300 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335) TTFT/TPOT: 6000 ms/175 ms Subset of LongBench, LongDataCollections, Ruler, GovReport
11,614 tokens/sec 72x GB200 72x GB200-186GB_aarch64, TensorRT NVIDIA GB200 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335) TTFT/TPOT: 6000 ms/175 ms Subset of LongBench, LongDataCollections, Ruler, GovReport
1,280 tokens/sec 8x B200 Nebius B200 NVIDIA B200 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335) TTFT/TPOT: 6000 ms/175 ms Subset of LongBench, LongDataCollections, Ruler, GovReport
296 tokens/sec 8x H200 QuantaGrid D74H-7U NVIDIA H200 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335) TTFT/TPOT: 6000 ms/175 ms Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama3.1 405B Interactive 9,921 tokens/sec 72x GB200 72x GB200-186GB_aarch64, TensorRT NVIDIA GB200 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335) TTFT/TPOT: 4500 ms/80 ms Subset of LongBench, LongDataCollections, Ruler, GovReport
771 tokens/sec 8x B200 Nebius B200 NVIDIA B200 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335) TTFT/TPOT: 4500 ms/80 ms Subset of LongBench, LongDataCollections, Ruler, GovReport
203 tokens/sec 8x H200 Nebius H200 NVIDIA H200 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335) TTFT/TPOT: 4500 ms/80 ms Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama2 70B 49,360 tokens/sec 4x GB200 4x GB200-186GB_aarch64, TensorRT NVIDIA GB200 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) TTFT/TPOT: 2000 ms/200 ms OpenOrca (max_seq_len=1024)
101,611 tokens/sec 8x B200 Nebius B200 NVIDIA B200 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) TTFT/TPOT: 2000 ms/200 ms OpenOrca (max_seq_len=1024)
34,194 tokens/sec 8x H200 ASUSTeK ESC N8 H200 NVIDIA H200 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) TTFT/TPOT: 2000 ms/200 ms OpenOrca (max_seq_len=1024)
Llama2 70B Interactive 29,746 tokens/sec 4x GB200 4x GB200-186GB_aarch64, TensorRT NVIDIA GB200 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) TTFT/TPOT: 450 ms/40 ms OpenOrca (max_seq_len=1024)
62,851 tokens/sec 8x B200 G894-SD1 NVIDIA B200 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) TTFT/TPOT: 450 ms/40 ms OpenOrca (max_seq_len=1024)
23,080 tokens/sec 8x H200 Nebius H200 NVIDIA H200 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) TTFT/TPOT: 450 ms/40 ms OpenOrca (max_seq_len=1024)
Llama3.1 8B 128,794 tokens/sec 8x B200 Dell PowerEdge XE9685L NVIDIA B200 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) TTFT/TPOT: 2000 ms/100 ms OpenOrca (max_seq_len=1024)
64,915 tokens/sec 8x H200 HPE Cray XD670 NVIDIA H200 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) TTFT/TPOT: 2000 ms/100 ms OpenOrca (max_seq_len=1024)
Llama3.1 8B Interactive 122,269 tokens/sec 8x B200 AS-4126GS-NBR-LCC NVIDIA B200 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) TTFT/TPOT: 500 ms/30 ms CNN Dailymail (v3.0.0, max_seq_len=2048)
54,118 tokens/sec 8x H200 QuantaGrid D74H-7U NVIDIA H200 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) TTFT/TPOT: 500 ms/30 ms CNN Dailymail (v3.0.0, max_seq_len=2048)
Stable Diffusion XL 29 queries/sec 8x B200 Supermicro SYS-422GA-NBRT-LCC NVIDIA B200 FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] 20 s Subset of coco-2014 val
18 queries/sec 8x H200 QuantaGrid D74H-7U NVIDIA H200 FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] 20 s Subset of coco-2014 val
RetinaNet 14,406 queries/sec 8x H200 ASUSTeK ESC N8 H200 NVIDIA H200 99% of FP32 (0.3755 mAP) 100 ms OpenImages (800x800)
DLRMv2 591,162 queries/sec 8x H200 ASUSTeK ESC N8 H200 NVIDIA H200 99% of FP32 (AUC=80.31%) 60 ms Synthetic Multihot Criteo Dataset
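
The TTFT/TPOT column gives limits on time to first token and time per output token. The sketch below shows one common way to compute both values for a single request from token arrival timestamps and compare them with limits copied from the table. It is a simplification: MLPerf evaluates its constraints statistically across many queries under rules defined by MLCommons, and the helper names here (`ttft_tpot_ms`, `meets_constraints`) are hypothetical.

```python
# (TTFT limit ms, TPOT limit ms) copied from the Server table above
LIMITS_MS = {
    "DeepSeek R1": (2000, 80),
    "Llama3.1 405B": (6000, 175),
    "Llama3.1 405B Interactive": (4500, 80),
    "Llama2 70B": (2000, 200),
    "Llama2 70B Interactive": (450, 40),
}


def ttft_tpot_ms(request_start_s, token_times_s):
    """Return (TTFT, TPOT) in milliseconds for one request.

    TTFT is the delay until the first token arrives. TPOT is the average gap
    between consecutive tokens after the first one.
    """
    if not token_times_s:
        raise ValueError("at least one token timestamp is required")
    ttft = (token_times_s[0] - request_start_s) * 1000
    if len(token_times_s) == 1:
        return ttft, 0.0
    span = token_times_s[-1] - token_times_s[0]
    tpot = span / (len(token_times_s) - 1) * 1000
    return ttft, tpot


def meets_constraints(network, ttft_ms, tpot_ms):
    ttft_limit, tpot_limit = LIMITS_MS[network]
    return ttft_ms <= ttft_limit and tpot_ms <= tpot_limit


# Made-up request: first token after 0.4 s, then one token every 30 ms.
ttft, tpot = ttft_tpot_ms(0.0, [0.40, 0.43, 0.46, 0.49])
print(meets_constraints("Llama2 70B Interactive", ttft, tpot))
```

The interactive variants tighten both limits, which is why their throughput figures in the table are lower than the corresponding non-interactive rows on the same hardware.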

MLPerf™ v5.1 Inference Closed: DeepSeek R1 99% of FP16, Llama3.1 405B 99% of FP16, Llama2 70B Interactive 99.9% of FP32, Llama2 70B 99.9% of FP32, Stable Diffusion XL, Whisper, RetinaNet, RGAT, DLRM 99% of FP32 accuracy target: 5.1-0007, 5.1-0009, 5.1-0026, 5.1-0028, 5.1-0046, 5.1-0049, 5.1-0060, 5.1-0061, 5.1-0062, 5.1-0069, 5.1-0070, 5.1-0071, 5.1-0072, 5.1-0073, 5.1-0075, 5.1-0077, 5.1-0079, 5.1-0086. MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
Llama3.1 8B Max Sequence Length = 2,048
Llama2 70B Max Sequence Length = 1,024
For MLPerf™ various scenario data, click here
For MLPerf™ latency constraints, click here

LLM Inference Performance of NVIDIA Data Center Products

B200 Inference Performance - Max Throughput

Model Parallelism Input Length Output Length Throughput GPU Server Precision Framework GPU Version
Qwen3 235B A22B DEP4 1000 1000 5,764 output tokens/sec/gpu 4x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
Qwen3 235B A22B DEP4 1024 8192 3,389 output tokens/sec/gpu 4x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
Qwen3 235B A22B DEP4 1024 32768 1,255 output tokens/sec/gpu 4x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
Qwen3 235B A22B DEP4 8192 1024 1,410 output tokens/sec/gpu 4x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
Qwen3 235B A22B DEP4 32768 1024 319 output tokens/sec/gpu 4x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
Qwen3 30B A3B TP1 1000 1000 26,971 output tokens/sec/gpu 1x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
Qwen3 30B A3B TP1 1024 8192 13,497 output tokens/sec/gpu 1x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
Qwen3 30B A3B TP1 1024 32768 4,494 output tokens/sec/gpu 1x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
Qwen3 30B A3B TP1 8192 1024 5,735 output tokens/sec/gpu 1x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
Qwen3 30B A3B TP1 32768 1024 1,265 output tokens/sec/gpu 1x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
Llama v4 Maverick DEP4 1000 1000 11,337 output tokens/sec/gpu 4x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
Llama v4 Maverick DEP4 1024 8192 5,174 output tokens/sec/gpu 4x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
Llama v4 Maverick DEP4 1024 32768 2,204 output tokens/sec/gpu 4x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
Llama v4 Maverick DEP4 8192 1024 3,279 output tokens/sec/gpu 4x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
Llama v4 Maverick DEP4 32768 1024 859 output tokens/sec/gpu 4x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
GPT-OSS 20B TP1 1000 1000 53,812 output tokens/sec/gpu 1x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
GPT-OSS 20B TP1 1024 8192 34,702 output tokens/sec/gpu 1x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
GPT-OSS 20B TP1 1024 32768 14,589 output tokens/sec/gpu 1x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
GPT-OSS 20B TP1 8192 1024 11,904 output tokens/sec/gpu 1x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
GPT-OSS 20B TP1 32768 1024 2,645 output tokens/sec/gpu 1x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200

TP: Tensor Parallelism
PP: Pipeline Parallelism
DEP: Data Expert Parallelism
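
In the LLM tables on this page, the number in each parallelism label multiplies out to the GPU count shown for the row: DEP4 runs on 4 GPUs, and DEP2 PP2 also runs on 4. The sketch below parses such a label and, assuming the per-GPU figure is the system total divided evenly across GPUs, recovers a system-level throughput. Both the function name and that even-split assumption are mine rather than something the page states.

```python
import re
from math import prod


def gpus_required(parallelism: str) -> int:
    """Multiply the degrees in a label such as 'DEP2 PP2' or 'TP4'."""
    degrees = [int(n) for _, n in re.findall(r"\b(TP|PP|DEP)(\d+)\b", parallelism)]
    if not degrees:
        raise ValueError(f"no parallelism degrees found in {parallelism!r}")
    return prod(degrees)


# Qwen3 235B A22B on B200, 1000 input / 1000 output tokens (from the table)
per_gpu_tokens_per_sec = 5_764
num_gpus = gpus_required("DEP4")
print(num_gpus, per_gpu_tokens_per_sec * num_gpus)
```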

RTX PRO 6000 Blackwell Server Edition Inference Performance - Max Throughput

Model Parallelism Input Length Output Length Throughput GPU Server Precision Framework GPU Version
Qwen3 235B A22B DEP2 PP2 1000 1000 1,731 output tokens/sec/gpu 4x RTX PRO 6000 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.1 NVIDIA RTX PRO 6000 Blackwell Server Edition
Qwen3 235B A22B DEP8 1024 8192 711 output tokens/sec/gpu 8x RTX PRO 6000 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.1 NVIDIA RTX PRO 6000 Blackwell Server Edition
Qwen3 235B A22B DEP2 PP2 32768 1024 70 output tokens/sec/gpu 4x RTX PRO 6000 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.1 NVIDIA RTX PRO 6000 Blackwell Server Edition
Qwen3 30B A3B TP1 1000 1000 9,938 output tokens/sec/gpu 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.1 NVIDIA RTX PRO 6000 Blackwell Server Edition
Qwen3 30B A3B TP1 1024 8192 3,621 output tokens/sec/gpu 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.1 NVIDIA RTX PRO 6000 Blackwell Server Edition
Qwen3 30B A3B TP1 8192 1024 1,914 output tokens/sec/gpu 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.1 NVIDIA RTX PRO 6000 Blackwell Server Edition
Qwen3 30B A3B TP1 32768 1024 374 output tokens/sec/gpu 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.1 NVIDIA RTX PRO 6000 Blackwell Server Edition
Nemotron Nano 9B v2 TP1 500 500 1,711 output tokens/sec/gpu 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.2.0 NVIDIA RTX PRO 6000 Blackwell Server Edition
Nemotron Nano 9B v2 TP1 1000 4000 790 output tokens/sec/gpu 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.2.0 NVIDIA RTX PRO 6000 Blackwell Server Edition
Nemotron Nano 9B v2 TP1 4000 1000 1,238 output tokens/sec/gpu 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.2.0 NVIDIA RTX PRO 6000 Blackwell Server Edition
Nemotron Nano 12B v2 TP1 500 500 1,229 output tokens/sec/gpu 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.2.0 NVIDIA RTX PRO 6000 Blackwell Server Edition
Nemotron Nano 12B v2 TP1 1000 4000 1,202 output tokens/sec/gpu 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.2.0 NVIDIA RTX PRO 6000 Blackwell Server Edition
Nemotron Nano 12B v2 TP1 4000 1000 1,071 output tokens/sec/gpu 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.2.0 NVIDIA RTX PRO 6000 Blackwell Server Edition
Nemotron 3 Nano 30B TP1 500 500 6,616 output tokens/sec/gpu 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.2.0 NVIDIA RTX PRO 6000 Blackwell Server Edition
Nemotron 3 Nano 30B TP1 1000 4000 4,957 output tokens/sec/gpu 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.2.0 NVIDIA RTX PRO 6000 Blackwell Server Edition
Nemotron 3 Nano 30B TP1 4000 1000 5,353 output tokens/sec/gpu 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.2.0 NVIDIA RTX PRO 6000 Blackwell Server Edition

TP: Tensor Parallelism
PP: Pipeline Parallelism
DEP: Data Expert Parallelism

RTX PRO 4500 Blackwell Server Edition Inference Performance - Max Throughput

Model Parallelism Input Length Output Length Throughput GPU Server Precision Framework GPU Version
Nemotron Nano 9B v2 TP1 500 500 945 output tokens/sec/gpu 1x RTX PRO 4500 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.2.0 NVIDIA RTX PRO 4500 Blackwell Server Edition
Nemotron Nano 9B v2 TP1 1000 4000 410 output tokens/sec/gpu 1x RTX PRO 4500 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.2.0 NVIDIA RTX PRO 4500 Blackwell Server Edition
Nemotron Nano 9B v2 TP1 4000 1000 636 output tokens/sec/gpu 1x RTX PRO 4500 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.2.0 NVIDIA RTX PRO 4500 Blackwell Server Edition
Nemotron Nano 12B v2 TP1 500 500 678 output tokens/sec/gpu 1x RTX PRO 4500 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.2.0 NVIDIA RTX PRO 4500 Blackwell Server Edition
Nemotron Nano 12B v2 TP1 1000 4000 681 output tokens/sec/gpu 1x RTX PRO 4500 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.2.0 NVIDIA RTX PRO 4500 Blackwell Server Edition
Nemotron Nano 12B v2 TP1 4000 1000 566 output tokens/sec/gpu 1x RTX PRO 4500 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.2.0 NVIDIA RTX PRO 4500 Blackwell Server Edition

TP: Tensor Parallelism
PP: Pipeline Parallelism
DEP: Data Expert Parallelism

H200 Inference Performance - Max Throughput

Model Parallelism Input Length Output Length Throughput GPU Server Precision Framework GPU Version
Qwen3 235B A22B DEP4 1000 1000 3,288 output tokens/sec/gpu 4x H200 DGX H200 FP8 TensorRT-LLM 1.1 NVIDIA H200
Qwen3 235B A22B DEP4 1024 8192 1,417 output tokens/sec/gpu 4x H200 DGX H200 FP8 TensorRT-LLM 1.1 NVIDIA H200
Qwen3 235B A22B DEP4 8192 1024 627 output tokens/sec/gpu 4x H200 DGX H200 FP8 TensorRT-LLM 1.1 NVIDIA H200
Qwen3 235B A22B DEP4 32768 1024 134 output tokens/sec/gpu 4x H200 DGX H200 FP8 TensorRT-LLM 1.1 NVIDIA H200
Llama v4 Maverick DEP8 1000 1000 4,146 output tokens/sec/gpu 8x H200 DGX H200 FP8 TensorRT-LLM 1.1 NVIDIA H200
Llama v4 Maverick DEP8 1024 8192 1,157 output tokens/sec/gpu 8x H200 DGX H200 FP8 TensorRT-LLM 1.1 NVIDIA H200
Llama v4 Maverick DEP8 1024 32768 679 output tokens/sec/gpu 8x H200 DGX H200 FP8 TensorRT-LLM 1.1 NVIDIA H200
Llama v4 Maverick DEP8 8192 1024 1,276 output tokens/sec/gpu 8x H200 DGX H200 FP8 TensorRT-LLM 1.1 NVIDIA H200
GPT-OSS 20B TP1 1000 1000 13,858 output tokens/sec/gpu 1x H200 DGX H200 FP8 TensorRT-LLM 1.1 NVIDIA H200
GPT-OSS 20B TP1 1024 8192 12,743 output tokens/sec/gpu 1x H200 DGX H200 FP8 TensorRT-LLM 1.1 NVIDIA H200
GPT-OSS 20B TP1 1024 32768 - output tokens/sec/gpu 1x H200 DGX H200 FP8 TensorRT-LLM 1.1 NVIDIA H200
GPT-OSS 20B TP1 8192 1024 4,015 output tokens/sec/gpu 1x H200 DGX H200 FP8 TensorRT-LLM 1.1 NVIDIA H200
GPT-OSS 20B TP1 32768 1024 9,154 output tokens/sec/gpu 1x H200 DGX H200 FP8 TensorRT-LLM 1.1 NVIDIA H200

TP: Tensor Parallelism
PP: Pipeline Parallelism
DEP: Data Expert Parallelism

H100 Inference Performance - Max Throughput

Model Parallelism Input Length Output Length Throughput GPU Server Precision Framework GPU Version
Qwen3 235B A22B DEP8 1000 1000 1,932 output tokens/sec/gpu 8x H100 DGX H100 FP8 TensorRT-LLM 1.1 H100-SXM5-80GB
Qwen3 235B A22B DEP8 1024 8192 873 output tokens/sec/gpu 8x H100 DGX H100 FP8 TensorRT-LLM 1.1 H100-SXM5-80GB
GPT-OSS 20B TP1 1000 1000 11,557 output tokens/sec 1x H100 DGX H100 FP8 TensorRT-LLM 1.1 H100-SXM5-80GB
GPT-OSS 20B TP1 1024 8192 8,617 output tokens/sec 1x H100 DGX H100 FP8 TensorRT-LLM 1.1 H100-SXM5-80GB
GPT-OSS 20B TP1 8192 1024 3,366 output tokens/sec 1x H100 DGX H100 FP8 TensorRT-LLM 1.1 H100-SXM5-80GB
GPT-OSS 20B TP1 32768 1024 785 output tokens/sec 1x H100 DGX H100 FP8 TensorRT-LLM 1.1 H100-SXM5-80GB

TP: Tensor Parallelism
PP: Pipeline Parallelism
DEP: Data Expert Parallelism

L40S Inference Performance - Max Throughput

Model Parallelism Input Length Output Length Throughput GPU Server Precision Framework GPU Version
Llama v4 Scout TP2 PP2 128 2048 1,105 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v4 Scout TP2 PP2 128 4096 707 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v4 Scout TP4 2048 128 561 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v4 Scout TP4 5000 500 307 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v4 Scout TP2 PP2 500 2000 1,093 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v4 Scout TP2 PP2 1000 1000 920 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v4 Scout TP2 PP2 1000 2000 884 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v4 Scout TP2 PP2 2048 2048 615 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v3.3 70B TP4 128 2048 1,694 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v3.3 70B TP2 PP2 128 4096 972 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v3.3 70B TP4 500 2000 1,413 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v3.3 70B TP4 1000 1000 1,498 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v3.3 70B TP4 1000 2000 1,084 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v3.3 70B TP4 2048 2048 773 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v3.1 8B TP1 128 128 8,471 output tokens/sec 1x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v3.1 8B TP1 128 4096 2,888 output tokens/sec 1x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v3.1 8B TP1 2048 128 1,017 output tokens/sec 1x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v3.1 8B TP1 5000 500 863 output tokens/sec 1x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v3.1 8B TP1 500 2000 4,032 output tokens/sec 1x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v3.1 8B TP1 1000 2000 3,134 output tokens/sec 1x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v3.1 8B TP1 2048 2048 2,148 output tokens/sec 1x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v3.1 8B TP1 20000 2000 280 output tokens/sec 1x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S

TP: Tensor Parallelism
PP: Pipeline Parallelism
DEP: Data Expert Parallelism

Inference Performance of NVIDIA Data Center Products

B200 Inference Performance

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
Stable Video Diffusion 1 7.32 videos/min - 8202.75 1x B200 DGX B200 26.02-py3 Mixed Synthetic TensorRT 10.15.1 NVIDIA B200
Stable Diffusion XL 1 2.89 images/sec - 507.41 1x B200 DGX B200 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA B200
BEVFusion Head 1 2464.55 images/sec 6 images/sec/watt 0.41 1x B200 DGX B200 26.02-py3 INT8 Synthetic TensorRT 10.15.1 NVIDIA B200
Flux Image Generator 1 0.47 images/sec - 2130.4 1x B200 DGX B200 26.02-py3 FP4 Synthetic TensorRT 10.15.1 NVIDIA B200
HF Swin Base 128 4,948 samples/sec 6 samples/sec/watt 25.87 1x B200 DGX B200 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA B200
HF Swin Large 128 3,223 samples/sec 3 samples/sec/watt 39.71 1x B200 DGX B200 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA B200
HF ViT Base 2048 9,480 samples/sec 10 samples/sec/watt 216.04 1x B200 DGX B200 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA B200
HF ViT Large 1024 3,381 samples/sec 4 samples/sec/watt 302.83 1x B200 DGX B200 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA B200
Yolo v10 M 1 846.98 images/sec 1.19 images/sec/watt 1.18 1x B200 DGX B200 26.02-py3 INT8 Synthetic TensorRT 10.15.1 NVIDIA B200
Yolo v11 M 1 1034.36 images/sec 1.4 images/sec/watt 0.97 1x B200 DGX B200 26.02-py3 INT8 Synthetic TensorRT 10.15.1 NVIDIA B200

HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384
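
For many rows in the table above, throughput is close to batch size divided by latency, which is a handy sanity check when reading the columns. The sketch below applies that check to three rows copied from the table. It is an observation about these numbers, not a documented definition of how they were measured, and not every row fits: the Stable Diffusion XL row, for example, reports 2.89 images/sec at a latency of 507.41 ms.

```python
# (network, batch size, latency in ms, reported throughput per second)
B200_ROWS = [
    ("HF Swin Base", 128, 25.87, 4948.0),
    ("HF ViT Base", 2048, 216.04, 9480.0),
    ("Yolo v10 M", 1, 1.18, 846.98),
]


def estimated_throughput(batch_size: int, latency_ms: float) -> float:
    """Items per second if one batch completes every `latency_ms`."""
    return batch_size / (latency_ms / 1000.0)


for name, batch, latency_ms, reported in B200_ROWS:
    estimate = estimated_throughput(batch, latency_ms)
    rel_error = abs(estimate - reported) / reported
    print(f"{name:14s} estimate={estimate:9.1f} reported={reported:9.1f} "
          f"diff={rel_error:.2%}")
```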

RTX PRO 6000 Blackwell Server Edition Inference Performance

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
Stable Diffusion XL 1 1.05 images/sec - 954 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT 26.01-py3 FP8 Synthetic TensorRT 10.14.1 RTX PRO 6000 BSE
Flux Image Generator 1 0.2 images/sec - 5072 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT 26.01-py3 FP4 Synthetic TensorRT 10.14.1 RTX PRO 6000 BSE
BEVFusion Head 1 1738.51 images/sec 5 images/sec/watt 0.58 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT 26.02-py3 FP8 Synthetic TensorRT 10.15.1 RTX PRO 6000 BSE
HF Swin Base 32 2,719 samples/sec 5 samples/sec/watt 11.77 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT 26.02-py3 FP8 Synthetic TensorRT 10.15.1 RTX PRO 6000 BSE
HF Swin Large 32 1,517 samples/sec 3 samples/sec/watt 21.1 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT 26.02-py3 FP8 Synthetic TensorRT 10.15.1 RTX PRO 6000 BSE
HF ViT Base 32 4,011 samples/sec - 8 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT 26.01-py3 FP8 Synthetic TensorRT 10.14.1 RTX PRO 6000 BSE
HF ViT Large 16 1,280 samples/sec - 13 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT 26.01-py3 FP8 Synthetic TensorRT 10.14.1 RTX PRO 6000 BSE
Yolo v11 M 1 465 images/sec 1 images/sec/watt 2.15 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT 26.02-py3 FP8 Synthetic TensorRT 10.15.1 RTX PRO 6000 BSE

HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384

RTX PRO 4500 Blackwell Server Edition Inference Performance

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
Stable Diffusion XL 1 0.4 images/sec - 2514 1x RTX PRO 4500 Supermicro SYS-521GE-TNRT 26.01-py3 FP8 Synthetic TensorRT 10.14.1 RTX PRO 4500 BSE
Flux Image Generator 1 0.07 images/sec - 13816 1x RTX PRO 4500 Supermicro SYS-521GE-TNRT 26.01-py3 FP4 Synthetic TensorRT 10.14.1 RTX PRO 4500 BSE
HF Bert Large QAT 64 2,720 samples/sec - 24 1x RTX PRO 4500 Supermicro SYS-521GE-TNRT 26.01-py3 INT8 Synthetic TensorRT 10.14.1 RTX PRO 4500 BSE
HF Bert Large 64 1,507 samples/sec - 42 1x RTX PRO 4500 Supermicro SYS-521GE-TNRT 26.01-py3 Mixed Synthetic TensorRT 10.14.1 RTX PRO 4500 BSE
HF ViT Base 16 1,403 samples/sec - 11 1x RTX PRO 4500 Supermicro SYS-521GE-TNRT 26.01-py3 FP8 Synthetic TensorRT 10.14.1 RTX PRO 4500 BSE
HF ViT Large 4 449 samples/sec - 9 1x RTX PRO 4500 Supermicro SYS-521GE-TNRT 26.01-py3 FP8 Synthetic TensorRT 10.14.1 RTX PRO 4500 BSE

HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384

H200 Inference Performance

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
Stable Video Diffusion 1 4.83 videos/min - 12414.37 1x H200 DGX H200 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA H200
Stable Diffusion XL 1 1.61 images/sec - 760.29 1x H200 DGX H200 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA H200
BEVFusion Head 1 2006.49 images/sec 6 images/sec/watt 0.5 1x H200 DGX H200 26.02-py3 INT8 Synthetic TensorRT 10.15.1 NVIDIA H200
Flux Image Generator 1 0.2 images/sec - 5010.27 1x H200 DGX H200 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA H200
HF Swin Base 128 3,009 samples/sec 4 samples/sec/watt 42.54 1x H200 DGX H200 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA H200
HF Swin Large 128 1,821 samples/sec 3 samples/sec/watt 70.28 1x H200 DGX H200 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA H200
HF ViT Base 1024 4,943 samples/sec 7 samples/sec/watt 207.15 1x H200 DGX H200 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA H200
HF ViT Large 1024 1,702 samples/sec 2 samples/sec/watt 601.64 1x H200 DGX H200 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA H200
Yolo v10 M 1 431.92 images/sec 0.68 images/sec/watt 2.32 1x H200 DGX H200 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA H200
Yolo v11 M 1 518.04 images/sec 0.8 images/sec/watt 1.93 1x H200 DGX H200 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA H200

HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384

GH200 Inference Performance

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
BEVFusion Head 1 2006.78 images/sec 6 images/sec/watt 0.5 1x GH200 NVIDIA P3880 26.02-py3 INT8 Synthetic TensorRT 10.15.1 NVIDIA GH200
HF Swin Base 128 2,919 samples/sec 4 samples/sec/watt 43.84 1x GH200 NVIDIA P3880 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA GH200
HF Swin Large 128 1,752 samples/sec 3 samples/sec/watt 73.04 1x GH200 NVIDIA P3880 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA GH200
HF ViT Base 1024 4,728 samples/sec 7 samples/sec/watt 216.57 1x GH200 NVIDIA P3880 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA GH200
HF ViT Large 2048 1,629 samples/sec 2 samples/sec/watt 1256.97 1x GH200 NVIDIA P3880 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA GH200
Yolo v10 M 1 433.06 images/sec 0.66 images/sec/watt 2.31 1x GH200 NVIDIA P3880 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA GH200
Yolo v11 M 1 505.3 images/sec 0.8 images/sec/watt 1.98 1x GH200 NVIDIA P3880 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA GH200

HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384

H100 Inference Performance

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
Stable Video Diffusion 1 4.68 videos/min - 12811.33 1x H100 DGX H100 26.02-py3 FP8 Synthetic TensorRT 10.15.1 H100 SXM5-80GB
Stable Diffusion XL 1 1.54 images/sec - 780.31 1x H100 DGX H100 26.02-py3 FP8 Synthetic TensorRT 10.15.1 H100 SXM5-80GB
BEVFusion Head 1 1999.27 images/sec 6 images/sec/watt 0.5 1x H100 DGX H100 26.02-py3 INT8 Synthetic TensorRT 10.15.1 H100 SXM5-80GB
HF Swin Base 128 2,866 samples/sec 4 samples/sec/watt 44.67 1x H100 DGX H100 26.02-py3 FP8 Synthetic TensorRT 10.15.1 H100 SXM5-80GB
HF Swin Large 128 1,767 samples/sec 3 samples/sec/watt 72.42 1x H100 DGX H100 26.02-py3 FP8 Synthetic TensorRT 10.15.1 H100 SXM5-80GB
HF ViT Base 2048 4,864 samples/sec 7 samples/sec/watt 421.03 1x H100 DGX H100 26.02-py3 FP8 Synthetic TensorRT 10.15.1 H100 SXM5-80GB
HF ViT Large 2048 1,679 samples/sec 2 samples/sec/watt 1219.62 1x H100 DGX H100 26.02-py3 FP8 Synthetic TensorRT 10.15.1 H100 SXM5-80GB
Yolo v10 M 1 403.68 images/sec 0.68 images/sec/watt 2.48 1x H100 DGX H100 26.02-py3 FP8 Synthetic TensorRT 10.15.1 H100 SXM5-80GB
Yolo v11 M 1 476 images/sec 0.76 images/sec/watt 2.1 1x H100 DGX H100 26.02-py3 FP8 Synthetic TensorRT 10.15.1 H100 SXM5-80GB

HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384

L40S Inference Performance

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
BEVFusion Head 1 1958.07 images/sec 7 images/sec/watt 0.51 1x L40S Supermicro SYS-521GE-TNRT 26.02-py3 INT8 Synthetic TensorRT 10.15.1 NVIDIA L40S
HF Swin Base 32 1,396 samples/sec 4 samples/sec/watt 22.92 1x L40S Supermicro SYS-521GE-TNRT 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA L40S
HF Swin Large 32 716 samples/sec 2 samples/sec/watt 44.72 1x L40S Supermicro SYS-521GE-TNRT 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA L40S
HF ViT Base 1024 1,662 samples/sec 5 samples/sec/watt 616.09 1x L40S Supermicro SYS-521GE-TNRT 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA L40S
HF ViT Large 1024 597 samples/sec 2 samples/sec/watt 1716.6 1x L40S Supermicro SYS-521GE-TNRT 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA L40S
Yolo v10 M 1 274.78 images/sec 0.79 images/sec/watt 3.64 1x L40S Supermicro SYS-521GE-TNRT 26.02-py3 INT8 Synthetic TensorRT 10.15.1 NVIDIA L40S
Yolo v11 M 1 310 images/sec 0.9 images/sec/watt 3.23 1x L40S Supermicro SYS-521GE-TNRT 26.02-py3 INT8 Synthetic TensorRT 10.15.1 NVIDIA L40S

HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384

Training to Convergence

Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. This is the best methodology to test whether AI systems are ready to be deployed in the field to deliver meaningful results.

AI Pipeline

NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.