AI Inference

The days of raw speed being the only metric that matters are behind us. Now it’s about throughput, efficiency, and economics at scale. As AI evolves from providing one-shot answers to engaging in multi-step reasoning, each query generates far more tokens, which sharply increases the compute demand of inference and makes its economics more important. Metrics such as tokens per watt, cost per million tokens, and tokens per second per user are crucial alongside throughput. For power-limited AI factories, NVIDIA's continuous software improvements translate into higher token revenue over time from the same hardware.
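The metrics named above can be derived from a handful of measurements. The following is a minimal sketch, not an official NVIDIA formula: the `InferenceWindow` fields and the example numbers are hypothetical, and "tokens per watt" is assumed here to mean tokens per second per watt (equivalently, tokens per joule).

```python
from dataclasses import dataclass


@dataclass
class InferenceWindow:
    """One measurement window for a deployment (all fields are hypothetical inputs)."""
    output_tokens: int        # tokens generated during the window
    seconds: float            # wall-clock length of the window
    avg_power_watts: float    # average power drawn by the system
    cost_per_hour_usd: float  # all-in hourly cost of the system
    concurrent_users: int     # users served at the same time


def throughput(w: InferenceWindow) -> float:
    """Aggregate output tokens per second."""
    return w.output_tokens / w.seconds


def tokens_per_second_per_watt(w: InferenceWindow) -> float:
    """Energy efficiency: tokens/sec per watt, i.e. tokens per joule."""
    return throughput(w) / w.avg_power_watts


def cost_per_million_tokens(w: InferenceWindow) -> float:
    """Hourly cost divided by hourly token output, scaled to one million tokens."""
    tokens_per_hour = throughput(w) * 3600.0
    return w.cost_per_hour_usd / tokens_per_hour * 1_000_000


def tokens_per_second_per_user(w: InferenceWindow) -> float:
    """Responsiveness: average generation rate seen by each concurrent user."""
    return throughput(w) / w.concurrent_users


# Made-up example: 36M tokens in one hour at 10 kW, $36/hour, 200 users.
window = InferenceWindow(36_000_000, 3600.0, 10_000.0, 36.0, 200)
print(throughput(window))
print(tokens_per_second_per_watt(window))
print(cost_per_million_tokens(window))
print(tokens_per_second_per_user(window))
```

With a fixed power budget, any software change that raises `throughput` raises tokens per watt and lowers cost per million tokens in the same proportion, which is why the paragraph ties software improvements to token revenue.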

Pareto curves illustrate how NVIDIA Blackwell provides the best balance across the full spectrum of production priorities, including cost, energy efficiency, throughput, and responsiveness. Optimizing systems for a single scenario can limit deployment flexibility, leading to inefficiencies at other points on the curve. NVIDIA’s full-stack design approach ensures efficiency and value across multiple real-life production scenarios. Blackwell’s leadership stems from its extreme hardware-software co-design, embodying a full-stack architecture built for speed, efficiency, and scalability.
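To make the idea of a Pareto curve concrete, the sketch below extracts the non-dominated operating points from a set of measurements. It assumes two axes where higher is better on both (tokens/sec per user for responsiveness, tokens/sec per GPU for throughput); the sample points are invented for illustration and are not NVIDIA results.

```python
from typing import List, Tuple

# (tokens/sec per user, tokens/sec per GPU); higher is better on both axes.
Point = Tuple[float, float]


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points that no other point beats on both axes at once."""
    # Walk from the most responsive point to the least responsive one.
    ordered = sorted(points, key=lambda p: (-p[0], -p[1]))
    frontier: List[Point] = []
    best_throughput = float("-inf")
    for per_user, per_gpu in ordered:
        # A point is kept only if it improves throughput over every point
        # that is at least as responsive.
        if per_gpu > best_throughput:
            frontier.append((per_user, per_gpu))
            best_throughput = per_gpu
    return frontier


# Hypothetical operating points from sweeping batch size and parallelism.
measurements = [(10, 1000), (50, 800), (100, 400), (50, 500), (20, 900), (100, 300)]
print(pareto_frontier(measurements))
```

A system tuned for only one corner of this plot contributes a single point; a system that stays on or near the frontier across the sweep is the one that remains efficient when the deployment scenario changes.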

Learn more in the blog "Mixture of Experts Powers the Most Intelligent Frontier AI Models, Runs 10x Faster on NVIDIA Blackwell NVL72."

Explore the methodology used to obtain these results and learn how to replicate the tests by executing Benchmarking Recipes yourself.

MLPerf Inference v6.0 Performance Benchmarks

Offline Scenario - Closed Division

Network Throughput GPU Server GPU Version QSL Size Target Accuracy Dataset
DeepSeek R1 2,494,310 tokens/sec 288x GB300 NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) NVIDIA GB300 4388 99% of FP16 (exact match 81.9132%) mlperf_deepseek_r1
486,141 tokens/sec 72x GB200 NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) NVIDIA GB200 4388 99% of FP16 (exact match 81.9132%) mlperf_deepseek_r1
70,326 tokens/sec 8x B300 NVIDIA DGX B300 (8x B300-SXM-270GB, TensorRT) NVIDIA B300 4388 99% of FP16 (exact match 81.9132%) mlperf_deepseek_r1
58,582 tokens/sec 8x B200 Nebius B200 n1 (8x B200-SXM-180GB, TensorRT) NVIDIA B200 4388 99% of FP16 (exact match 81.9132%) mlperf_deepseek_r1
gpt-oss 120B 1,046,150 tokens/sec 72x GB300 Nebius GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) NVIDIA GB300 6396 99% of 83.13% AIME25, GPQA Diamond, LiveCodeBench v6
879,542 tokens/sec 72x GB200 NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) NVIDIA GB200 6396 99% of 83.13% AIME25, GPQA Diamond, LiveCodeBench v6
111,496 tokens/sec 8x B300 Cisco UCS C880A M8 (8x NVIDIA B300-SXM-270GB, TensorRT) NVIDIA B300 6396 99% of 83.13% AIME25, GPQA Diamond, LiveCodeBench v6
93,071 tokens/sec 8x B200 LLM-D v0.5.0,Openshift 4.20.12,NVIDIA 8xB200-SXM-180GB NVIDIA B200 6396 99% of 83.13% AIME25, GPQA Diamond, LiveCodeBench v6
Qwen3-VL 235B 61 tokens/sec 4x GB300 NVIDIA GB300 NVL72 (4x GB300-288GB_aarch64, TensorRT) NVIDIA GB300 48289 99% of BF16 (Category Hierarchical F1 Score >= 0.7824) Shopify Product Catalogue
44 tokens/sec 4x GB200 NVIDIA GB200 NVL72 (4x GB200-186GB_aarch64, TensorRT) NVIDIA GB200 48289 99% of BF16 (Category Hierarchical F1 Score >= 0.7824) Shopify Product Catalogue
78 tokens/sec 8x B300 Nebius B300 n1 (8x B300-SXM-270GB, TensorRT) NVIDIA B300 48289 99% of BF16 (Category Hierarchical F1 Score >= 0.7824) Shopify Product Catalogue
79 tokens/sec 8x B200 Dell B200,8xB200-SXM-180GB,RHEL 10.1,vLLM CentML:mlperf-inf-mm-q3vl-v6.0 NVIDIA B200 48289 99% of BF16 (Category Hierarchical F1 Score >= 0.7824) Shopify Product Catalogue
Llama3.1 405B 19,512 tokens/sec 72x GB300 NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) NVIDIA GB300 8313 99% of FP16 ((GovReport + LongDataCollections + 65 samples from LongBench) rougeL=21.6666, (remaining samples of the dataset) exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) Subset of LongBench, LongDataCollections, Ruler, GovReport
15,462 tokens/sec 72x GB200 NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) NVIDIA GB200 8313 99% of FP16 ((GovReport + LongDataCollections + 65 samples from LongBench) rougeL=21.6666, (remaining samples of the dataset) exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) Subset of LongBench, LongDataCollections, Ruler, GovReport
1,971 tokens/sec 8x B300 Cisco UCS C880A M8 (8x NVIDIA B300-SXM-270GB, TensorRT) NVIDIA B300 8313 99% of FP16 ((GovReport + LongDataCollections + 65 samples from LongBench) rougeL=21.6666, (remaining samples of the dataset) exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) Subset of LongBench, LongDataCollections, Ruler, GovReport
1,350 tokens/sec 8x B200 NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT) NVIDIA B200 8313 99% of FP16 ((GovReport + LongDataCollections + 65 samples from LongBench) rougeL=21.6666, (remaining samples of the dataset) exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama2 70B 1,126,850 tokens/sec 72x GB300 NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) NVIDIA GB300 24576 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) OpenOrca (max_seq_len=1024)
888,054 tokens/sec 72x GB200 NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) NVIDIA GB200 24576 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) OpenOrca (max_seq_len=1024)
112,954 tokens/sec 8x B300 NVIDIA DGX B300 (8x B300-SXM-270GB, TensorRT) NVIDIA B300 24576 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) OpenOrca (max_seq_len=1024)
104,572 tokens/sec 8x B200 HPE ProLiant Compute XD685 (8x NVIDIA B200 180GB, TensorRT) NVIDIA B200 24576 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) OpenOrca (max_seq_len=1024)
Llama3.1 8B 166,745 tokens/sec 8x B300 XA NB3I-E12 NVIDIA B300 13368 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=8167644) CNN Dailymail (v3.0.0, max_seq_len=2048)
160,403 tokens/sec 8x B200 NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT) NVIDIA B200 13368 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=8167644) CNN Dailymail (v3.0.0, max_seq_len=2048)
Wan2.2 0.037 samples/sec 4x GB300 NVIDIA GB300 NVL72 (4x GB300-288GB_aarch64, TensorRT) NVIDIA GB300 248 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) VBench prompts
0.027 samples/sec 4x GB200 NVIDIA GB200 NVL72 (4x GB200-186GB_aarch64, TensorRT) NVIDIA GB200 248 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) VBench prompts
0.059 samples/sec 8x B300 NVIDIA DGX B300 (8x B300-SXM-270GB, TensorRT) NVIDIA B300 248 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) VBench prompts
0.046 samples/sec 8x B200 NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT) NVIDIA B200 248 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) VBench prompts
DLRMv3 104,637 samples/sec 72x GB200 NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) NVIDIA GB200 34996 99% of FP32 and 99.9% of FP32 (AUC=80.31%) Synthetic Streaming 100B Dataset
10,737 samples/sec 8x B200 Camarero PDI200A2HG-810 (8x B200-SXM-180GB, TensorRT) NVIDIA B200 34996 99% of FP32 and 99.9% of FP32 (AUC=80.31%) Synthetic Streaming 100B Dataset
Whisper 50,562 samples/sec 8x B300 NVIDIA DGX B300 (8x B300-SXM-270GB, TensorRT) NVIDIA B300 1633 99% of FP32 and 99.9% of FP32 (WER=2.0671%) LibriSpeech
49,327 samples/sec 8x B200 NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT) NVIDIA B200 1633 99% of FP32 and 99.9% of FP32 (WER=2.0671%) LibriSpeech
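Because the submissions above use very different GPU counts, it can help to normalize throughput per GPU before comparing rows. The sketch below does this for the DeepSeek R1 offline rows, with the figures copied from the table. It is a rough reading aid only: the rows come from different systems and submitters, so the per-GPU numbers are not a controlled comparison.

```python
# (system label, GPU count, tokens/sec) for DeepSeek R1, Offline, from the table above.
deepseek_r1_offline = [
    ("GB300 NVL72", 288, 2_494_310),
    ("GB200 NVL72", 72, 486_141),
    ("DGX B300", 8, 70_326),
    ("Nebius B200 n1", 8, 58_582),
]


def per_gpu_throughput(rows):
    """Map each system label to its aggregate throughput divided by GPU count."""
    return {label: tokens_per_sec / gpus for label, gpus, tokens_per_sec in rows}


per_gpu = per_gpu_throughput(deepseek_r1_offline)
for label, value in per_gpu.items():
    print(f"{label}: {value:,.0f} tokens/sec/GPU")
```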

Server Scenario - Closed Division

Network Throughput GPU Server GPU Version QSL Size Target Accuracy MLPerf Server Latency Constraints (ms) Dataset
DeepSeek R1 1,555,110 tokens/sec 288x GB300 NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) NVIDIA GB300 4388 99% of FP16 (exact match 81.9132%) TTFT/TPOT: 2000 ms/80 ms mlperf_deepseek_r1
336,106 tokens/sec 72x GB200 NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) NVIDIA GB200 4388 99% of FP16 (exact match 81.9132%) TTFT/TPOT: 2000 ms/80 ms mlperf_deepseek_r1
60,413 tokens/sec 8x B300 Nebius B300 n1 (8x B300-SXM-270GB, TensorRT) NVIDIA B300 4388 99% of FP16 (exact match 81.9132%) TTFT/TPOT: 2000 ms/80 ms mlperf_deepseek_r1
51,693 tokens/sec 8x B200 Nebius B200 n1 (8x B200-SXM-180GB, TensorRT) NVIDIA B200 4388 99% of FP16 (exact match 81.9132%) TTFT/TPOT: 2000 ms/80 ms mlperf_deepseek_r1
gpt-oss 120B 1,096,770 tokens/sec 72x GB300 Nebius GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) NVIDIA GB300 6396 99% of 83.13% TTFT/TPOT: 3000 ms/80 ms AIME25, GPQA Diamond, LiveCodeBench v6
899,218 tokens/sec 72x GB200 NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) NVIDIA GB200 6396 99% of 83.13% TTFT/TPOT: 3000 ms/80 ms AIME25, GPQA Diamond, LiveCodeBench v6
110,655 tokens/sec 8x B300 Cisco UCS C880A M8 (8x NVIDIA B300-SXM-270GB, TensorRT) NVIDIA B300 6396 99% of 83.13% TTFT/TPOT: 3000 ms/80 ms AIME25, GPQA Diamond, LiveCodeBench v6
87,444 tokens/sec 8x B200 Nebius B200 n1 (8x B200-SXM-180GB, TensorRT) NVIDIA B200 6396 99% of 83.13% TTFT/TPOT: 3000 ms/80 ms AIME25, GPQA Diamond, LiveCodeBench v6
Qwen3-VL 235B 43 tokens/sec 4x GB300 Nebius GB300 NVL72 (4x GB300-288GB_aarch64, TensorRT) NVIDIA GB300 48289 99% of BF16 (Category Hierarchical F1 Score >= 0.7824) 12 s Shopify Product Catalogue
38 tokens/sec 4x GB200 NVIDIA GB200 NVL72 (4x GB200-186GB_aarch64, TensorRT) NVIDIA GB200 48289 99% of BF16 (Category Hierarchical F1 Score >= 0.7824) 12 s Shopify Product Catalogue
45 tokens/sec 8x B300 Nebius B300 n1 (8x B300-SXM-270GB, TensorRT) NVIDIA B300 48289 99% of BF16 (Category Hierarchical F1 Score >= 0.7824) 12 s Shopify Product Catalogue
68 tokens/sec 8x B200 Dell B200,8xB200-SXM-180GB,RHEL 10.1,vLLM CentML:mlperf-inf-mm-q3vl-v6.0 NVIDIA B200 48289 99% of BF16 (Category Hierarchical F1 Score >= 0.7824) 12 s Shopify Product Catalogue
Llama3.1 405B 18,628 tokens/sec 72x GB300 NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) NVIDIA GB300 8313 99% of FP16 ((GovReport + LongDataCollections + 65 samples from LongBench) rougeL=21.6666, (remaining samples of the dataset) exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) TTFT/TPOT: 6000 ms/175 ms Subset of LongBench, LongDataCollections, Ruler, GovReport
14,134 tokens/sec 72x GB200 NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) NVIDIA GB200 8313 99% of FP16 ((GovReport + LongDataCollections + 65 samples from LongBench) rougeL=21.6666, (remaining samples of the dataset) exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) TTFT/TPOT: 6000 ms/175 ms Subset of LongBench, LongDataCollections, Ruler, GovReport
1,484 tokens/sec 8x B300 QuantaGrid D75H-10U (8x B300-SXM-270GB, TensorRT) NVIDIA B300 8313 99% of FP16 ((GovReport + LongDataCollections + 65 samples from LongBench) rougeL=21.6666, (remaining samples of the dataset) exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) TTFT/TPOT: 6000 ms/175 ms Subset of LongBench, LongDataCollections, Ruler, GovReport
984 tokens/sec 8x B200 NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT) NVIDIA B200 8313 99% of FP16 ((GovReport + LongDataCollections + 65 samples from LongBench) rougeL=21.6666, (remaining samples of the dataset) exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) TTFT/TPOT: 6000 ms/175 ms Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama2 70B 868,278 tokens/sec 72x GB300 NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) NVIDIA GB300 24576 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) TTFT/TPOT: 2000 ms/200 ms OpenOrca (max_seq_len=1024)
810,104 tokens/sec 72x GB200 NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) NVIDIA GB200 24576 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) TTFT/TPOT: 2000 ms/200 ms OpenOrca (max_seq_len=1024)
108,392 tokens/sec 8x B300 PowerEdge XE9780L (8x B300-SXM-270GB, TensorRT) NVIDIA B300 24576 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) TTFT/TPOT: 2000 ms/200 ms OpenOrca (max_seq_len=1024)
103,627 tokens/sec 8x B200 HPE ProLiant Compute XD685 (8x NVIDIA B200 180GB, TensorRT) NVIDIA B200 24576 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) TTFT/TPOT: 2000 ms/200 ms OpenOrca (max_seq_len=1024)
Llama3.1 8B 148,067 tokens/sec 8x B300 XA NB3I-E12 NVIDIA B300 13368 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=8167644) TTFT/TPOT: 2000 ms/100 ms CNN Dailymail (v3.0.0, max_seq_len=2048)
131,270 tokens/sec 8x B200 HPE ProLiant Compute XD685 (8x NVIDIA B200 180GB, TensorRT) NVIDIA B200 13368 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=8167644) TTFT/TPOT: 2000 ms/100 ms CNN Dailymail (v3.0.0, max_seq_len=2048)
Wan2.2** 31 seconds 4x GB300 NVIDIA GB300 NVL72 (4x GB300-288GB_aarch64, TensorRT) NVIDIA GB300 248 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) N/A VBench prompts
40 seconds 4x GB200 NVIDIA GB200 NVL72 (4x GB200-186GB_aarch64, TensorRT) NVIDIA GB200 248 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) N/A VBench prompts
21 seconds 8x B300 G894-SD3-AAX7 NVIDIA B300 248 FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] N/A VBench prompts
25 seconds 8x B200 NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT) NVIDIA B200 248 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) N/A VBench prompts
DLRMv3 99,997 queries/sec 72x GB200 NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) NVIDIA GB200 34996 99% of FP32 (AUC=80.31%) 80 ms Synthetic Streaming 100B Dataset
10,007 queries/sec 8x B200 NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT) NVIDIA B200 34996 99% of FP32 (AUC=80.31%) 80 ms Synthetic Streaming 100B Dataset

Interactive Scenario - Closed Division

Network Throughput GPU Server GPU Version QSL Size Target Accuracy MLPerf Server Latency Constraints (ms) Dataset
DeepSeek R1 250,634 tokens/sec 72x GB300 NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) NVIDIA GB300 4388 99% of FP16 (exact match 81.9132%) TTFT/TPOT: 1500 ms/15 ms mlperf_deepseek_r1
240,318 tokens/sec 72x GB200 NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) NVIDIA GB200 4388 99% of FP16 (exact match 81.9132%) TTFT/TPOT: 1500 ms/15 ms mlperf_deepseek_r1
4,935 tokens/sec 8x B300 G894-SD3-AAX7 NVIDIA B300 4388 99% of FP16 (exact match 81.9132%) TTFT/TPOT: 1500 ms/15 ms mlperf_deepseek_r1
gpt-oss 120B 677,199 tokens/sec 72x GB300 NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) NVIDIA GB300 6396 99% of 83.13% TTFT/TPOT: 2000 ms/20 ms AIME25, GPQA Diamond, LiveCodeBench v6
624,929 tokens/sec 72x GB200 NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) NVIDIA GB200 6396 99% of 83.13% TTFT/TPOT: 2000 ms/20 ms AIME25, GPQA Diamond, LiveCodeBench v6
26,006 tokens/sec 8x B300 XA NB3I-E12 NVIDIA B300 6396 99% of 83.13% TTFT/TPOT: 2000 ms/20 ms AIME25, GPQA Diamond, LiveCodeBench v6
13,155 tokens/sec 8x B200 Nebius B200 n1 (8x B200-SXM-180GB, TensorRT) NVIDIA B200 6396 99% of 83.13% TTFT/TPOT: 2000 ms/20 ms AIME25, GPQA Diamond, LiveCodeBench v6
Llama3.1 405B 18,365 tokens/sec 72x GB300 NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) NVIDIA GB300 8313 99% of FP16 ((GovReport + LongDataCollections + 65 samples from LongBench) rougeL=21.6666, (remaining samples of the dataset) exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) TTFT/TPOT: 4500 ms/80 ms Subset of LongBench, LongDataCollections, Ruler, GovReport
14,010 tokens/sec 72x GB200 NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) NVIDIA GB200 8313 99% of FP16 ((GovReport + LongDataCollections + 65 samples from LongBench) rougeL=21.6666, (remaining samples of the dataset) exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) TTFT/TPOT: 4500 ms/80 ms Subset of LongBench, LongDataCollections, Ruler, GovReport
765 tokens/sec 8x B300 G894-SD3-AAX7 NVIDIA B300 8313 99% of FP16 ((GovReport + LongDataCollections + 65 samples from LongBench) rougeL=21.6666, (remaining samples of the dataset) exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) TTFT/TPOT: 4500 ms/80 ms Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama2 70B 814,128 tokens/sec 72x GB300 NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) NVIDIA GB300 24576 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) TTFT/TPOT: 450 ms/40 ms OpenOrca (max_seq_len=1024)
754,855 tokens/sec 72x GB200 NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) NVIDIA GB200 24576 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) TTFT/TPOT: 450 ms/40 ms OpenOrca (max_seq_len=1024)
70,724 tokens/sec 8x B300 PowerEdge XE9780L (8x B300-SXM-270GB, TensorRT) NVIDIA B300 24576 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) TTFT/TPOT: 450 ms/40 ms OpenOrca (max_seq_len=1024)
61,300 tokens/sec 8x B200 HPE ProLiant Compute XD685 (8x NVIDIA B200 180GB, TensorRT) NVIDIA B200 24576 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) TTFT/TPOT: 450 ms/40 ms OpenOrca (max_seq_len=1024)
Llama3.1 8B 128,633 tokens/sec 8x B300 G894-SD3-AAX7 NVIDIA B300 13368 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=8167644) TTFT/TPOT: 500 ms/30 ms CNN Dailymail (v3.0.0, max_seq_len=2048)
128,750 tokens/sec 8x B200 NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT) NVIDIA B200 13368 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=8167644) TTFT/TPOT: 500 ms/30 ms CNN Dailymail (v3.0.0, max_seq_len=2048)

**The primary metric for Wan2.2 in the Server scenario is measured in seconds (lower is better).
MLPerf™ v6.0 Inference Closed Division. NVIDIA platform results from the following entries: 6.0-0006, 6.0-0010, 6.0-0024, 6.0-0039, 6.0-0040, 6.0-0048, 6.0-0062, 6.0-0072, 6.0-0073, 6.0-0074, 6.0-0075, 6.0-0076, 6.0-0077, 6.0-0078, 6.0-0080, 6.0-0081, 6.0-0083, 6.0-0084, 6.0-0085, 6.0-0089, 6.0-0091, 6.0-0094, 6.0-0098. MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
For data on the various MLPerf™ scenarios, click here
For MLPerf™ latency constraints, click here
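The TTFT/TPOT limits listed in the Server and Interactive tables bound time to first token and time per output token. The sketch below shows one common way to compute both from token arrival timestamps and compare them with a limit pair. It is an illustration only: the function names and timestamps are hypothetical, it checks a single request, and it does not reproduce how the official MLPerf LoadGen aggregates latencies over a run.

```python
from typing import Sequence


def ttft_ms(request_sent_ms: float, token_times_ms: Sequence[float]) -> float:
    """Time to first token: delay between sending the request and the first token."""
    return token_times_ms[0] - request_sent_ms


def tpot_ms(token_times_ms: Sequence[float]) -> float:
    """Time per output token: average gap between tokens after the first one."""
    if len(token_times_ms) < 2:
        return 0.0  # a single-token response has no inter-token gap
    span = token_times_ms[-1] - token_times_ms[0]
    return span / (len(token_times_ms) - 1)


def meets_limits(request_sent_ms: float, token_times_ms: Sequence[float],
                 ttft_limit_ms: float, tpot_limit_ms: float) -> bool:
    """True if this one request satisfies both limits."""
    return (ttft_ms(request_sent_ms, token_times_ms) <= ttft_limit_ms
            and tpot_ms(token_times_ms) <= tpot_limit_ms)


# Hypothetical request: sent at t=0 ms, four tokens arrive at these times.
fast = [1200.0, 1210.0, 1225.0, 1240.0]
slow = [1800.0, 1850.0, 1900.0]

# DeepSeek R1 limits from the tables: Server 2000 ms/80 ms, Interactive 1500 ms/15 ms.
print(meets_limits(0.0, fast, 1500.0, 15.0))
print(meets_limits(0.0, slow, 2000.0, 80.0))
print(meets_limits(0.0, slow, 1500.0, 15.0))
```

The tighter Interactive limits explain why the same hardware reports lower throughput in that scenario: smaller batches are needed to keep per-token latency within bounds.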

LLM Inference Performance of NVIDIA Data Center Products

B200 Inference Performance - Max Throughput

Model Parallelism Input Length Output Length Throughput GPU Server Precision Framework GPU Version
Qwen3 235B A22B DEP4 1000 1000 5,764 output tokens/sec/gpu 4x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
Qwen3 235B A22B DEP4 1024 8192 3,389 output tokens/sec/gpu 4x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
Qwen3 235B A22B DEP4 1024 32768 1,255 output tokens/sec/gpu 4x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
Qwen3 235B A22B DEP4 8192 1024 1,410 output tokens/sec/gpu 4x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
Qwen3 235B A22B DEP4 32768 1024 319 output tokens/sec/gpu 4x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
Qwen3 30B A3B TP1 1000 1000 26,971 output tokens/sec/gpu 1x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
Qwen3 30B A3B TP1 1024 8192 13,497 output tokens/sec/gpu 1x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
Qwen3 30B A3B TP1 1024 32768 4,494 output tokens/sec/gpu 1x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
Qwen3 30B A3B TP1 8192 1024 5,735 output tokens/sec/gpu 1x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
Qwen3 30B A3B TP1 32768 1024 1,265 output tokens/sec/gpu 1x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
Llama v4 Maverick DEP4 1000 1000 11,337 output tokens/sec/gpu 4x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
Llama v4 Maverick DEP4 1024 8192 5,174 output tokens/sec/gpu 4x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
Llama v4 Maverick DEP4 1024 32768 2,204 output tokens/sec/gpu 4x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
Llama v4 Maverick DEP4 8192 1024 3,279 output tokens/sec/gpu 4x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
Llama v4 Maverick DEP4 32768 1024 859 output tokens/sec/gpu 4x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
GPT-OSS 20B TP1 1000 1000 53,812 output tokens/sec/gpu 1x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
GPT-OSS 20B TP1 1024 8192 34,702 output tokens/sec/gpu 1x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
GPT-OSS 20B TP1 1024 32768 14,589 output tokens/sec/gpu 1x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
GPT-OSS 20B TP1 8192 1024 11,904 output tokens/sec/gpu 1x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200
GPT-OSS 20B TP1 32768 1024 2,645 output tokens/sec/gpu 1x B200 DGX B200 FP4 TensorRT-LLM 1.1 NVIDIA B200

TP: Tensor Parallelism
PP: Pipeline Parallelism
DEP: Data Expert Parallelism
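The Parallelism column combines these abbreviations with a degree, for example `DEP4` or `TP2 PP2`. As a reading aid, the sketch below parses such a string and converts a per-GPU figure into an aggregate one. It assumes the GPU count is the product of the listed degrees, which matches the GPU column in the rows on this page but is an inference from the tables rather than a documented rule; the helper names are hypothetical.

```python
import math
import re
from typing import Dict


def parse_parallelism(spec: str) -> Dict[str, int]:
    """Split a string such as 'DEP2 PP2' into {'DEP': 2, 'PP': 2}."""
    return {name: int(degree) for name, degree in re.findall(r"([A-Z]+)(\d+)", spec)}


def gpu_count(spec: str) -> int:
    """Assumed GPU count: the product of all parallelism degrees."""
    return math.prod(parse_parallelism(spec).values())


def aggregate_throughput(per_gpu_tokens_per_sec: float, spec: str) -> float:
    """Scale a tokens/sec/gpu figure to the whole configuration."""
    return per_gpu_tokens_per_sec * gpu_count(spec)


# Qwen3 235B A22B on B200, DEP4, 1000/1000: 5,764 output tokens/sec/gpu.
print(parse_parallelism("DEP4"))
print(gpu_count("DEP2 PP2"))
print(aggregate_throughput(5764, "DEP4"))
```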

RTX PRO 6000 Blackwell Server Edition Inference Performance - Max Throughput

Model Parallelism Input Length Output Length Throughput GPU Server Precision Framework GPU Version
Qwen3 235B A22B DEP2 PP2 1000 1000 1,731 output tokens/sec/gpu 4x RTX PRO 6000 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.1 NVIDIA RTX PRO 6000 Blackwell Server Edition
Qwen3 235B A22B DEP8 1024 8192 711 output tokens/sec/gpu 8x RTX PRO 6000 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.1 NVIDIA RTX PRO 6000 Blackwell Server Edition
Qwen3 235B A22B DEP2 PP2 32768 1024 70 output tokens/sec/gpu 4x RTX PRO 6000 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.1 NVIDIA RTX PRO 6000 Blackwell Server Edition
Qwen3 30B A3B TP1 1000 1000 9,938 output tokens/sec/gpu 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.1 NVIDIA RTX PRO 6000 Blackwell Server Edition
Qwen3 30B A3B TP1 1024 8192 3,621 output tokens/sec/gpu 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.1 NVIDIA RTX PRO 6000 Blackwell Server Edition
Qwen3 30B A3B TP1 8192 1024 1,914 output tokens/sec/gpu 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.1 NVIDIA RTX PRO 6000 Blackwell Server Edition
Qwen3 30B A3B TP1 32768 1024 374 output tokens/sec/gpu 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.1 NVIDIA RTX PRO 6000 Blackwell Server Edition
Nemotron Nano 9B v2 TP1 500 500 1,711 output tokens/sec/gpu 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.2.0 NVIDIA RTX PRO 6000 Blackwell Server Edition
Nemotron Nano 9B v2 TP1 1000 4000 790 output tokens/sec/gpu 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.2.0 NVIDIA RTX PRO 6000 Blackwell Server Edition
Nemotron Nano 9B v2 TP1 4000 1000 1,238 output tokens/sec/gpu 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.2.0 NVIDIA RTX PRO 6000 Blackwell Server Edition
Nemotron Nano 12B v2 TP1 500 500 1,229 output tokens/sec/gpu 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.2.0 NVIDIA RTX PRO 6000 Blackwell Server Edition
Nemotron Nano 12B v2 TP1 1000 4000 1,202 output tokens/sec/gpu 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.2.0 NVIDIA RTX PRO 6000 Blackwell Server Edition
Nemotron Nano 12B v2 TP1 4000 1000 1,071 output tokens/sec/gpu 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.2.0 NVIDIA RTX PRO 6000 Blackwell Server Edition
Nemotron 3 Nano 30B TP1 500 500 6,616 output tokens/sec/gpu 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.2.0 NVIDIA RTX PRO 6000 Blackwell Server Edition
Nemotron 3 Nano 30B TP1 1000 4000 4,957 output tokens/sec/gpu 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.2.0 NVIDIA RTX PRO 6000 Blackwell Server Edition
Nemotron 3 Nano 30B TP1 4000 1000 5,353 output tokens/sec/gpu 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.2.0 NVIDIA RTX PRO 6000 Blackwell Server Edition

TP: Tensor Parallelism
PP: Pipeline Parallelism
DEP: Data Expert Parallelism

RTX PRO 4500 Blackwell Server Edition Inference Performance - Max Throughput

Model Parallelism Input Length Output Length Throughput GPU Server Precision Framework GPU Version
Nemotron Nano 9B v2 TP1 500 500 945 output tokens/sec/gpu 1x RTX PRO 4500 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.2.0 NVIDIA RTX PRO 4500 Blackwell Server Edition
Nemotron Nano 9B v2 TP1 1000 4000 410 output tokens/sec/gpu 1x RTX PRO 4500 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.2.0 NVIDIA RTX PRO 4500 Blackwell Server Edition
Nemotron Nano 9B v2 TP1 4000 1000 636 output tokens/sec/gpu 1x RTX PRO 4500 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.2.0 NVIDIA RTX PRO 4500 Blackwell Server Edition
Nemotron Nano 12B v2 TP1 500 500 678 output tokens/sec/gpu 1x RTX PRO 4500 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.2.0 NVIDIA RTX PRO 4500 Blackwell Server Edition
Nemotron Nano 12B v2 TP1 1000 4000 681 output tokens/sec/gpu 1x RTX PRO 4500 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.2.0 NVIDIA RTX PRO 4500 Blackwell Server Edition
Nemotron Nano 12B v2 TP1 4000 1000 566 output tokens/sec/gpu 1x RTX PRO 4500 Supermicro SYS-521GE-TNRT FP4 TensorRT-LLM 1.2.0 NVIDIA RTX PRO 4500 Blackwell Server Edition

TP: Tensor Parallelism
PP: Pipeline Parallelism
DEP: Data Expert Parallelism

H200 Inference Performance - Max Throughput

Model Parallelism Input Length Output Length Throughput GPU Server Precision Framework GPU Version
Qwen3 235B A22B DEP4 1000 1000 3,288 output tokens/sec/gpu 4x H200 DGX H200 FP8 TensorRT-LLM 1.1 NVIDIA H200
Qwen3 235B A22B DEP4 1024 8192 1,417 output tokens/sec/gpu 4x H200 DGX H200 FP8 TensorRT-LLM 1.1 NVIDIA H200
Qwen3 235B A22B DEP4 8192 1024 627 output tokens/sec/gpu 4x H200 DGX H200 FP8 TensorRT-LLM 1.1 NVIDIA H200
Qwen3 235B A22B DEP4 32768 1024 134 output tokens/sec/gpu 4x H200 DGX H200 FP8 TensorRT-LLM 1.1 NVIDIA H200
Llama v4 Maverick DEP8 1000 1000 4,146 output tokens/sec/gpu 8x H200 DGX H200 FP8 TensorRT-LLM 1.1 NVIDIA H200
Llama v4 Maverick DEP8 1024 8192 1,157 output tokens/sec/gpu 8x H200 DGX H200 FP8 TensorRT-LLM 1.1 NVIDIA H200
Llama v4 Maverick DEP8 1024 32768 679 output tokens/sec/gpu 8x H200 DGX H200 FP8 TensorRT-LLM 1.1 NVIDIA H200
Llama v4 Maverick DEP8 8192 1024 1,276 output tokens/sec/gpu 8x H200 DGX H200 FP8 TensorRT-LLM 1.1 NVIDIA H200
GPT-OSS 20B TP1 1000 1000 13,858 output tokens/sec/gpu 1x H200 DGX H200 FP8 TensorRT-LLM 1.1 NVIDIA H200
GPT-OSS 20B TP1 1024 8192 12,743 output tokens/sec/gpu 1x H200 DGX H200 FP8 TensorRT-LLM 1.1 NVIDIA H200
GPT-OSS 20B TP1 1024 32768 output tokens/sec/gpu 1x H200 DGX H200 FP8 TensorRT-LLM 1.1 NVIDIA H200
GPT-OSS 20B TP1 8192 1024 4,015 output tokens/sec/gpu 1x H200 DGX H200 FP8 TensorRT-LLM 1.1 NVIDIA H200
GPT-OSS 20B TP1 32768 1024 9,154 output tokens/sec/gpu 1x H200 DGX H200 FP8 TensorRT-LLM 1.1 NVIDIA H200

TP: Tensor Parallelism
PP: Pipeline Parallelism
DEP: Data Expert Parallelism
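For a rough sense of the generational difference, per-GPU figures from the B200 and H200 tables can be divided row by row where model, parallelism, and sequence lengths match. The sketch below does this for three such rows, with values copied from the tables. Note the assumption baked into the comparison: the B200 rows use FP4 and the H200 rows use FP8, so the ratios mix hardware and precision effects.

```python
# (model, input length, output length) -> output tokens/sec/gpu, copied from the tables.
b200 = {
    ("Qwen3 235B A22B", 1000, 1000): 5764,
    ("Qwen3 235B A22B", 8192, 1024): 1410,
    ("GPT-OSS 20B", 1000, 1000): 53812,
}
h200 = {
    ("Qwen3 235B A22B", 1000, 1000): 3288,
    ("Qwen3 235B A22B", 8192, 1024): 627,
    ("GPT-OSS 20B", 1000, 1000): 13858,
}


def speedups(new, old):
    """Ratio new/old for every configuration present in both tables."""
    return {key: new[key] / old[key] for key in new if key in old}


ratios = speedups(b200, h200)
for (model, isl, osl), ratio in ratios.items():
    print(f"{model} {isl}/{osl}: {ratio:.2f}x")
```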

H100 Inference Performance - Max Throughput

Model Parallelism Input Length Output Length Throughput GPU Server Precision Framework GPU Version
Qwen3 235B A22B DEP8 1000 1000 1,932 output tokens/sec/gpu 8x H100 DGX H100 FP8 TensorRT-LLM 1.1 H100-SXM5-80GB
Qwen3 235B A22B DEP8 1024 8192 873 output tokens/sec/gpu 8x H100 DGX H100 FP8 TensorRT-LLM 1.1 H100-SXM5-80GB
GPT-OSS 20B TP1 1000 1000 11,557 output tokens/sec/gpu 1x H100 DGX H100 FP8 TensorRT-LLM 1.1 H100-SXM5-80GB
GPT-OSS 20B TP1 1024 8192 8,617 output tokens/sec/gpu 1x H100 DGX H100 FP8 TensorRT-LLM 1.1 H100-SXM5-80GB
GPT-OSS 20B TP1 8192 1024 3,366 output tokens/sec/gpu 1x H100 DGX H100 FP8 TensorRT-LLM 1.1 H100-SXM5-80GB
GPT-OSS 20B TP1 32768 1024 785 output tokens/sec/gpu 1x H100 DGX H100 FP8 TensorRT-LLM 1.1 H100-SXM5-80GB

TP: Tensor Parallelism
PP: Pipeline Parallelism
DEP: Data Expert Parallelism

L40S Inference Performance - Max Throughput

Model Parallelism Input Length Output Length Throughput GPU Server Precision Framework GPU Version
Llama v4 Scout TP2 PP2 128 2048 1,105 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v4 Scout TP2 PP2 128 4096 707 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v4 Scout TP4 2048 128 561 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v4 Scout TP4 5000 500 307 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v4 Scout TP2 PP2 500 2000 1,093 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v4 Scout TP2 PP2 1000 1000 920 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v4 Scout TP2 PP2 1000 2000 884 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v4 Scout TP2 PP2 2048 2048 615 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v3.3 70B TP4 128 2048 1,694 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v3.3 70B TP2 PP2 128 4096 972 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v3.3 70B TP4 500 2000 1,413 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v3.3 70B TP4 1000 1000 1,498 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v3.3 70B TP4 1000 2000 1,084 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v3.3 70B TP4 2048 2048 773 output tokens/sec 4x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v3.1 8B TP1 128 128 8,471 output tokens/sec 1x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v3.1 8B TP1 128 4096 2,888 output tokens/sec 1x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v3.1 8B TP1 2048 128 1,017 output tokens/sec 1x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v3.1 8B TP1 5000 500 863 output tokens/sec 1x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v3.1 8B TP1 500 2000 4,032 output tokens/sec 1x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v3.1 8B TP1 1000 2000 3,134 output tokens/sec 1x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v3.1 8B TP1 2048 2048 2,148 output tokens/sec 1x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S
Llama v3.1 8B TP1 20000 2000 280 output tokens/sec 1x L40S Supermicro SYS-521GE-TNRT FP8 TensorRT-LLM 0.21.0 NVIDIA L40S

TP: Tensor Parallelism
PP: Pipeline Parallelism
DEP: Data Expert Parallelism

Inference Performance of NVIDIA Data Center Products

B200 Inference Performance

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
Stable Video Diffusion 1 7.32 videos/min - 8202.75 1x B200 DGX B200 26.02-py3 Mixed Synthetic TensorRT 10.15.1 NVIDIA B200
Stable Diffusion XL 1 2.89 images/sec - 507.41 1x B200 DGX B200 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA B200
BEVFusion Head 1 2464.55 images/sec 6 images/sec/watt 0.41 1x B200 DGX B200 26.02-py3 INT8 Synthetic TensorRT 10.15.1 NVIDIA B200
Flux Image Generator 1 0.47 images/sec - 2130.4 1x B200 DGX B200 26.02-py3 FP4 Synthetic TensorRT 10.15.1 NVIDIA B200
HF Swin Base 128 4,948 samples/sec 6 samples/sec/watt 25.87 1x B200 DGX B200 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA B200
HF Swin Large 128 3,223 samples/sec 3 samples/sec/watt 39.71 1x B200 DGX B200 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA B200
HF ViT Base 2048 9,480 samples/sec 10 samples/sec/watt 216.04 1x B200 DGX B200 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA B200
HF ViT Large 1024 3,381 samples/sec 4 samples/sec/watt 302.83 1x B200 DGX B200 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA B200
Yolo v10 M 1 846.98 images/sec 1.19 images/sec/watt 1.18 1x B200 DGX B200 26.02-py3 INT8 Synthetic TensorRT 10.15.1 NVIDIA B200
Yolo v11 M 1 1034.36 images/sec 1.4 images/sec/watt 0.97 1x B200 DGX B200 26.02-py3 INT8 Synthetic TensorRT 10.15.1 NVIDIA B200

HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384
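
For several rows in the table above, the reported latency is close to the batch size divided by the throughput. The sketch below checks that relationship on a few rows copied from the table. It is an observation about the published numbers, not a description of how they were measured, and the function name is hypothetical.

```python
# Rows copied from the B200 table:
# (network, batch size, throughput per second, reported latency in ms)
ROWS = [
    ("BEVFusion Head", 1, 2464.55, 0.41),
    ("HF Swin Base", 128, 4948.0, 25.87),
    ("HF ViT Large", 1024, 3381.0, 302.83),
    ("Yolo v11 M", 1, 1034.36, 0.97),
]


def estimated_latency_ms(batch_size, throughput_per_sec):
    """Time to process one batch, assuming batches run back to back."""
    return batch_size / throughput_per_sec * 1000.0


for name, batch, throughput, reported in ROWS:
    estimate = estimated_latency_ms(batch, throughput)
    print(f"{name}: estimated {estimate:.2f} ms, reported {reported} ms")
```

The ratio does not reproduce every row: Stable Diffusion XL is listed at 2.89 images/sec with a latency of 507.41 ms, so the two columns are evidently not derived from each other in that case.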

RTX PRO 6000 Blackwell Server Edition Inference Performance

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
Stable Diffusion XL 1 1.05 images/sec - 954 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT 26.01-py3 FP8 Synthetic TensorRT 10.14.1 RTX PRO 6000 BSE
Flux Image Generator 1 0.2 images/sec - 5072 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT 26.01-py3 FP4 Synthetic TensorRT 10.14.1 RTX PRO 6000 BSE
BEVFusion Head 1 1738.51 images/sec 5 images/sec/watt 0.58 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT 26.02-py3 FP8 Synthetic TensorRT 10.15.1 RTX PRO 6000 BSE
HF Swin Base 32 2,719 samples/sec 5 samples/sec/watt 11.77 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT 26.02-py3 FP8 Synthetic TensorRT 10.15.1 RTX PRO 6000 BSE
HF Swin Large 32 1,517 samples/sec 3 samples/sec/watt 21.1 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT 26.02-py3 FP8 Synthetic TensorRT 10.15.1 RTX PRO 6000 BSE
HF ViT Base 32 4,011 samples/sec - 8 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT 26.01-py3 FP8 Synthetic TensorRT 10.14.1 RTX PRO 6000 BSE
HF ViT Large 16 1,280 samples/sec - 13 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT 26.01-py3 FP8 Synthetic TensorRT 10.14.1 RTX PRO 6000 BSE
Yolo v11 M 1 465 images/sec 1 images/sec/watt 2.15 1x RTX PRO 6000 Supermicro SYS-521GE-TNRT 26.02-py3 FP8 Synthetic TensorRT 10.15.1 RTX PRO 6000 BSE

HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384

RTX PRO 4500 Blackwell Server Edition Inference Performance

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
Stable Diffusion XL 1 0.4 images/sec - 2514 1x RTX PRO 4500 Supermicro SYS-521GE-TNRT 26.01-py3 FP8 Synthetic TensorRT 10.14.1 RTX PRO 4500 BSE
Flux Image Generator 1 0.07 images/sec - 13816 1x RTX PRO 4500 Supermicro SYS-521GE-TNRT 26.01-py3 FP4 Synthetic TensorRT 10.14.1 RTX PRO 4500 BSE
HF Bert Large QAT 64 2,720 samples/sec - 24 1x RTX PRO 4500 Supermicro SYS-521GE-TNRT 26.01-py3 INT8 Synthetic TensorRT 10.14.1 RTX PRO 4500 BSE
HF Bert Large 64 1,507 samples/sec - 42 1x RTX PRO 4500 Supermicro SYS-521GE-TNRT 26.01-py3 Mixed Synthetic TensorRT 10.14.1 RTX PRO 4500 BSE
HF ViT Base 16 1,403 samples/sec - 11 1x RTX PRO 4500 Supermicro SYS-521GE-TNRT 26.01-py3 FP8 Synthetic TensorRT 10.14.1 RTX PRO 4500 BSE
HF ViT Large 4 449 samples/sec - 9 1x RTX PRO 4500 Supermicro SYS-521GE-TNRT 26.01-py3 FP8 Synthetic TensorRT 10.14.1 RTX PRO 4500 BSE

HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384

H200 Inference Performance

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
Stable Video Diffusion 1 4.83 videos/min - 12414.37 1x H200 DGX H200 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA H200
Stable Diffusion XL 1 1.61 images/sec - 760.29 1x H200 DGX H200 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA H200
BEVFusion Head 1 2006.49 images/sec 6 images/sec/watt 0.5 1x H200 DGX H200 26.02-py3 INT8 Synthetic TensorRT 10.15.1 NVIDIA H200
Flux Image Generator 1 0.2 images/sec - 5010.27 1x H200 DGX H200 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA H200
HF Swin Base 128 3,009 samples/sec 4 samples/sec/watt 42.54 1x H200 DGX H200 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA H200
HF Swin Large 128 1,821 samples/sec 3 samples/sec/watt 70.28 1x H200 DGX H200 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA H200
HF ViT Base 1024 4,943 samples/sec 7 samples/sec/watt 207.15 1x H200 DGX H200 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA H200
HF ViT Large 1024 1,702 samples/sec 2 samples/sec/watt 601.64 1x H200 DGX H200 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA H200
Yolo v10 M 1 431.92 images/sec 0.68 images/sec/watt 2.32 1x H200 DGX H200 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA H200
Yolo v11 M 1 518.04 images/sec 0.8 images/sec/watt 1.93 1x H200 DGX H200 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA H200

HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384

GH200 Inference Performance

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
BEVFusion Head 1 2006.78 images/sec 6 images/sec/watt 0.5 1x GH200 NVIDIA P3880 26.02-py3 INT8 Synthetic TensorRT 10.15.1 NVIDIA GH200
HF Swin Base 128 2,919 samples/sec 4 samples/sec/watt 43.84 1x GH200 NVIDIA P3880 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA GH200
HF Swin Large 128 1,752 samples/sec 3 samples/sec/watt 73.04 1x GH200 NVIDIA P3880 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA GH200
HF ViT Base 1024 4,728 samples/sec 7 samples/sec/watt 216.57 1x GH200 NVIDIA P3880 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA GH200
HF ViT Large 2048 1,629 samples/sec 2 samples/sec/watt 1256.97 1x GH200 NVIDIA P3880 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA GH200
Yolo v10 M 1 433.06 images/sec 0.66 images/sec/watt 2.31 1x GH200 NVIDIA P3880 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA GH200
Yolo v11 M 1 505.3 images/sec 0.8 images/sec/watt 1.98 1x GH200 NVIDIA P3880 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA GH200

HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384

H100 Inference Performance

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
Stable Video Diffusion 1 4.68 videos/min - 12811.33 1x H100 DGX H100 26.02-py3 FP8 Synthetic TensorRT 10.15.1 H100 SXM5-80GB
Stable Diffusion XL 1 1.54 images/sec - 780.31 1x H100 DGX H100 26.02-py3 FP8 Synthetic TensorRT 10.15.1 H100 SXM5-80GB
BEVFusion Head 1 1999.27 images/sec 6 images/sec/watt 0.5 1x H100 DGX H100 26.02-py3 INT8 Synthetic TensorRT 10.15.1 H100 SXM5-80GB
HF Swin Base 128 2,866 samples/sec 4 samples/sec/watt 44.67 1x H100 DGX H100 26.02-py3 FP8 Synthetic TensorRT 10.15.1 H100 SXM5-80GB
HF Swin Large 128 1,767 samples/sec 3 samples/sec/watt 72.42 1x H100 DGX H100 26.02-py3 FP8 Synthetic TensorRT 10.15.1 H100 SXM5-80GB
HF ViT Base 2048 4,864 samples/sec 7 samples/sec/watt 421.03 1x H100 DGX H100 26.02-py3 FP8 Synthetic TensorRT 10.15.1 H100 SXM5-80GB
HF ViT Large 2048 1,679 samples/sec 2 samples/sec/watt 1219.62 1x H100 DGX H100 26.02-py3 FP8 Synthetic TensorRT 10.15.1 H100 SXM5-80GB
Yolo v10 M 1 403.68 images/sec 0.68 images/sec/watt 2.48 1x H100 DGX H100 26.02-py3 FP8 Synthetic TensorRT 10.15.1 H100 SXM5-80GB
Yolo v11 M 1 476 images/sec 0.76 images/sec/watt 2.1 1x H100 DGX H100 26.02-py3 FP8 Synthetic TensorRT 10.15.1 H100 SXM5-80GB

HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384

L40S Inference Performance

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
BEVFusion Head 1 1958.07 images/sec 7 images/sec/watt 0.51 1x L40S Supermicro SYS-521GE-TNRT 26.02-py3 INT8 Synthetic TensorRT 10.15.1 NVIDIA L40S
HF Swin Base 32 1,396 samples/sec 4 samples/sec/watt 22.92 1x L40S Supermicro SYS-521GE-TNRT 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA L40S
HF Swin Large 32 716 samples/sec 2 samples/sec/watt 44.72 1x L40S Supermicro SYS-521GE-TNRT 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA L40S
HF ViT Base 1024 1,662 samples/sec 5 samples/sec/watt 616.09 1x L40S Supermicro SYS-521GE-TNRT 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA L40S
HF ViT Large 1024 597 samples/sec 2 samples/sec/watt 1716.6 1x L40S Supermicro SYS-521GE-TNRT 26.02-py3 FP8 Synthetic TensorRT 10.15.1 NVIDIA L40S
Yolo v10 M 1 274.78 images/sec 0.79 images/sec/watt 3.64 1x L40S Supermicro SYS-521GE-TNRT 26.02-py3 INT8 Synthetic TensorRT 10.15.1 NVIDIA L40S
Yolo v11 M 1 310 images/sec 0.9 images/sec/watt 3.23 1x L40S Supermicro SYS-521GE-TNRT 26.02-py3 INT8 Synthetic TensorRT 10.15.1 NVIDIA L40S

HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384


Training to Convergence

Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. Training to convergence is the most reliable way to test whether AI systems are ready to be deployed in the field and deliver meaningful results.

AI Pipeline

NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.