The days of raw speed being the only metric that matters are behind us. Now it’s about throughput, efficiency, and economics at scale. As AI evolves from one-shot answers to multi-step reasoning, demand for inference is surging, and so is the importance of its underlying economics. Because each query now generates far more tokens, this shift significantly boosts compute demand. Metrics such as tokens per watt, cost per million tokens, and tokens per second per user are crucial alongside raw throughput. For power-limited AI factories, NVIDIA's continuous software improvements translate into higher token revenue over time, underscoring the importance of these technological advancements.
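The efficiency metrics named above are simple functions of sustained throughput. A minimal sketch of how they relate, using made-up instance price, power, and concurrency figures purely for illustration (none of these numbers come from the results below):

```python
# Illustrative arithmetic only: how throughput converts into the
# economics metrics discussed above. All inputs are hypothetical.

def cost_per_million_tokens(tokens_per_sec: float, dollars_per_hour: float) -> float:
    """Cost to generate one million tokens at a given hourly instance price."""
    tokens_per_hour = tokens_per_sec * 3600
    return dollars_per_hour / tokens_per_hour * 1_000_000

def tokens_per_watt(tokens_per_sec: float, power_watts: float) -> float:
    """Energy efficiency: sustained tokens per second per watt drawn."""
    return tokens_per_sec / power_watts

def tokens_per_sec_per_user(tokens_per_sec: float, concurrent_users: int) -> float:
    """Per-user responsiveness at a given concurrency."""
    return tokens_per_sec / concurrent_users

# Hypothetical system: 50,000 tokens/sec, $98/hour, 10.2 kW, 512 users.
print(round(cost_per_million_tokens(50_000, 98.0), 3))   # $ per 1M tokens
print(round(tokens_per_watt(50_000, 10_200), 2))         # tokens/sec/W
print(round(tokens_per_sec_per_user(50_000, 512), 1))    # tokens/sec/user
```

The same throughput number can look very different through each lens, which is why a single-metric comparison can mislead.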
Pareto curves illustrate how NVIDIA Blackwell provides the best balance across the full spectrum of production priorities, including cost, energy efficiency, throughput, and responsiveness. Optimizing systems for a single scenario can limit deployment flexibility, leading to inefficiencies at other points on the curve. NVIDIA’s full-stack design approach ensures efficiency and value across multiple real-life production scenarios. Blackwell’s leadership stems from its extreme hardware-software co-design, embodying a full-stack architecture built for speed, efficiency, and scalability.
To learn more, read the blog Mixture of Experts Powers the Most Intelligent Frontier AI Models, Runs 10x Faster on NVIDIA Blackwell NVL72.
Explore the methodology used to obtain these results and learn how to replicate them by running the Benchmarking Recipes yourself.
| Network | Throughput | GPU | Server | GPU Version | QSL Size | Target Accuracy | Dataset |
|---|---|---|---|---|---|---|---|
| DeepSeek R1 | 2,494,310 tokens/sec | 288x GB300 | NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) | NVIDIA GB300 | 4388 | 99% of FP16 (exact match 81.9132%) | mlperf_deepseek_r1 |
| | 486,141 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 4388 | 99% of FP16 (exact match 81.9132%) | mlperf_deepseek_r1 |
| | 70,326 tokens/sec | 8x B300 | NVIDIA DGX B300 (8x B300-SXM-270GB, TensorRT) | NVIDIA B300 | 4388 | 99% of FP16 (exact match 81.9132%) | mlperf_deepseek_r1 |
| | 58,582 tokens/sec | 8x B200 | Nebius B200 n1 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 4388 | 99% of FP16 (exact match 81.9132%) | mlperf_deepseek_r1 |
| gpt-oss 120B | 1,046,150 tokens/sec | 72x GB300 | Nebius GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) | NVIDIA GB300 | 6396 | 99% of 83.13% | AIME25, GPQA Diamond, LiveCodeBench v6 |
| | 879,542 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 6396 | 99% of 83.13% | AIME25, GPQA Diamond, LiveCodeBench v6 |
| | 111,496 tokens/sec | 8x B300 | Cisco UCS C880A M8 (8x NVIDIA B300-SXM-270GB, TensorRT) | NVIDIA B300 | 6396 | 99% of 83.13% | AIME25, GPQA Diamond, LiveCodeBench v6 |
| | 93,071 tokens/sec | 8x B200 | LLM-D v0.5.0, OpenShift 4.20.12, NVIDIA 8x B200-SXM-180GB | NVIDIA B200 | 6396 | 99% of 83.13% | AIME25, GPQA Diamond, LiveCodeBench v6 |
| Qwen3-VL 235B | 61 tokens/sec | 4x GB300 | NVIDIA GB300 NVL72 (4x GB300-288GB_aarch64, TensorRT) | NVIDIA GB300 | 48289 | 99% of BF16 (Category Hierarchical F1 Score >= 0.7824) | Shopify Product Catalogue |
| | 44 tokens/sec | 4x GB200 | NVIDIA GB200 NVL72 (4x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 48289 | 99% of BF16 (Category Hierarchical F1 Score >= 0.7824) | Shopify Product Catalogue |
| | 78 tokens/sec | 8x B300 | Nebius B300 n1 (8x B300-SXM-270GB, TensorRT) | NVIDIA B300 | 48289 | 99% of BF16 (Category Hierarchical F1 Score >= 0.7824) | Shopify Product Catalogue |
| | 79 tokens/sec | 8x B200 | Dell B200, 8x B200-SXM-180GB, RHEL 10.1, vLLM CentML:mlperf-inf-mm-q3vl-v6.0 | NVIDIA B200 | 48289 | 99% of BF16 (Category Hierarchical F1 Score >= 0.7824) | Shopify Product Catalogue |
| Llama3.1 405B | 19,512 tokens/sec | 72x GB300 | NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) | NVIDIA GB300 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Samples from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| | 15,462 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Samples from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| | 1,971 tokens/sec | 8x B300 | Cisco UCS C880A M8 (8x NVIDIA B300-SXM-270GB, TensorRT) | NVIDIA B300 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Samples from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| | 1,350 tokens/sec | 8x B200 | NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Samples from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama2 70B | 1,126,850 tokens/sec | 72x GB300 | NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) | NVIDIA GB300 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) | OpenOrca (max_seq_len=1024) |
| | 888,054 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) | OpenOrca (max_seq_len=1024) |
| | 112,954 tokens/sec | 8x B300 | NVIDIA DGX B300 (8x B300-SXM-270GB, TensorRT) | NVIDIA B300 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) | OpenOrca (max_seq_len=1024) |
| | 104,572 tokens/sec | 8x B200 | HPE ProLiant Compute XD685 (8x NVIDIA B200 180GB, TensorRT) | NVIDIA B200 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) | OpenOrca (max_seq_len=1024) |
| Llama3.1 8B | 166,745 tokens/sec | 8x B300 | XA NB3I-E12 | NVIDIA B300 | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=8167644) | CNN Dailymail (v3.0.0, max_seq_len=2048) |
| | 160,403 tokens/sec | 8x B200 | NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=8167644) | CNN Dailymail (v3.0.0, max_seq_len=2048) |
| Wan2.2 | 0.037 samples/sec | 4x GB300 | NVIDIA GB300 NVL72 (4x GB300-288GB_aarch64, TensorRT) | NVIDIA GB300 | 248 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | VBench prompts |
| | 0.027 samples/sec | 4x GB200 | NVIDIA GB200 NVL72 (4x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 248 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | VBench prompts |
| | 0.059 samples/sec | 8x B300 | NVIDIA DGX B300 (8x B300-SXM-270GB, TensorRT) | NVIDIA B300 | 248 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | VBench prompts |
| | 0.046 samples/sec | 8x B200 | NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 248 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | VBench prompts |
| DLRMv3 | 104,637 samples/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 34996 | 99% of FP32 and 99.9% of FP32 (AUC=80.31%) | Synthetic Streaming 100B Dataset |
| | 10,737 samples/sec | 8x B200 | Camarero PDI200A2HG-810 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 34996 | 99% of FP32 and 99.9% of FP32 (AUC=80.31%) | Synthetic Streaming 100B Dataset |
| Whisper | 50,562 samples/sec | 8x B300 | NVIDIA DGX B300 (8x B300-SXM-270GB, TensorRT) | NVIDIA B300 | 1633 | 99% of FP32 and 99.9% of FP32 (WER=2.0671%) | LibriSpeech |
| | 49,327 samples/sec | 8x B200 | NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 1633 | 99% of FP32 and 99.9% of FP32 (WER=2.0671%) | LibriSpeech |
| Network | Throughput | GPU | Server | GPU Version | QSL Size | Target Accuracy | MLPerf Server Latency Constraints (ms) | Dataset |
|---|---|---|---|---|---|---|---|---|
| DeepSeek R1 | 1,555,110 tokens/sec | 288x GB300 | NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) | NVIDIA GB300 | 4388 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 2000 ms/80 ms | mlperf_deepseek_r1 |
| | 336,106 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 4388 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 2000 ms/80 ms | mlperf_deepseek_r1 |
| | 60,413 tokens/sec | 8x B300 | Nebius B300 n1 (8x B300-SXM-270GB, TensorRT) | NVIDIA B300 | 4388 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 2000 ms/80 ms | mlperf_deepseek_r1 |
| | 51,693 tokens/sec | 8x B200 | Nebius B200 n1 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 4388 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 2000 ms/80 ms | mlperf_deepseek_r1 |
| gpt-oss 120B | 1,096,770 tokens/sec | 72x GB300 | Nebius GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) | NVIDIA GB300 | 6396 | 99% of 83.13% | TTFT/TPOT: 3000 ms/80 ms | AIME25, GPQA Diamond, LiveCodeBench v6 |
| | 899,218 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 6396 | 99% of 83.13% | TTFT/TPOT: 3000 ms/80 ms | AIME25, GPQA Diamond, LiveCodeBench v6 |
| | 110,655 tokens/sec | 8x B300 | Cisco UCS C880A M8 (8x NVIDIA B300-SXM-270GB, TensorRT) | NVIDIA B300 | 6396 | 99% of 83.13% | TTFT/TPOT: 3000 ms/80 ms | AIME25, GPQA Diamond, LiveCodeBench v6 |
| | 87,444 tokens/sec | 8x B200 | Nebius B200 n1 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 6396 | 99% of 83.13% | TTFT/TPOT: 3000 ms/80 ms | AIME25, GPQA Diamond, LiveCodeBench v6 |
| Qwen3-VL 235B | 43 tokens/sec | 4x GB300 | Nebius GB300 NVL72 (4x GB300-288GB_aarch64, TensorRT) | NVIDIA GB300 | 48289 | 99% of BF16 (Category Hierarchical F1 Score >= 0.7824) | 12 s | Shopify Product Catalogue |
| | 38 tokens/sec | 4x GB200 | NVIDIA GB200 NVL72 (4x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 48289 | 99% of BF16 (Category Hierarchical F1 Score >= 0.7824) | 12 s | Shopify Product Catalogue |
| | 45 tokens/sec | 8x B300 | Nebius B300 n1 (8x B300-SXM-270GB, TensorRT) | NVIDIA B300 | 48289 | 99% of BF16 (Category Hierarchical F1 Score >= 0.7824) | 12 s | Shopify Product Catalogue |
| | 68 tokens/sec | 8x B200 | Dell B200, 8x B200-SXM-180GB, RHEL 10.1, vLLM CentML:mlperf-inf-mm-q3vl-v6.0 | NVIDIA B200 | 48289 | 99% of BF16 (Category Hierarchical F1 Score >= 0.7824) | 12 s | Shopify Product Catalogue |
| Llama3.1 405B | 18,628 tokens/sec | 72x GB300 | NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) | NVIDIA GB300 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Samples from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| | 14,134 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Samples from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| | 1,484 tokens/sec | 8x B300 | QuantaGrid D75H-10U (8x B300-SXM-270GB, TensorRT) | NVIDIA B300 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Samples from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| | 984 tokens/sec | 8x B200 | NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Samples from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama2 70B | 868,278 tokens/sec | 72x GB300 | NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) | NVIDIA GB300 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (max_seq_len=1024) |
| | 810,104 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (max_seq_len=1024) |
| | 108,392 tokens/sec | 8x B300 | PowerEdge XE9780L (8x B300-SXM-270GB, TensorRT) | NVIDIA B300 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (max_seq_len=1024) |
| | 103,627 tokens/sec | 8x B200 | HPE ProLiant Compute XD685 (8x NVIDIA B200 180GB, TensorRT) | NVIDIA B200 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (max_seq_len=1024) |
| Llama3.1 8B | 148,067 tokens/sec | 8x B300 | XA NB3I-E12 | NVIDIA B300 | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=8167644) | TTFT/TPOT: 2000 ms/100 ms | CNN Dailymail (v3.0.0, max_seq_len=2048) |
| | 131,270 tokens/sec | 8x B200 | HPE ProLiant Compute XD685 (8x NVIDIA B200 180GB, TensorRT) | NVIDIA B200 | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=8167644) | TTFT/TPOT: 2000 ms/100 ms | CNN Dailymail (v3.0.0, max_seq_len=2048) |
| Wan2.2** | 31 seconds | 4x GB300 | NVIDIA GB300 NVL72 (4x GB300-288GB_aarch64, TensorRT) | NVIDIA GB300 | 248 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | N/A | VBench prompts |
| | 40 seconds | 4x GB200 | NVIDIA GB200 NVL72 (4x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 248 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | N/A | VBench prompts |
| | 21 seconds | 8x B300 | G894-SD3-AAX7 | NVIDIA B300 | 248 | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | N/A | VBench prompts |
| | 25 seconds | 8x B200 | NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 248 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | N/A | VBench prompts |
| DLRMv3 | 99,997 queries/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 34996 | 99% of FP32 (AUC=80.31%) | 80 ms | Synthetic Streaming 100B Dataset |
| | 10,007 queries/sec | 8x B200 | NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 34996 | 99% of FP32 (AUC=80.31%) | 80 ms | Synthetic Streaming 100B Dataset |
| Network | Throughput | GPU | Server | GPU Version | QSL Size | Target Accuracy | MLPerf Server Latency Constraints (ms) | Dataset |
|---|---|---|---|---|---|---|---|---|
| DeepSeek R1 | 250,634 tokens/sec | 72x GB300 | NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) | NVIDIA GB300 | 4388 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 1500 ms/15 ms | mlperf_deepseek_r1 |
| | 240,318 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 4388 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 1500 ms/15 ms | mlperf_deepseek_r1 |
| | 4,935 tokens/sec | 8x B300 | G894-SD3-AAX7 | NVIDIA B300 | 4388 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 1500 ms/15 ms | mlperf_deepseek_r1 |
| gpt-oss 120B | 677,199 tokens/sec | 72x GB300 | NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) | NVIDIA GB300 | 6396 | 99% of 83.13% | TTFT/TPOT: 2000 ms/20 ms | AIME25, GPQA Diamond, LiveCodeBench v6 |
| | 624,929 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 6396 | 99% of 83.13% | TTFT/TPOT: 2000 ms/20 ms | AIME25, GPQA Diamond, LiveCodeBench v6 |
| | 26,006 tokens/sec | 8x B300 | XA NB3I-E12 | NVIDIA B300 | 6396 | 99% of 83.13% | TTFT/TPOT: 2000 ms/20 ms | AIME25, GPQA Diamond, LiveCodeBench v6 |
| | 13,155 tokens/sec | 8x B200 | Nebius B200 n1 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 6396 | 99% of 83.13% | TTFT/TPOT: 2000 ms/20 ms | AIME25, GPQA Diamond, LiveCodeBench v6 |
| Llama3.1 405B | 18,365 tokens/sec | 72x GB300 | NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) | NVIDIA GB300 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Samples from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) | TTFT/TPOT: 4500 ms/80 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| | 14,010 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Samples from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) | TTFT/TPOT: 4500 ms/80 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| | 765 tokens/sec | 8x B300 | G894-SD3-AAX7 | NVIDIA B300 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Samples from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) | TTFT/TPOT: 4500 ms/80 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama2 70B | 814,128 tokens/sec | 72x GB300 | NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) | NVIDIA GB300 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) | TTFT/TPOT: 450 ms/40 ms | OpenOrca (max_seq_len=1024) |
| | 754,855 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) | TTFT/TPOT: 450 ms/40 ms | OpenOrca (max_seq_len=1024) |
| | 70,724 tokens/sec | 8x B300 | PowerEdge XE9780L (8x B300-SXM-270GB, TensorRT) | NVIDIA B300 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) | TTFT/TPOT: 450 ms/40 ms | OpenOrca (max_seq_len=1024) |
| | 61,300 tokens/sec | 8x B200 | HPE ProLiant Compute XD685 (8x NVIDIA B200 180GB, TensorRT) | NVIDIA B200 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) | TTFT/TPOT: 450 ms/40 ms | OpenOrca (max_seq_len=1024) |
| Llama3.1 8B | 128,633 tokens/sec | 8x B300 | G894-SD3-AAX7 | NVIDIA B300 | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=8167644) | TTFT/TPOT: 500 ms/30 ms | CNN Dailymail (v3.0.0, max_seq_len=2048) |
| | 128,750 tokens/sec | 8x B200 | NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=8167644) | TTFT/TPOT: 500 ms/30 ms | CNN Dailymail (v3.0.0, max_seq_len=2048) |
**The primary metric for Wan2.2 in the Server scenario is measured in seconds (lower is better).
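The TTFT/TPOT constraints in the tables above bound time to first token and time per output token for each request. A minimal sketch of how such a check is typically computed; the timestamps below are illustrative, not official MLPerf values:

```python
# Hedged sketch: per-request TTFT (time to first token) and TPOT
# (time per output token) against a server-scenario-style constraint.
# All timestamps and limits here are made-up example inputs.

def ttft_ms(request_start: float, first_token_time: float) -> float:
    """Time from request submission to the first generated token, in ms."""
    return (first_token_time - request_start) * 1000

def tpot_ms(first_token_time: float, last_token_time: float,
            num_output_tokens: int) -> float:
    """Average inter-token latency over the decode phase, in ms."""
    return (last_token_time - first_token_time) * 1000 / (num_output_tokens - 1)

def meets_constraints(ttft: float, tpot: float,
                      ttft_limit: float, tpot_limit: float) -> bool:
    return ttft <= ttft_limit and tpot <= tpot_limit

# Example request: starts at t=0 s, first token at 1.2 s, last of 101
# output tokens at 8.2 s; constraint TTFT/TPOT = 2000 ms / 80 ms.
ttft = ttft_ms(0.0, 1.2)        # ~1200 ms
tpot = tpot_ms(1.2, 8.2, 101)   # ~70 ms
print(meets_constraints(ttft, tpot, 2000, 80))
```

Tighter constraints (as in the interactive scenario) force smaller batches and lower achievable throughput, which is why the same system posts lower tokens/sec under stricter TTFT/TPOT limits.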
MLPerf™ v6.0 Inference Closed Division. NVIDIA platform results from the following entries: 6.0-0006, 6.0-0010, 6.0-0024, 6.0-0039, 6.0-0040, 6.0-0048, 6.0-0062, 6.0-0072, 6.0-0073, 6.0-0074, 6.0-0075, 6.0-0076, 6.0-0077, 6.0-0078, 6.0-0080, 6.0-0081, 6.0-0083, 6.0-0084, 6.0-0085, 6.0-0089, 6.0-0091, 6.0-0094, 6.0-0098. MLPerf name and logo are trademarks. See
https://mlcommons.org/ for more information.
For MLPerf™ data across various scenarios, click here.
For MLPerf™ latency constraints, click here.
| Model | Parallelism | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3 235B A22B | DEP4 | 1000 | 1000 | 5,764 output tokens/sec/gpu | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Qwen3 235B A22B | DEP4 | 1024 | 8192 | 3,389 output tokens/sec/gpu | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Qwen3 235B A22B | DEP4 | 1024 | 32768 | 1,255 output tokens/sec/gpu | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Qwen3 235B A22B | DEP4 | 8192 | 1024 | 1,410 output tokens/sec/gpu | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Qwen3 235B A22B | DEP4 | 32768 | 1024 | 319 output tokens/sec/gpu | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Qwen3 30B A3B | TP1 | 1000 | 1000 | 26,971 output tokens/sec/gpu | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Qwen3 30B A3B | TP1 | 1024 | 8192 | 13,497 output tokens/sec/gpu | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Qwen3 30B A3B | TP1 | 1024 | 32768 | 4,494 output tokens/sec/gpu | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Qwen3 30B A3B | TP1 | 8192 | 1024 | 5,735 output tokens/sec/gpu | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Qwen3 30B A3B | TP1 | 32768 | 1024 | 1,265 output tokens/sec/gpu | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Llama v4 Maverick | DEP4 | 1000 | 1000 | 11,337 output tokens/sec/gpu | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Llama v4 Maverick | DEP4 | 1024 | 8192 | 5,174 output tokens/sec/gpu | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Llama v4 Maverick | DEP4 | 1024 | 32768 | 2,204 output tokens/sec/gpu | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Llama v4 Maverick | DEP4 | 8192 | 1024 | 3,279 output tokens/sec/gpu | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Llama v4 Maverick | DEP4 | 32768 | 1024 | 859 output tokens/sec/gpu | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| GPT-OSS 20B | TP1 | 1000 | 1000 | 53,812 output tokens/sec/gpu | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| GPT-OSS 20B | TP1 | 1024 | 8192 | 34,702 output tokens/sec/gpu | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| GPT-OSS 20B | TP1 | 1024 | 32768 | 14,589 output tokens/sec/gpu | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| GPT-OSS 20B | TP1 | 8192 | 1024 | 11,904 output tokens/sec/gpu | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| GPT-OSS 20B | TP1 | 32768 | 1024 | 2,645 output tokens/sec/gpu | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
TP: Tensor Parallelism
PP: Pipeline Parallelism
DEP: Data Expert Parallelism
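Throughput in the table above is reported per GPU; aggregate system throughput is that figure multiplied by the GPU count in the parallel group. A minimal arithmetic sketch, using two rows from the table:

```python
# Simple sketch: converting the per-GPU figures reported above into
# aggregate throughput. Row values are taken from the table; the
# multiplication itself is the only logic.

def aggregate_throughput(per_gpu_tokens_per_sec: float, num_gpus: int) -> float:
    """Total output tokens/sec across all GPUs in the parallel group."""
    return per_gpu_tokens_per_sec * num_gpus

# Qwen3 235B A22B, DEP4 on 4x B200 at 1000/1000 ISL/OSL:
print(aggregate_throughput(5_764, 4))   # 23,056 total output tokens/sec

# GPT-OSS 20B, TP1 on a single B200 (per-GPU equals aggregate):
print(aggregate_throughput(53_812, 1))  # 53,812
```

Normalizing per GPU makes configurations with different parallelism degrees (TP1 vs DEP4 vs DEP8) directly comparable on efficiency.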
| Model | Parallelism | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3 235B A22B | DEP2 PP2 | 1000 | 1000 | 1,731 output tokens/sec/gpu | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.1 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Qwen3 235B A22B | DEP8 | 1024 | 8192 | 711 output tokens/sec/gpu | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.1 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Qwen3 235B A22B | DEP2 PP2 | 32768 | 1024 | 70 output tokens/sec/gpu | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.1 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Qwen3 30B A3B | TP1 | 1000 | 1000 | 9,938 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.1 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Qwen3 30B A3B | TP1 | 1024 | 8192 | 3,621 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.1 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Qwen3 30B A3B | TP1 | 8192 | 1024 | 1,914 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.1 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Qwen3 30B A3B | TP1 | 32768 | 1024 | 374 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.1 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Nemotron Nano 9B v2 | TP1 | 500 | 500 | 1,711 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Nemotron Nano 9B v2 | TP1 | 1000 | 4000 | 790 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Nemotron Nano 9B v2 | TP1 | 4000 | 1000 | 1,238 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Nemotron Nano 12B v2 | TP1 | 500 | 500 | 1,229 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Nemotron Nano 12B v2 | TP1 | 1000 | 4000 | 1,202 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Nemotron Nano 12B v2 | TP1 | 4000 | 1000 | 1,071 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Nemotron 3 Nano 30B | TP1 | 500 | 500 | 6,616 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Nemotron 3 Nano 30B | TP1 | 1000 | 4000 | 4,957 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Nemotron 3 Nano 30B | TP1 | 4000 | 1000 | 5,353 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Model | Parallelism | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| Nemotron Nano 9B v2 | TP1 | 500 | 500 | 945 output tokens/sec/gpu | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 4500 Blackwell Server Edition |
| Nemotron Nano 9B v2 | TP1 | 1000 | 4000 | 410 output tokens/sec/gpu | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 4500 Blackwell Server Edition |
| Nemotron Nano 9B v2 | TP1 | 4000 | 1000 | 636 output tokens/sec/gpu | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 4500 Blackwell Server Edition |
| Nemotron Nano 12B v2 | TP1 | 500 | 500 | 678 output tokens/sec/gpu | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 4500 Blackwell Server Edition |
| Nemotron Nano 12B v2 | TP1 | 1000 | 4000 | 681 output tokens/sec/gpu | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 4500 Blackwell Server Edition |
| Nemotron Nano 12B v2 | TP1 | 4000 | 1000 | 566 output tokens/sec/gpu | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 4500 Blackwell Server Edition |
| Model | Parallelism | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3 235B A22B | DEP4 | 1000 | 1000 | 3,288 output tokens/sec/gpu | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| Qwen3 235B A22B | DEP4 | 1024 | 8192 | 1,417 output tokens/sec/gpu | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| Qwen3 235B A22B | DEP4 | 8192 | 1024 | 627 output tokens/sec/gpu | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| Qwen3 235B A22B | DEP4 | 32768 | 1024 | 134 output tokens/sec/gpu | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| Llama v4 Maverick | DEP8 | 1000 | 1000 | 4,146 output tokens/sec/gpu | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| Llama v4 Maverick | DEP8 | 1024 | 8192 | 1,157 output tokens/sec/gpu | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| Llama v4 Maverick | DEP8 | 1024 | 32768 | 679 output tokens/sec/gpu | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| Llama v4 Maverick | DEP8 | 8192 | 1024 | 1,276 output tokens/sec/gpu | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| GPT-OSS 20B | TP1 | 1000 | 1000 | 13,858 output tokens/sec/gpu | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| GPT-OSS 20B | TP1 | 1024 | 8192 | 12,743 output tokens/sec/gpu | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| GPT-OSS 20B | TP1 | 1024 | 32768 | - | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| GPT-OSS 20B | TP1 | 8192 | 1024 | 4,015 output tokens/sec/gpu | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| GPT-OSS 20B | TP1 | 32768 | 1024 | 9,154 output tokens/sec/gpu | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
TP: Tensor Parallelism
PP: Pipeline Parallelism
DEP: Data Expert Parallelism
| Model | Parallelism | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3 235B A22B | DEP8 | 1000 | 1000 | 1,932 output tokens/sec/gpu | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.1 | H100-SXM5-80GB |
| Qwen3 235B A22B | DEP8 | 1024 | 8192 | 873 output tokens/sec/gpu | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.1 | H100-SXM5-80GB |
| GPT-OSS 20B | TP1 | 1000 | 1000 | 11,557 output tokens/sec/gpu | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.1 | H100-SXM5-80GB |
| GPT-OSS 20B | TP1 | 1024 | 8192 | 8,617 output tokens/sec/gpu | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.1 | H100-SXM5-80GB |
| GPT-OSS 20B | TP1 | 8192 | 1024 | 3,366 output tokens/sec/gpu | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.1 | H100-SXM5-80GB |
| GPT-OSS 20B | TP1 | 32768 | 1024 | 785 output tokens/sec/gpu | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.1 | H100-SXM5-80GB |
TP: Tensor Parallelism
PP: Pipeline Parallelism
DEP: Data Expert Parallelism
| Model | Parallelism | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| Llama v4 Scout | TP2 PP2 | 128 | 2048 | 1,105 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | TP2 PP2 | 128 | 4096 | 707 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | TP4 | 2048 | 128 | 561 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | TP4 | 5000 | 500 | 307 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | TP2 PP2 | 500 | 2000 | 1,093 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | TP2 PP2 | 1000 | 1000 | 920 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | TP2 PP2 | 1000 | 2000 | 884 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | TP2 PP2 | 2048 | 2048 | 615 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | TP4 | 128 | 2048 | 1,694 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | TP2 PP2 | 128 | 4096 | 972 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | TP4 | 500 | 2000 | 1,413 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | TP4 | 1000 | 1000 | 1,498 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | TP4 | 1000 | 2000 | 1,084 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | TP4 | 2048 | 2048 | 773 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | TP1 | 128 | 128 | 8,471 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | TP1 | 128 | 4096 | 2,888 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | TP1 | 2048 | 128 | 1,017 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | TP1 | 5000 | 500 | 863 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | TP1 | 500 | 2000 | 4,032 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | TP1 | 1000 | 2000 | 3,134 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | TP1 | 2048 | 2048 | 2,148 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | TP1 | 20000 | 2000 | 280 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
TP: Tensor Parallelism
PP: Pipeline Parallelism
DEP: Data Expert Parallelism
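Per-GPU throughput translates directly into serving economics via the cost-per-million-tokens metric. A hedged sketch of that conversion; the $2/GPU-hour rate below is an assumed illustrative price, not a quoted one:

```python
def cost_per_million_tokens(tokens_per_sec_per_gpu: float,
                            gpu_hour_price_usd: float) -> float:
    # Tokens produced by one GPU in one hour of sustained serving:
    tokens_per_hour = tokens_per_sec_per_gpu * 3600
    # Dollars spent per million output tokens:
    return gpu_hour_price_usd / tokens_per_hour * 1_000_000

# e.g. a GPU sustaining 13,858 output tokens/sec at an assumed $2/GPU-hour:
print(round(cost_per_million_tokens(13858, 2.0), 4))  # about $0.04 per million tokens
```

The same formula shows why software-only throughput gains lower cost per token on fixed hardware: the denominator grows while the price term is unchanged.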
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Video Diffusion | 1 | 7.32 videos/min | - | 8202.75 | 1x B200 | DGX B200 | 26.02-py3 | Mixed | Synthetic | TensorRT 10.15.1 | NVIDIA B200 |
| Stable Diffusion XL | 1 | 2.89 images/sec | - | 507.41 | 1x B200 | DGX B200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA B200 |
| BEVFusion Head | 1 | 2464.55 images/sec | 6 images/sec/watt | 0.41 | 1x B200 | DGX B200 | 26.02-py3 | INT8 | Synthetic | TensorRT 10.15.1 | NVIDIA B200 |
| Flux Image Generator | 1 | 0.47 images/sec | - | 2130.4 | 1x B200 | DGX B200 | 26.02-py3 | FP4 | Synthetic | TensorRT 10.15.1 | NVIDIA B200 |
| HF Swin Base | 128 | 4,948 samples/sec | 6 samples/sec/watt | 25.87 | 1x B200 | DGX B200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA B200 |
| HF Swin Large | 128 | 3,223 samples/sec | 3 samples/sec/watt | 39.71 | 1x B200 | DGX B200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA B200 |
| HF ViT Base | 2048 | 9,480 samples/sec | 10 samples/sec/watt | 216.04 | 1x B200 | DGX B200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA B200 |
| HF ViT Large | 1024 | 3,381 samples/sec | 4 samples/sec/watt | 302.83 | 1x B200 | DGX B200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA B200 |
| Yolo v10 M | 1 | 846.98 images/sec | 1.19 images/sec/watt | 1.18 | 1x B200 | DGX B200 | 26.02-py3 | INT8 | Synthetic | TensorRT 10.15.1 | NVIDIA B200 |
| Yolo v11 M | 1 | 1034.36 images/sec | 1.4 images/sec/watt | 0.97 | 1x B200 | DGX B200 | 26.02-py3 | INT8 | Synthetic | TensorRT 10.15.1 | NVIDIA B200 |
HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384
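The Efficiency column is throughput divided by power draw, so an approximate wattage can be backed out of each row. Note the efficiency values are coarsely rounded, so the result is only indicative. A sketch using the HF ViT Base row from the table above:

```python
def implied_power_watts(throughput: float, efficiency_per_watt: float) -> float:
    # Efficiency = throughput / power, so power = throughput / efficiency.
    return throughput / efficiency_per_watt

# HF ViT Base on B200: 9,480 samples/sec at ~10 samples/sec/watt
print(round(implied_power_watts(9480, 10)))  # roughly 948 W (rounded efficiency)
```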
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion XL | 1 | 1.05 images/sec | - | 954 | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | 26.01-py3 | FP8 | Synthetic | TensorRT 10.14.1 | RTX PRO 6000 BSE |
| Flux Image Generator | 1 | 0.2 images/sec | - | 5072 | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | 26.01-py3 | FP4 | Synthetic | TensorRT 10.14.1 | RTX PRO 6000 BSE |
| BEVFusion Head | 1 | 1738.51 images/sec | 5 images/sec/watt | 0.58 | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | RTX PRO 6000 BSE |
| HF Swin Base | 32 | 2,719 samples/sec | 5 samples/sec/watt | 11.77 | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | RTX PRO 6000 BSE |
| HF Swin Large | 32 | 1,517 samples/sec | 3 samples/sec/watt | 21.1 | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | RTX PRO 6000 BSE |
| HF ViT Base | 32 | 4,011 samples/sec | - | 8 | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | 26.01-py3 | FP8 | Synthetic | TensorRT 10.14.1 | RTX PRO 6000 BSE |
| HF ViT Large | 16 | 1,280 samples/sec | - | 13 | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | 26.01-py3 | FP8 | Synthetic | TensorRT 10.14.1 | RTX PRO 6000 BSE |
| Yolo v11 M | 1 | 465 images/sec | 1 images/sec/watt | 2.15 | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | RTX PRO 6000 BSE |
HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion XL | 1 | 0.4 images/sec | - | 2514 | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | 26.01-py3 | FP8 | Synthetic | TensorRT 10.14.1 | RTX PRO 4500 BSE |
| Flux Image Generator | 1 | 0.07 images/sec | - | 13816 | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | 26.01-py3 | FP4 | Synthetic | TensorRT 10.14.1 | RTX PRO 4500 BSE |
| HF Bert Large QAT | 64 | 2,720 samples/sec | - | 24 | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | 26.01-py3 | INT8 | Synthetic | TensorRT 10.14.1 | RTX PRO 4500 BSE |
| HF Bert Large | 64 | 1,507 samples/sec | - | 42 | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | 26.01-py3 | Mixed | Synthetic | TensorRT 10.14.1 | RTX PRO 4500 BSE |
| HF ViT Base | 16 | 1,403 samples/sec | - | 11 | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | 26.01-py3 | FP8 | Synthetic | TensorRT 10.14.1 | RTX PRO 4500 BSE |
| HF ViT Large | 4 | 449 samples/sec | - | 9 | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | 26.01-py3 | FP8 | Synthetic | TensorRT 10.14.1 | RTX PRO 4500 BSE |
HF ViT Base, HF ViT Large Sequence Length = 384
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Video Diffusion | 1 | 4.83 videos/min | - | 12414.37 | 1x H200 | DGX H200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA H200 |
| Stable Diffusion XL | 1 | 1.61 images/sec | - | 760.29 | 1x H200 | DGX H200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA H200 |
| BEVFusion Head | 1 | 2006.49 images/sec | 6 images/sec/watt | 0.5 | 1x H200 | DGX H200 | 26.02-py3 | INT8 | Synthetic | TensorRT 10.15.1 | NVIDIA H200 |
| Flux Image Generator | 1 | 0.2 images/sec | - | 5010.27 | 1x H200 | DGX H200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA H200 |
| HF Swin Base | 128 | 3,009 samples/sec | 4 samples/sec/watt | 42.54 | 1x H200 | DGX H200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA H200 |
| HF Swin Large | 128 | 1,821 samples/sec | 3 samples/sec/watt | 70.28 | 1x H200 | DGX H200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA H200 |
| HF ViT Base | 1024 | 4,943 samples/sec | 7 samples/sec/watt | 207.15 | 1x H200 | DGX H200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA H200 |
| HF ViT Large | 1024 | 1,702 samples/sec | 2 samples/sec/watt | 601.64 | 1x H200 | DGX H200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA H200 |
| Yolo v10 M | 1 | 431.92 images/sec | 0.68 images/sec/watt | 2.32 | 1x H200 | DGX H200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA H200 |
| Yolo v11 M | 1 | 518.04 images/sec | 0.8 images/sec/watt | 1.93 | 1x H200 | DGX H200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA H200 |
HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BEVFusion Head | 1 | 2006.78 images/sec | 6 images/sec/watt | 0.5 | 1x GH200 | NVIDIA P3880 | 26.02-py3 | INT8 | Synthetic | TensorRT 10.15.1 | NVIDIA GH200 |
| HF Swin Base | 128 | 2,919 samples/sec | 4 samples/sec/watt | 43.84 | 1x GH200 | NVIDIA P3880 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA GH200 |
| HF Swin Large | 128 | 1,752 samples/sec | 3 samples/sec/watt | 73.04 | 1x GH200 | NVIDIA P3880 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA GH200 |
| HF ViT Base | 1024 | 4,728 samples/sec | 7 samples/sec/watt | 216.57 | 1x GH200 | NVIDIA P3880 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA GH200 |
| HF ViT Large | 2048 | 1,629 samples/sec | 2 samples/sec/watt | 1256.97 | 1x GH200 | NVIDIA P3880 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA GH200 |
| Yolo v10 M | 1 | 433.06 images/sec | 0.66 images/sec/watt | 2.31 | 1x GH200 | NVIDIA P3880 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA GH200 |
| Yolo v11 M | 1 | 505.3 images/sec | 0.8 images/sec/watt | 1.98 | 1x GH200 | NVIDIA P3880 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA GH200 |
HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Video Diffusion | 1 | 4.68 videos/min | - | 12811.33 | 1x H100 | DGX H100 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | H100 SXM5-80GB |
| Stable Diffusion XL | 1 | 1.54 images/sec | - | 780.31 | 1x H100 | DGX H100 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | H100 SXM5-80GB |
| BEVFusion Head | 1 | 1999.27 images/sec | 6 images/sec/watt | 0.5 | 1x H100 | DGX H100 | 26.02-py3 | INT8 | Synthetic | TensorRT 10.15.1 | H100 SXM5-80GB |
| HF Swin Base | 128 | 2,866 samples/sec | 4 samples/sec/watt | 44.67 | 1x H100 | DGX H100 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | H100 SXM5-80GB |
| HF Swin Large | 128 | 1,767 samples/sec | 3 samples/sec/watt | 72.42 | 1x H100 | DGX H100 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | H100 SXM5-80GB |
| HF ViT Base | 2048 | 4,864 samples/sec | 7 samples/sec/watt | 421.03 | 1x H100 | DGX H100 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | H100 SXM5-80GB |
| HF ViT Large | 2048 | 1,679 samples/sec | 2 samples/sec/watt | 1219.62 | 1x H100 | DGX H100 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | H100 SXM5-80GB |
| Yolo v10 M | 1 | 403.68 images/sec | 0.68 images/sec/watt | 2.48 | 1x H100 | DGX H100 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | H100 SXM5-80GB |
| Yolo v11 M | 1 | 476 images/sec | 0.76 images/sec/watt | 2.1 | 1x H100 | DGX H100 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | H100 SXM5-80GB |
HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BEVFusion Head | 1 | 1958.07 images/sec | 7 images/sec/watt | 0.51 | 1x L40S | Supermicro SYS-521GE-TNRT | 26.02-py3 | INT8 | Synthetic | TensorRT 10.15.1 | NVIDIA L40S |
| HF Swin Base | 32 | 1,396 samples/sec | 4 samples/sec/watt | 22.92 | 1x L40S | Supermicro SYS-521GE-TNRT | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA L40S |
| HF Swin Large | 32 | 716 samples/sec | 2 samples/sec/watt | 44.72 | 1x L40S | Supermicro SYS-521GE-TNRT | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA L40S |
| HF ViT Base | 1024 | 1,662 samples/sec | 5 samples/sec/watt | 616.09 | 1x L40S | Supermicro SYS-521GE-TNRT | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA L40S |
| HF ViT Large | 1024 | 597 samples/sec | 2 samples/sec/watt | 1716.6 | 1x L40S | Supermicro SYS-521GE-TNRT | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA L40S |
| Yolo v10 M | 1 | 274.78 images/sec | 0.79 images/sec/watt | 3.64 | 1x L40S | Supermicro SYS-521GE-TNRT | 26.02-py3 | INT8 | Synthetic | TensorRT 10.15.1 | NVIDIA L40S |
| Yolo v11 M | 1 | 310 images/sec | 0.9 images/sec/watt | 3.23 | 1x L40S | Supermicro SYS-521GE-TNRT | 26.02-py3 | INT8 | Synthetic | TensorRT 10.15.1 | NVIDIA L40S |
HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384
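For the batched vision benchmarks, throughput and latency are two views of the same measurement: steady-state throughput is approximately batch size divided by batch latency. A quick sketch checking this against the HF ViT Base row of the B200 table earlier in this section:

```python
def throughput_from_latency(batch_size: int, latency_ms: float) -> float:
    # For batched inference, steady-state throughput = batch / latency.
    return batch_size / (latency_ms / 1000.0)

# HF ViT Base on B200: batch 2048 at 216.04 ms -> ~9,480 samples/sec,
# consistent with the 9,480 samples/sec reported in the table.
print(round(throughput_from_latency(2048, 216.04)))
```

The same check holds for the other batched rows, which is a useful sanity test when reproducing these results with the benchmarking recipes.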
Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. This is the best methodology for testing whether AI systems are ready to be deployed in the field and deliver meaningful results.
NVIDIA Riva is an application framework for multimodal conversational AI services that delivers real-time performance on GPUs.