AI Training
Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. Measuring time to train to that accuracy is the best methodology for testing whether AI systems are ready to be deployed in the field and deliver meaningful results.
NVIDIA Performance on MLPerf 6.0 Training Benchmarks
NVIDIA Performance on MLPerf 6.0’s AI Benchmarks: Single Node, Closed Division
| Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf ID | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| Llama2-70B-Lora | 11.6 | 0.925 Eval loss | 4x GB300 | 1xXE9712x4GB300 | 6.0-0048 | Mixed | SCROLLS GovReport | PyTorch | NVIDIA Blackwell Ultra GPU (GB300) |
| Llama2-70B-Lora | 6.6 | 0.925 Eval loss | 8x B300 | XA_NB3I-E12 | 6.0-0038 | Mixed | SCROLLS GovReport | PyTorch | NVIDIA Blackwell Ultra GPU (B300-SXM-270GB) |
| Llama2-70B-Lora | 7.9 | 0.925 Eval loss | 8x B200 | AS-4126GS-NBR-LCC | 6.0-0107 | Mixed | SCROLLS GovReport | NVIDIA NeMo | NVIDIA Blackwell GPU (B200-SXM-180GB) |
| Llama3.1 8B | 123.7 | 3.3 log perplexity | 4x GB300 | 1xXE9712x4GB300 | 6.0-0048 | Mixed | c4/en/3.0.1 | PyTorch | NVIDIA Blackwell Ultra GPU (GB300) |
| Llama3.1 8B | 72.0 | 3.3 log perplexity | 8x B300 | Nebius B300 n1 (8x B300-SXM-270GB) | 6.0-0023 | Mixed | c4/en/3.0.1 | NVIDIA NeMo | NVIDIA Blackwell Ultra GPU (B300-SXM-270GB) |
| Llama3.1 8B | 82.2 | 3.3 log perplexity | 8x B200 | 1xXE9780x8B200-SXM-180GB | 6.0-0052 | Mixed | c4/en/3.0.1 | DGL | NVIDIA Blackwell GPU (B200-SXM-180GB) |
| GPT-OSS 20B | 152.7 | 3.34 log perplexity | 4x GB300 | 1xXE9712x4GB300 | 6.0-0048 | Mixed | c4/en/3.0.1 | PyTorch | NVIDIA Blackwell Ultra GPU (GB300) |
| GPT-OSS 20B | 83.6 | 3.34 log perplexity | 8x B300 | Nebius B300 n1 (8x B300-SXM-270GB) | 6.0-0023 | Mixed | c4/en/3.0.1 | NVIDIA NeMo | NVIDIA Blackwell Ultra GPU (B300-SXM-270GB) |
| GPT-OSS 20B | 96.5 | 3.34 log perplexity | 8x B200 | Lambda-1-Click-Cluster_B200_n1 | 6.0-0008 | Mixed | c4/en/3.0.1 | PyTorch | NVIDIA Blackwell GPU (B200-SXM-180GB) |
| DLRM-dcnv2 | 2.2 | 0.80275 AUC | 8x B300 | G894-SD3-AAX7 | 6.0-0065 | Mixed | Criteo 3.5TB Click Logs (multi-hot variant) | NVIDIA Merlin HugeCTR | NVIDIA Blackwell Ultra GPU (B300-SXM-270GB) |
| DLRM-dcnv2 | 2.3 | 0.80275 AUC | 8x B200 | SYS-A22GA-NBRT | 6.0-0113 | Mixed | Criteo 3.5TB Click Logs (multi-hot variant) | NVIDIA Merlin HugeCTR | NVIDIA Blackwell GPU (B200-SXM-180GB) |
NVIDIA Performance on MLPerf 6.0’s AI Benchmarks: Multi Node, Closed Division
| Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf ID | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 405B | 58.3 | 5.6 log perplexity | 512x GB300 | Theia-cmh (8x NVIDIA GB300 NVL72) | 6.0-0013 | Mixed | c4/en/3.0.1 | NVIDIA NeMo | NVIDIA Blackwell Ultra GPU (GB300) |
| Llama 3.1 405B | 18.5 | 5.6 log perplexity | 2,048x GB300 | Theia-cmh (32x NVIDIA GB300 NVL72) | 6.0-0012 | Mixed | c4/en/3.0.1 | NVIDIA NeMo | NVIDIA Blackwell Ultra GPU (GB300) |
| Llama 3.1 405B | 9.8 | 5.6 log perplexity | 4,096x GB300 | CoreWeave_GB300_1024x4_nccl2297 | 6.0-0004 | Mixed | c4/en/3.0.1 | PyTorch | NVIDIA Blackwell Ultra GPU (GB300) |
| Llama 3.1 405B | 18.8 | 5.6 log perplexity | 2,560x GB200 | Tyche-hsg (40x NVIDIA GB200 NVL72) | 6.0-0019 | Mixed | c4/en/3.0.1 | NVIDIA NeMo | NVIDIA Blackwell GPU (GB200) |
| Llama 3.1 405B | 10.0 | 5.6 log perplexity | 5,120x GB200 | Tyche-hsg (80x NVIDIA GB200 NVL72) | 6.0-0021 | Mixed | c4/en/3.0.1 | NVIDIA NeMo | NVIDIA Blackwell GPU (GB200) |
| Llama 3.1 405B | 7.1 | 5.6 log perplexity | 8,192x GB200 | Azure GB200 (128x NVIDIA GB200 NVL72) | 6.0-0001 | Mixed | c4/en/3.0.1 | NVIDIA NeMo | NVIDIA Blackwell GPU (GB200) |
| DeepSeek v3 671B | 33.4 | 3.6 log perplexity | 256x GB300 | Theia (4x NVIDIA GB300 NVL72) | 6.0-0099 | Mixed | c4/en/3.0.1 | NVIDIA NeMo | NVIDIA Blackwell Ultra GPU (GB300) |
| DeepSeek v3 671B | 17.5 | 3.6 log perplexity | 512x GB300 | Theia (8x NVIDIA GB300 NVL72) | 6.0-0101 | Mixed | c4/en/3.0.1 | NVIDIA NeMo | NVIDIA Blackwell Ultra GPU (GB300) |
| DeepSeek v3 671B | 5.5 | 3.6 log perplexity | 2,048x GB300 | CoreWeave_GB300_512x4 | 6.0-0006 | Mixed | c4/en/3.0.1 | PyTorch | NVIDIA Blackwell Ultra GPU (GB300) |
| DeepSeek v3 671B | 3.1 | 3.6 log perplexity | 4,096x GB300 | CoreWeave_GB300_1024x4 | 6.0-0003 | Mixed | c4/en/3.0.1 | PyTorch | NVIDIA Blackwell Ultra GPU (GB300) |
| DeepSeek v3 671B | 2.0 | 3.6 log perplexity | 8,192x GB300 | CoreWeave_GB300_2048x4 | 6.0-0005 | Mixed | c4/en/3.0.1 | PyTorch | NVIDIA Blackwell Ultra GPU (GB300) |
| DeepSeek v3 671B | 49.4 | 3.6 log perplexity | 256x GB200 | NVIDIA GB200 NVL72 (64 nodes, 4 NVLink domains) | 6.0-0007 | Mixed | c4/en/3.0.1 | PyTorch | NVIDIA Blackwell GPU (GB200) |
| DeepSeek v3 671B | 27.6 | 3.6 log perplexity | 512x GB200 | Tyche-hsg (8x NVIDIA GB200 NVL72) | 6.0-0022 | Mixed | c4/en/3.0.1 | NVIDIA NeMo | NVIDIA Blackwell GPU (GB200) |
| DeepSeek v3 671B | 7.8 | 3.6 log perplexity | 2,048x GB200 | Tyche-hsg (32x NVIDIA GB200 NVL72) | 6.0-0018 | Mixed | c4/en/3.0.1 | NVIDIA NeMo | NVIDIA Blackwell GPU (GB200) |
| DeepSeek v3 671B | 4.8 | 3.6 log perplexity | 4,096x GB200 | Tyche-hsg (64x NVIDIA GB200 NVL72) | 6.0-0020 | Mixed | c4/en/3.0.1 | NVIDIA NeMo | NVIDIA Blackwell GPU (GB200) |
| DeepSeek v3 671B | 3.3 | 3.6 log perplexity | 8,192x GB200 | Tyche-hsg (128x NVIDIA GB200 NVL72) | 6.0-0014 | Mixed | c4/en/3.0.1 | NVIDIA NeMo | NVIDIA Blackwell GPU (GB200) |
| Llama2-70B-Lora | 5.6 | 0.925 Eval loss | 8x GB300 | BM.GPU.GB300.4 | 6.0-0031 | Mixed | SCROLLS GovReport | NVIDIA NeMo | NVIDIA Blackwell Ultra GPU (GB300) |
| Llama2-70B-Lora | 2.5 | 0.925 Eval loss | 32x GB300 | NVIDIA GB300 NVL72 by HPE | 6.0-0076 | Mixed | SCROLLS GovReport | NVIDIA NeMo/PyTorch | NVIDIA Blackwell Ultra GPU (GB300) |
| Llama2-70B-Lora | 1.3 | 0.925 Eval loss | 64x GB300 | D75U-1U_ngpu64 | 6.0-0104 | Mixed | SCROLLS GovReport | NVIDIA NeMo | NVIDIA Blackwell Ultra GPU (GB300) |
| Llama2-70B-Lora | 0.4 | 0.925 Eval loss | 512x GB300 | Theia (8x NVIDIA GB300 NVL72) | 6.0-0101 | Mixed | SCROLLS GovReport | NVIDIA NeMo | NVIDIA Blackwell Ultra GPU (GB300) |
| Llama2-70B-Lora | 4.0 | 0.925 Eval loss | 16x B300 | Cisco UCS C880A-8xB300-SXM-288G | 6.0-0040 | Mixed | SCROLLS GovReport | NVIDIA NeMo | NVIDIA Blackwell Ultra GPU (B300-SXM-270GB) |
| Llama2-70B-Lora | 5.3 | 0.925 Eval loss | 16x GB200 | NVIDIA GB200 NVL72 by HPE | 6.0-0071 | Mixed | SCROLLS GovReport | NVIDIA NeMo | NVIDIA Blackwell GPU (GB200) |
| Llama2-70B-Lora | 2.9 | 0.925 Eval loss | 32x GB200 | NVIDIA GB200 NVL72 by HPE | 6.0-0072 | Mixed | SCROLLS GovReport | NVIDIA NeMo | NVIDIA Blackwell GPU (GB200) |
| Llama2-70B-Lora | 6.3 | 0.925 Eval loss | 16x B200 | HPE ProLiant Compute XD685 | 6.0-0069 | Mixed | SCROLLS GovReport | NVIDIA NeMo | NVIDIA Blackwell GPU (B200-SXM-180GB) |
| Llama 3.1 8B | 63.5 | 3.3 log perplexity | 8x GB300 | NVIDIA GB300 NVL72 by HPE | 6.0-0078 | Mixed | c4/en/3.0.1 | NVIDIA NeMo | NVIDIA Blackwell Ultra GPU (GB300) |
| Llama 3.1 8B | 33.4 | 3.3 log perplexity | 16x GB300 | NVIDIA GB300 NVL72 by HPE | 6.0-0075 | Mixed | c4/en/3.0.1 | NVIDIA NeMo | NVIDIA Blackwell Ultra GPU (GB300) |
| Llama 3.1 8B | 20.2 | 3.3 log perplexity | 32x GB300 | 8xXE9712x4GB300 | 6.0-0060 | Mixed | c4/en/3.0.1 | PyTorch | NVIDIA Blackwell Ultra GPU (GB300) |
| Llama 3.1 8B | 11.6 | 3.3 log perplexity | 72x GB300 | Lambda_GB300_n18 | 6.0-0009 | Mixed | c4/en/3.0.1 | PyTorch | NVIDIA Blackwell Ultra GPU (GB300) |
| Llama 3.1 8B | 4.6 | 3.3 log perplexity | 512x GB300 | Theia (8x NVIDIA GB300 NVL72) | 6.0-0101 | Mixed | c4/en/3.0.1 | NVIDIA NeMo | NVIDIA Blackwell Ultra GPU (GB300) |
| Llama 3.1 8B | 43.3 | 3.3 log perplexity | 16x B300 | 2xXE9780x8B300-SXM-270GB | 6.0-0058 | Mixed | c4/en/3.0.1 | PyTorch | NVIDIA Blackwell Ultra GPU (B300-SXM-270GB) |
| Llama 3.1 8B | 14.4 | 3.3 log perplexity | 64x B300 | Nebius B300 n8 (64x B300-SXM-270GB) | 6.0-0025 | Mixed | c4/en/3.0.1 | NVIDIA NeMo | NVIDIA Blackwell Ultra GPU (B300-SXM-270GB) |
| Llama 3.1 8B | 79.7 | 3.3 log perplexity | 8x GB200 | NVIDIA GB200 NVL72 by HPE | 6.0-0074 | Mixed | c4/en/3.0.1 | NVIDIA NeMo | NVIDIA Blackwell GPU (GB200) |
| Llama 3.1 8B | 49.0 | 3.3 log perplexity | 16x GB200 | NVIDIA GB200 NVL72 by HPE | 6.0-0071 | Mixed | c4/en/3.0.1 | NVIDIA NeMo | NVIDIA Blackwell GPU (GB200) |
| Llama 3.1 8B | 39.0 | 3.3 log perplexity | 32x GB200 | NVIDIA GB200 NVL72 by HPE | 6.0-0072 | Mixed | c4/en/3.0.1 | NVIDIA NeMo | NVIDIA Blackwell GPU (GB200) |
| Llama 3.1 8B | 4.5 | 3.3 log perplexity | 1,024x GB200 | Tyche-hsg (16x NVIDIA GB200 NVL72) | 6.0-0015 | Mixed | c4/en/3.0.1 | NVIDIA NeMo | NVIDIA Blackwell GPU (GB200) |
| Llama 3.1 8B | 52.3 | 3.3 log perplexity | 16x B200 | HPE ProLiant Compute XD685 | 6.0-0069 | Mixed | c4/en/3.0.1 | NVIDIA NeMo | NVIDIA Blackwell GPU (B200-SXM-180GB) |
| Llama 3.1 8B | 16.5 | 3.3 log perplexity | 64x B200 | CoreWeave_B200_8x8 | 6.0-0002 | Mixed | c4/en/3.0.1 | PyTorch | NVIDIA Blackwell GPU (B200-SXM-180GB) |
| GPT-OSS 20B | 74.1 | 3.34 log perplexity | 8x GB300 | NVIDIA GB300 NVL72 by HPE | 6.0-0078 | Mixed | c4/en/3.0.1 | NVIDIA NeMo | NVIDIA Blackwell Ultra GPU (GB300) |
| GPT-OSS 20B | 43.2 | 3.34 log perplexity | 16x GB300 | NVIDIA GB300 NVL72 by HPE | 6.0-0075 | Mixed | c4/en/3.0.1 | NVIDIA NeMo | NVIDIA Blackwell Ultra GPU (GB300) |
| GPT-OSS 20B | 27.9 | 3.34 log perplexity | 32x GB300 | NVIDIA GB300 NVL72 by HPE | 6.0-0076 | Mixed | c4/en/3.0.1 | NVIDIA NeMo | NVIDIA Blackwell Ultra GPU (GB300) |
| GPT-OSS 20B | 18.1 | 3.34 log perplexity | 72x GB300 | DLB2-CB3 | 6.0-0063 | Mixed | c4/en/3.0.1 | NVIDIA NeMo | NVIDIA Blackwell Ultra GPU (GB300) |
| GPT-OSS 20B | 7.4 | 3.34 log perplexity | 512x GB300 | Theia (8x NVIDIA GB300 NVL72) | 6.0-0101 | Mixed | c4/en/3.0.1 | NVIDIA NeMo | NVIDIA Blackwell Ultra GPU (GB300) |
| GPT-OSS 20B | 53.9 | 3.34 log perplexity | 16x B300 | 2xXE9780x8B300-SXM-270GB | 6.0-0058 | Mixed | c4/en/3.0.1 | PyTorch | NVIDIA Blackwell Ultra GPU (B300-SXM-270GB) |
| GPT-OSS 20B | 94.4 | 3.34 log perplexity | 8x GB200 | Tyche-hsg (1x NVIDIA GB200 NVL72) | 6.0-0017 | Mixed | c4/en/3.0.1 | NVIDIA NeMo | NVIDIA Blackwell GPU (GB200) |
| GPT-OSS 20B | 19.2 | 3.34 log perplexity | 72x GB200 | Tyche-hsg (1x NVIDIA GB200 NVL72) | 6.0-0016 | Mixed | c4/en/3.0.1 | NVIDIA NeMo | NVIDIA Blackwell GPU (GB200) |
| GPT-OSS 20B | 27.0 | 3.34 log perplexity | 64x B200 | CoreWeave_B200_8x8 | 6.0-0002 | Mixed | c4/en/3.0.1 | PyTorch | NVIDIA Blackwell GPU (B200-SXM-180GB) |
| Flux1 | 112.4 | 0.586 Eval loss | 16x GB300 | 4xXE9712x4GB300 | 6.0-0059 | Mixed | CC12M | PyTorch | NVIDIA Blackwell Ultra GPU (GB300) |
| Flux1 | 65.0 | 0.586 Eval loss | 32x GB300 | BM.GPU.GB300.4 | 6.0-0029 | Mixed | CC12M | NVIDIA NeMo | NVIDIA Blackwell Ultra GPU (GB300) |
| Flux1 | 36.5 | 0.586 Eval loss | 72x GB300 | DLB2-CB3 | 6.0-0063 | Mixed | CC12M | NVIDIA NeMo | NVIDIA Blackwell Ultra GPU (GB300) |
| Flux1 | 17.1 | 0.586 Eval loss | 512x GB300 | Theia (8x NVIDIA GB300 NVL72) | 6.0-0100 | Mixed | CC12M | NVIDIA NeMo | NVIDIA Blackwell Ultra GPU (GB300) |
| Flux1 | 77.5 | 0.586 Eval loss | 32x B300 | Nebius B300 n4 (32x B300-SXM-270GB) | 6.0-0024 | Mixed | CC12M | NVIDIA NeMo | NVIDIA Blackwell Ultra GPU (B300-SXM-270GB) |
| Flux1 | 46.7 | 0.586 Eval loss | 64x B300 | Nebius B300 n8 (64x B300-SXM-270GB) | 6.0-0025 | Mixed | CC12M | NVIDIA NeMo | NVIDIA Blackwell Ultra GPU (B300-SXM-270GB) |
| DLRM-dcnv2 | 2.1 | 0.80275 AUC | 8x GB300 | Theia (1x NVIDIA GB300 NVL72) | 6.0-0097 | Mixed | Criteo 3.5TB Click Logs (multi-hot variant) | NVIDIA Merlin HugeCTR | NVIDIA Blackwell Ultra GPU (GB300) |
| DLRM-dcnv2 | 0.7 | 0.80275 AUC | 64x GB300 | DLB2-CB3 | 6.0-0062 | Mixed | Criteo 3.5TB Click Logs (multi-hot variant) | NVIDIA NeMo | NVIDIA Blackwell Ultra GPU (GB300) |
| DLRM-dcnv2 | 1.9 | 0.80275 AUC | 16x B300 | 2xXE9780x8B300-SXM-270GB | 6.0-0058 | Mixed | Criteo 3.5TB Click Logs (multi-hot variant) | PyTorch | NVIDIA Blackwell Ultra GPU (B300-SXM-270GB) |
MLPerf™ v6.0 Training Closed: 6.0-0003, 6.0-0005, 6.0-0010, 6.0-0013, 6.0-0017, 6.0-0023, 6.0-0024, 6.0-0025, 6.0-0027, 6.0-0028, 6.0-0030, 6.0-0035, 6.0-0037, 6.0-0038, 6.0-0039, 6.0-0041, 6.0-0042, 6.0-0043, 6.0-0065, 6.0-0066, 6.0-0067, 6.0-0070, 6.0-0073, 6.0-0079, 6.0-0083, 6.0-0084, 6.0-0085, 6.0-0086, 6.0-0087, 6.0-0088, 6.0-0089, 6.0-0090, 6.0-0091, 6.0-0094, 6.0-0095, 6.0-0096, 6.0-0097, 6.0-0098, 6.0-0100, 6.0-0101, 6.0-0102, 6.0-0103, 6.0-0104, 6.0-0105, 6.0-0106, 6.0-0107, 6.0-0113, 6.0-0117 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
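The multi-node results invite a rough check of how time to train scales with GPU count. The sketch below takes the DeepSeek v3 671B figures on GB300 from the table above and divides each speedup by the corresponding increase in GPUs. The function and variable names are illustrative, not part of any MLPerf tooling. The rows come from different submissions (different systems and frameworks), and time to train is not a pure throughput measure, so the ratios are only an approximate indication of scaling.

```python
# Time-to-train figures (minutes) copied from the multi-node table:
# DeepSeek v3 671B on GB300, keyed by GPU count.
DEEPSEEK_V3_GB300 = {256: 33.4, 512: 17.5, 2048: 5.5, 4096: 3.1, 8192: 2.0}


def scaling_efficiency(times_by_gpus: dict[int, float]) -> dict[int, float]:
    """Return speedup divided by GPU ratio, relative to the smallest run.

    A value of 1.0 means perfectly linear scaling; lower values mean
    each added GPU contributed less than the baseline GPUs did.
    """
    base_gpus = min(times_by_gpus)
    base_time = times_by_gpus[base_gpus]
    result = {}
    for gpus, minutes in sorted(times_by_gpus.items()):
        speedup = base_time / minutes
        result[gpus] = speedup / (gpus / base_gpus)
    return result


if __name__ == "__main__":
    for gpus, eff in scaling_efficiency(DEEPSEEK_V3_GB300).items():
        print(f"{gpus:>5} GPUs: {eff:.0%} of linear scaling")
```

The same function can be applied to any other network in the table by swapping in its GPU counts and times.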
LLM Training Performance on NVIDIA Data Center Products
GB300 Training Performance
| Framework | Model | Throughput per GPU | GPU | Server | Container Version | Sequence Length | TP | PP | CP | EP | Precision | Global Batch Size | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NVIDIA NeMo | DeepSeek v3 | 6,422 tokens/sec/gpu | 256x GB300 | DGX GB300 | - | 4096 | 1 | 2 | 1 | 32 | FP8 | 15360 | NVIDIA GB300 |
| NVIDIA NeMo | GPT-OSS 120B | 19,275 tokens/sec/gpu | 64x GB300 | DGX GB300 | nemo:26.04 | 4096 | 1 | 1 | 1 | 64 | BF16 | 1280 | NVIDIA GB300 |
| NVIDIA NeMo | Qwen3 30B a3B | 31,470 tokens/sec/gpu | 8x GB300 | DGX GB300 | nemo:26.04 | 4096 | 1 | 1 | 1 | 8 | FP8 | 512 | NVIDIA GB300 |
| NVIDIA NeMo | Qwen3 235B a22B | 6,994 tokens/sec/gpu | 256x GB300 | DGX GB300 | nemo:26.04 | 4096 | 1 | 4 | 1 | 32 | FP8 | 4096 | NVIDIA GB300 |
| NVIDIA NeMo | Nemotron 3 Nano | 38,102 tokens/sec/gpu | 8x GB300 | DGX GB300 | nemo:26.04 | 8192 | 1 | 1 | 1 | 8 | FP8 | 512 | NVIDIA GB300 |
| NVIDIA NeMo | Nemotron 3 Super | 9,623 tokens/sec/gpu | 64x GB300 | DGX GB300 | nemo:26.04 | 8192 | 1 | 1 | 1 | 64 | FP4 | 512 | NVIDIA GB300 |
| NVIDIA NeMo | Kimi K2 | 5,332 tokens/sec/gpu | 256x GB300 | DGX GB300 | nemo:26.04 | 4096 | 1 | 4 | 1 | 64 | FP8 | 4096 | NVIDIA GB300 |
TP: Tensor Parallelism
PP: Pipeline Parallelism
CP: Context Parallelism
EP: Expert Parallelism
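The table columns can be combined into a few derived quantities. The sketch below computes aggregate throughput, approximate time per optimizer step, and the implied data-parallel size for one row. It rests on assumptions not stated in the source: that Global Batch Size counts sequences of Sequence Length tokens each, and that the layout is Megatron-style, where GPU count equals TP x PP x CP x DP and expert parallelism partitions the MoE layers across those same GPUs rather than multiplying the count. The `Run` class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One row of a training performance table (field names are illustrative)."""
    model: str
    tokens_per_sec_per_gpu: float
    gpus: int
    seq_len: int
    global_batch_size: int
    tp: int
    pp: int
    cp: int


def aggregate_tokens_per_sec(run: Run) -> float:
    """Total tokens processed per second across all GPUs."""
    return run.tokens_per_sec_per_gpu * run.gpus


def seconds_per_step(run: Run) -> float:
    """Approximate wall-clock time for one global batch.

    Assumes the global batch size counts sequences of seq_len tokens.
    """
    tokens_per_step = run.global_batch_size * run.seq_len
    return tokens_per_step / aggregate_tokens_per_sec(run)


def data_parallel_size(run: Run) -> int:
    """Implied data-parallel size, assuming GPUs = TP * PP * CP * DP."""
    model_parallel = run.tp * run.pp * run.cp
    if run.gpus % model_parallel != 0:
        raise ValueError("GPU count is not divisible by TP * PP * CP")
    return run.gpus // model_parallel


# DeepSeek v3 row from the GB300 table.
deepseek_gb300 = Run("DeepSeek v3", 6422, 256, 4096, 15360, tp=1, pp=2, cp=1)

if __name__ == "__main__":
    print(f"aggregate: {aggregate_tokens_per_sec(deepseek_gb300):,.0f} tokens/sec")
    print(f"step time: {seconds_per_step(deepseek_gb300):.2f} s")
    print(f"data-parallel size: {data_parallel_size(deepseek_gb300)}")
```

If the batch-size or layout assumptions do not hold for a given container, the step time and data-parallel size will differ; the aggregate throughput is a plain product of two table columns.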
B300 Training Performance
| Framework | Model | Throughput per GPU | GPU | Server | Container Version | Sequence Length | TP | PP | CP | EP | Precision | Global Batch Size | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NVIDIA NeMo | DeepSeek v3 | 3,131 tokens/sec/gpu | 256x B300 | DGX B300 | nemo:26.04 | 4096 | 1 | 4 | 1 | 64 | FP8 | 4096 | NVIDIA B300 |
| NVIDIA NeMo | GPT-OSS 120B | 15,114 tokens/sec/gpu | 64x B300 | DGX B300 | nemo:26.04 | 4096 | 1 | 1 | 1 | 8 | BF16 | 1280 | NVIDIA B300 |
| NVIDIA NeMo | Qwen3 235B a22B | 4,865 tokens/sec/gpu | 256x B300 | DGX B300 | nemo:26.04 | 4096 | 1 | 8 | 1 | 8 | FP8 | 8192 | NVIDIA B300 |
| NVIDIA NeMo | Nemotron 3 Super | 7,047 tokens/sec/gpu | 64x B300 | DGX B300 | nemo:26.04 | 8192 | 1 | 1 | 1 | 8 | FP4 | 512 | NVIDIA B300 |
B200 Training Performance
| Framework | Model | Throughput per GPU | GPU | Server | Container Version | Sequence Length | TP | PP | CP | EP | Precision | Global Batch Size | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NVIDIA NeMo | DeepSeek v3 | 2,815 tokens/sec/gpu | 256x B200 | DGX B200 | nemo:26.04 | 4096 | 1 | 16 | 1 | 8 | FP8 | 4096 | NVIDIA B200 |
| NVIDIA NeMo | GPT-OSS 120B | 13,045 tokens/sec/gpu | 64x B200 | DGX B200 | nemo:26.04 | 4096 | 1 | 1 | 1 | 8 | BF16 | 4096 | NVIDIA B200 |
| NVIDIA NeMo | Qwen3 30B a3B | 26,859 tokens/sec/gpu | 8x B200 | DGX B200 | nemo:26.04 | 4096 | 1 | 1 | 1 | 8 | FP8 | 512 | NVIDIA B200 |
H100 Training Performance
| Framework | Model | Throughput per GPU | GPU | Server | Container Version | Sequence Length | TP | PP | CP | EP | Precision | Global Batch Size | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NVIDIA NeMo | GPT-OSS 120B | 5,810 tokens/sec/gpu | 64x H100 | DGX H100 | nemo:26.04 | 4096 | 1 | 4 | 1 | 8 | BF16 | 1280 | H100-SXM5-80GB |
| NVIDIA NeMo | Qwen3 30B a3B | 8,901 tokens/sec/gpu | 16x H100 | DGX H100 | nemo:26.04 | 4096 | 1 | 1 | 1 | 16 | FP8 | 1024 | H100-SXM5-80GB |
| NVIDIA NeMo | Qwen3 235B a22B | 1,686 tokens/sec/gpu | 256x H100 | DGX H100 | nemo:26.04 | 4096 | 2 | 8 | 1 | 32 | FP8 | 8192 | H100-SXM5-80GB |
| NVIDIA NeMo | Nemotron 3 Nano | 14,507 tokens/sec/gpu | 16x H100 | DGX H100 | nemo:26.04 | 8192 | 1 | 1 | 1 | 8 | FP8 | 1024 | H100-SXM5-80GB |
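GPT-OSS 120B is the one model that appears in all four training performance tables, each time on 64 GPUs in BF16 at sequence length 4096, so it offers a rough cross-generation comparison. The sketch below divides each per-GPU throughput by the H100 figure. The dictionary and function names are illustrative. The runs do not share identical settings (parallelism layout and global batch size differ between tables), so the ratios should be read as indicative rather than as a controlled comparison.

```python
# Per-GPU throughput (tokens/sec/gpu) for GPT-OSS 120B, copied from the
# GB300, B300, B200, and H100 training performance tables.
GPT_OSS_120B_TOKENS_PER_SEC_PER_GPU = {
    "GB300": 19_275,
    "B300": 15_114,
    "B200": 13_045,
    "H100": 5_810,
}


def speedup_vs_baseline(throughput: dict[str, float], baseline: str) -> dict[str, float]:
    """Return each platform's throughput as a multiple of the baseline's."""
    base = throughput[baseline]
    return {gpu: value / base for gpu, value in throughput.items()}


if __name__ == "__main__":
    ratios = speedup_vs_baseline(GPT_OSS_120B_TOKENS_PER_SEC_PER_GPU, "H100")
    for gpu, ratio in ratios.items():
        print(f"{gpu:>5}: {ratio:.2f}x H100 per-GPU throughput")
```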
AI Inference
Real-world inferencing demands high throughput and low latencies with maximum efficiency across use cases. An industry-leading solution lets customers quickly deploy AI models into real-world production with the highest performance from data center to edge.
AI Pipeline
NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.