AI Training
Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. Measuring the time to reach that accuracy remains the most rigorous way to test whether an AI system is ready to deliver meaningful results in the field.
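MLPerf Training formalizes this methodology: each benchmark is scored by the wall-clock time a system needs to reach a fixed quality target (the "Time to Train" columns below), not by throughput at an arbitrary step count. A minimal sketch of such a train-to-target loop, with hypothetical `train_one_epoch` and `evaluate` callables standing in for a real pipeline (this is an illustration, not MLCommons' harness):

```python
# Minimal sketch of a train-to-quality-target loop (not MLCommons' harness).
# `train_one_epoch` and `evaluate` are hypothetical stand-ins for a real pipeline.
import time

def time_to_train(train_one_epoch, evaluate, quality_target, max_epochs=100):
    """Return (epochs, minutes) needed to reach `quality_target`."""
    start = time.perf_counter()
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        # Quality gate, e.g. BERT's 0.72 Mask-LM accuracy target.
        if evaluate() >= quality_target:
            return epoch, (time.perf_counter() - start) / 60
    raise RuntimeError("quality target not reached within max_epochs")
```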
NVIDIA Performance on MLPerf 5.0 Training Benchmarks
NVIDIA Performance on MLPerf 5.0’s AI Benchmarks: Single Node, Closed Division
Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version
---|---|---|---|---|---|---|---|---|---
NVIDIA NeMo | Llama2-70B-lora | 11 | 0.925 Eval loss | 8x GB200 | BM.GPU.GB200.4 | 5.0-0020 | Mixed | SCROLLS GovReport | NVIDIA Blackwell GPU (GB200)
NVIDIA NeMo | Llama2-70B-lora | 11.2 | 0.925 Eval loss | 8x B200 | SYS-422GA-NBRT-LCC | 5.0-0089 | Mixed | SCROLLS GovReport | NVIDIA Blackwell GPU (B200-SXM-180GB)
NVIDIA NeMo | Stable Diffusion | 12.9 | FID<=90 and CLIP>=0.15 | 8x GB200 | Tyche (1x NVIDIA GB200 NVL72) | 5.0-0071 | Mixed | LAION-400M-filtered | NVIDIA Blackwell GPU (GB200)
NVIDIA NeMo | Stable Diffusion | 13 | FID<=90 and CLIP>=0.15 | 8x B200 | SYS-422GA-NBRT-LCC | 5.0-0089 | Mixed | LAION-400M-filtered | NVIDIA Blackwell GPU (B200-SXM-180GB)
PyTorch | BERT | 3.4 | 0.72 Mask-LM accuracy | 8x GB200 | Tyche (1x NVIDIA GB200 NVL72) | 5.0-0072 | Mixed | Wikipedia 2020/01/01 | NVIDIA Blackwell GPU (GB200)
PyTorch | BERT | 3.5 | 0.72 Mask-LM accuracy | 8x B200 | 1xXE9680Lx8B200-SXM-180GB | 5.0-0033 | Mixed | Wikipedia 2020/01/01 | NVIDIA Blackwell GPU (B200-SXM-180GB)
PyTorch | RetinaNet | 22.3 | 34.0% mAP | 8x GB200 | Tyche (1x NVIDIA GB200 NVL72) | 5.0-0072 | Mixed | A subset of OpenImages | NVIDIA Blackwell GPU (GB200)
PyTorch | RetinaNet | 21.8 | 34.0% mAP | 8x B200 | AS-A126GS-TNBR | 5.0-0085 | Mixed | A subset of OpenImages | NVIDIA Blackwell GPU (B200-SXM-180GB)
DGL | R-GAT | 5 | 72.0% classification | 8x GB200 | Tyche (1x NVIDIA GB200 NVL72) | 5.0-0069 | Mixed | IGBH-Full | NVIDIA Blackwell GPU (GB200)
DGL | R-GAT | 5.1 | 72.0% classification | 8x B200 | G893-SD1 | 5.0-0046 | Mixed | IGBH-Full | NVIDIA Blackwell GPU (B200-SXM-180GB)
NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 2.2 | 0.80275 AUC | 8x GB200 | Tyche (1x NVIDIA GB200 NVL72) | 5.0-0070 | Mixed | Criteo 3.5TB Click Logs (multi-hot variant) | NVIDIA Blackwell GPU (GB200)
NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 2.3 | 0.80275 AUC | 8x B200 | Nyx (1x NVIDIA DGX B200) | 5.0-0061 | Mixed | Criteo 3.5TB Click Logs (multi-hot variant) | NVIDIA Blackwell GPU (B200-SXM-180GB)
NVIDIA Performance on MLPerf 5.0’s AI Benchmarks: Multi Node, Closed Division
Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version
---|---|---|---|---|---|---|---|---|---
NVIDIA NeMo | Llama 3.1 405B | 240.3 | 5.6 log perplexity | 256x GB200 | Tyche (4x NVIDIA GB200 NVL72) | 5.0-0075 | Mixed | c4/en/3.0.1 | NVIDIA Blackwell GPU (GB200)
NVIDIA NeMo | Llama 3.1 405B | 121.1 | 5.6 log perplexity | 512x GB200 | Carina (8x NVIDIA GB200 NVL72) | 5.0-0005 | Mixed | c4/en/3.0.1 | NVIDIA Blackwell GPU (GB200)
NVIDIA NeMo | Llama 3.1 405B | 62.1 | 5.6 log perplexity | 1,024x GB200 | Carina (16x NVIDIA GB200 NVL72) | 5.0-0001 | Mixed | c4/en/3.0.1 | NVIDIA Blackwell GPU (GB200)
NVIDIA NeMo | Llama 3.1 405B | 27.3 | 5.6 log perplexity | 2,496x GB200 | Carina (39x NVIDIA GB200 NVL72) | 5.0-0004 | Mixed | c4/en/3.0.1 | NVIDIA Blackwell GPU (GB200)
NVIDIA NeMo | Llama 3.1 405B | 20.8 | 5.6 log perplexity | 8,192x H100 | Eos-dfw (1024x NVIDIA HGX H100) | 5.0-0010 | Mixed | c4/en/3.0.1 | NVIDIA H100-SXM5-80GB
NVIDIA NeMo | Llama2-70B-lora | 1.9 | 0.925 Eval loss | 64x GB200 | 16xXE9712x4GB200 | 5.0-0031 | Mixed | SCROLLS GovReport | NVIDIA Blackwell GPU (GB200)
NVIDIA NeMo | Llama2-70B-lora | 1.1 | 0.925 Eval loss | 144x GB200 | Tyche (2x NVIDIA GB200 NVL72) | 5.0-0073 | Mixed | SCROLLS GovReport | NVIDIA Blackwell GPU (GB200)
NVIDIA NeMo | Llama2-70B-lora | 0.6 | 0.925 Eval loss | 512x GB200 | Tyche (8x NVIDIA GB200 NVL72) | 5.0-0076 | Mixed | SCROLLS GovReport | NVIDIA Blackwell GPU (GB200)
NVIDIA NeMo | Llama2-70B-lora | 6.1 | 0.925 Eval loss | 16x B200 | AS-4126GS-NBR-LCC_N2 | 5.0-0083 | Mixed | SCROLLS GovReport | NVIDIA Blackwell GPU (B200-SXM-180GB)
NVIDIA NeMo | Llama2-70B-lora | 2 | 0.925 Eval loss | 64x B200 | BM.GPU.B200.8 | 5.0-0018 | Mixed | SCROLLS GovReport | NVIDIA Blackwell GPU (B200-SXM-180GB)
NVIDIA NeMo | Stable Diffusion | 7.6 | FID<=90 and CLIP>=0.15 | 16x GB200 | 4xXE9712x4GB200 | 5.0-0040 | Mixed | LAION-400M-filtered | NVIDIA Blackwell GPU (GB200)
NVIDIA NeMo | Stable Diffusion | 4.3 | FID<=90 and CLIP>=0.15 | 32x GB200 | 8xXE9712x4GB200 | 5.0-0041 | Mixed | LAION-400M-filtered | NVIDIA Blackwell GPU (GB200)
NVIDIA NeMo | Stable Diffusion | 2.7 | FID<=90 and CLIP>=0.15 | 64x GB200 | 16xXE9712x4GB200 | 5.0-0031 | Mixed | LAION-400M-filtered | NVIDIA Blackwell GPU (GB200)
NVIDIA NeMo | Stable Diffusion | 1 | FID<=90 and CLIP>=0.15 | 512x GB200 | Tyche (8x NVIDIA GB200 NVL72) | 5.0-0076 | Mixed | LAION-400M-filtered | NVIDIA Blackwell GPU (GB200)
NVIDIA NeMo | Stable Diffusion | 2.8 | FID<=90 and CLIP>=0.15 | 64x B200 | BM.GPU.B200.8 | 5.0-0018 | Mixed | LAION-400M-filtered | NVIDIA Blackwell GPU (B200-SXM-180GB)
PyTorch | BERT | 2.1 | 0.72 Mask-LM accuracy | 16x GB200 | 4xXE9712x4GB200 | 5.0-0040 | Mixed | Wikipedia 2020/01/01 | NVIDIA Blackwell GPU (GB200)
PyTorch | BERT | 1.5 | 0.72 Mask-LM accuracy | 32x GB200 | 8xXE9712x4GB200 | 5.0-0041 | Mixed | Wikipedia 2020/01/01 | NVIDIA Blackwell GPU (GB200)
PyTorch | BERT | 0.7 | 0.72 Mask-LM accuracy | 64x GB200 | Tyche (1x NVIDIA GB200 NVL72) | 5.0-0065 | Mixed | Wikipedia 2020/01/01 | NVIDIA Blackwell GPU (GB200)
PyTorch | BERT | 0.3 | 0.72 Mask-LM accuracy | 512x GB200 | Tyche (8x NVIDIA GB200 NVL72) | 5.0-0077 | Mixed | Wikipedia 2020/01/01 | NVIDIA Blackwell GPU (GB200)
PyTorch | BERT | 2.3 | 0.72 Mask-LM accuracy | 16x B200 | 2xXE9680Lx8B200-SXM-180GB | 5.0-0037 | Mixed | Wikipedia 2020/01/01 | NVIDIA Blackwell GPU (B200-SXM-180GB)
PyTorch | RetinaNet | 12.3 | 34.0% mAP | 16x GB200 | 4xXE9712x4GB200 | 5.0-0040 | Mixed | A subset of OpenImages | NVIDIA Blackwell GPU (GB200)
PyTorch | RetinaNet | 9 | 34.0% mAP | 32x GB200 | 8xXE9712x4GB200 | 5.0-0041 | Mixed | A subset of OpenImages | NVIDIA Blackwell GPU (GB200)
PyTorch | RetinaNet | 4.3 | 34.0% mAP | 64x GB200 | 16xXE9712x4GB200 | 5.0-0031 | Mixed | A subset of OpenImages | NVIDIA Blackwell GPU (GB200)
PyTorch | RetinaNet | 1.4 | 34.0% mAP | 512x GB200 | Tyche (8x NVIDIA GB200 NVL72) | 5.0-0077 | Mixed | A subset of OpenImages | NVIDIA Blackwell GPU (GB200)
PyTorch | RetinaNet | 14 | 34.0% mAP | 16x B200 | 2xXE9680Lx8B200-SXM-180GB | 5.0-0037 | Mixed | A subset of OpenImages | NVIDIA Blackwell GPU (B200-SXM-180GB)
PyTorch | RetinaNet | 4.4 | 34.0% mAP | 64x B200 | BM.GPU.B200.8 | 5.0-0018 | Mixed | A subset of OpenImages | NVIDIA Blackwell GPU (B200-SXM-180GB)
DGL | R-GAT | 1.1 | 72.0% classification | 72x GB200 | Tyche (1x NVIDIA GB200 NVL72) | 5.0-0066 | Mixed | IGBH-Full | NVIDIA Blackwell GPU (GB200)
DGL | R-GAT | 0.8 | 72.0% classification | 256x GB200 | Tyche (4x NVIDIA GB200 NVL72) | 5.0-0074 | Mixed | IGBH-Full | NVIDIA Blackwell GPU (GB200)
NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 0.7 | 0.80275 AUC | 64x GB200 | SRS-GB200-NVL72-M1 (16x ARS-121GL-NBO) | 5.0-0087 | Mixed | Criteo 3.5TB Click Logs (multi-hot variant) | NVIDIA Blackwell GPU (GB200)
MLPerf™ v5.0 Training Closed: 5.0-0001, 5.0-0004, 5.0-0005, 5.0-0010, 5.0-0018, 5.0-0020, 5.0-0031, 5.0-0033, 5.0-0037, 5.0-0040, 5.0-0041, 5.0-0046, 5.0-0061, 5.0-0065, 5.0-0066, 5.0-0068, 5.0-0069, 5.0-0070, 5.0-0071, 5.0-0072, 5.0-0073, 5.0-0074, 5.0-0075, 5.0-0076, 5.0-0077, 5.0-0083, 5.0-0085, 5.0-0087, 5.0-0089 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
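One way to read the multi-node table: the Llama 3.1 405B submissions scale nearly linearly with GPU count. A quick sketch, using values copied from those rows, of the scaling efficiency they imply:

```python
# Scaling efficiency implied by the Llama 3.1 405B multi-node rows above.
# (gpu_count, minutes_to_train) pairs copied from the table.
runs = [(256, 240.3), (512, 121.1), (1024, 62.1), (2496, 27.3)]

base_gpus, base_minutes = runs[0]
for gpus, minutes in runs[1:]:
    speedup = base_minutes / minutes  # measured speedup vs. the 256-GPU run
    ideal = gpus / base_gpus          # perfectly linear speedup
    print(f"{gpus:>5} GPUs: {speedup:.2f}x vs ideal {ideal:.2f}x "
          f"-> {100 * speedup / ideal:.0f}% scaling efficiency")
```

Running this gives roughly 99% efficiency at 512 GPUs, 97% at 1,024, and 90% at 2,496.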
NVIDIA Performance on MLPerf 3.0’s Training HPC Benchmarks: Closed Division
Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version
---|---|---|---|---|---|---|---|---|---
PyTorch | CosmoFlow | 2.1 | Mean average error 0.124 | 512x H100 | eos | 3.0-8006 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | H100-SXM5-80GB
PyTorch | DeepCAM | 0.8 | IOU 0.82 | 2,048x H100 | eos | 3.0-8007 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | H100-SXM5-80GB
PyTorch | OpenCatalyst | 10.7 | Forces mean absolute error 0.036 | 640x H100 | eos | 3.0-8008 | Mixed | Open Catalyst 2020 (OC20) S2EF 2M training split, ID validation set | H100-SXM5-80GB
PyTorch | OpenFold | 7.5 | Local Distance Difference Test (lDDT-Cα) >= 0.8 | 2,080x H100 | eos | 3.0-8009 | Mixed | OpenProteinSet and Protein Data Bank | H100-SXM5-80GB
MLPerf™ v3.0 Training HPC Closed: 3.0-8006, 3.0-8007, 3.0-8008, 3.0-8009 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
LLM Training Performance on NVIDIA Data Center Products
B200 Training Performance
Framework | Model | Time to Train (days) | Throughput per GPU | GPU | Server | Container Version | Sequence Length | TP | PP | CP | Precision | Global Batch Size | GPU Version
---|---|---|---|---|---|---|---|---|---|---|---|---|---
NVIDIA NeMo | GPT3 175B | 7 | 1,523 tokens/sec | 512x B200 | DGX B200 | nemo:25.04 | 2048 | 4 | 4 | 1 | FP8 | 2048 | NVIDIA B200
NVIDIA NeMo | Llama3 70B | 3 | 3,562 tokens/sec | 64x B200 | DGX B200 | nemo:25.04 | 8192 | 2 | 4 | 2 | FP8 | 128 | NVIDIA B200
NVIDIA NeMo | Llama3 405B | 17 | 651 tokens/sec | 128x B200 | DGX B200 | nemo:25.04 | 8192 | 4 | 8 | 2 | FP8 | 64 | NVIDIA B200
NVIDIA NeMo | Nemotron 15B | 0.7 | 16,222 tokens/sec | 64x B200 | DGX B200 | nemo:25.04 | 4096 | 1 | 1 | 1 | FP8 | 256 | NVIDIA B200
NVIDIA NeMo | Nemotron 340B | 18 | 632 tokens/sec | 128x B200 | DGX B200 | nemo:25.04 | 4096 | 8 | 4 | 1 | FP8 | 32 | NVIDIA B200
NVIDIA NeMo | Mixtral 8x7B | 0.6 | 17,617 tokens/sec | 64x B200 | DGX B200 | nemo:25.04 | 4096 | 1 | 1 | 1 | FP8 | 256 | NVIDIA B200
NVIDIA NeMo | Mixtral 8x22B | 5 | 2,399 tokens/sec | 256x B200 | DGX B200 | nemo:25.04 | 65536 | 2 | 4 | 8 | FP8 | 1 | NVIDIA B200
TP: Tensor Parallelism
PP: Pipeline Parallelism
CP: Context Parallelism
Time to Train is the estimated time to train on 1T tokens with 1K GPUs.
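As a sanity check on that note, here is a short sketch (my arithmetic, not an NVIDIA tool) showing how the Time to Train column follows from the per-GPU throughput column, and how the TP/PP/CP columns relate to the GPU count:

```python
# Hedged sketch: reproduce the "Time to Train (days)" estimate from the
# "Throughput per GPU" column, assuming 1T tokens and 1K GPUs as noted above.
SECONDS_PER_DAY = 86_400

def estimated_days(tokens_per_sec_per_gpu: float,
                   n_gpus: int = 1_000,
                   total_tokens: float = 1e12) -> float:
    """Estimated wall-clock days to stream `total_tokens` through the cluster."""
    return total_tokens / (tokens_per_sec_per_gpu * n_gpus) / SECONDS_PER_DAY

# GPT3 175B row: 1,523 tokens/sec per GPU -> ~7.6 days,
# in line with the 7 days shown in the table.
print(f"{estimated_days(1_523):.1f} days")

# The parallelism columns factor the GPU count: n_gpus = TP * PP * CP * DP,
# where DP is the number of data-parallel replicas. For the GPT3 175B row:
tp, pp, cp, n_gpus = 4, 4, 1, 512
print("data-parallel size:", n_gpus // (tp * pp * cp))  # 512 / (4*4*1) = 32
```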
Converged Training Performance on NVIDIA Data Center GPUs
H200 Training Performance
Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
---|---|---|---|---|---|---|---|---|---|---|---|---
PyTorch | 2.4.0a0 | Tacotron2 | 65 | .56 Training Loss | 496,465 total output mels/sec | 8x H200 | DGX H200 | 24.12-py3 | TF32 | 128 | LJSpeech 1.1 | NVIDIA H200
PyTorch | 2.4.0a0 | WaveGlow | 106 | -5.7 Training Loss | 4,124,433 output samples/sec | 8x H200 | DGX H200 | 24.12-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA H200
PyTorch | 2.4.0a0 | NCF | | .96 Hit Rate at 10 | 252,318,096 samples/sec | 8x H200 | DGX H200 | 24.12-py3 | Mixed | 131072 | MovieLens 20M | NVIDIA H200
PyTorch | 2.4.0a0 | FastPitch | 66 | .17 Training Loss | 1,465,568 frames/sec | 8x H200 | DGX H200 | 24.12-py3 | TF32 | 32 | LJSpeech 1.1 | NVIDIA H200
PyTorch | 2.4.0a0 | Transformer XL Large | 264 | 17.82 Perplexity | 317,663 total tokens/sec | 8x H200 | DGX H200 | 24.12-py3 | Mixed | 16 | WikiText-103 | NVIDIA H200
PyTorch | 2.4.0a0 | Transformer XL Base | 116 | 21.6 Perplexity | 1,163,450 total tokens/sec | 8x H200 | DGX H200 | 24.12-py3 | Mixed | 128 | WikiText-103 | NVIDIA H200
PyTorch | 2.4.0a0 | EfficientDet-D0 | 303 | .33 BBOX mAP | 2,793 images/sec | 8x H200 | DGX H200 | 24.12-py3 | Mixed | 150 | COCO 2017 | NVIDIA H200
PyTorch | 2.4.0a0 | HiFiGAN | 915 | 9.75 Training Loss | 120,606 total output mels/sec | 8x H200 | DGX H200 | 24.12-py3 | Mixed | 16 | LJSpeech 1.1 | NVIDIA H200
H100 Training Performance
Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
---|---|---|---|---|---|---|---|---|---|---|---|---
PyTorch | 2.4.0a0 | Tacotron2 | | . Training Loss | 477,113 total output mels/sec | 8x H100 | DGX H100 | 24.12-py3 | Mixed | 128 | LJSpeech 1.1 | H100-SXM5-80GB
PyTorch | 2.4.0a0 | WaveGlow | | . Training Loss | 3,809,464 output samples/sec | 8x H100 | DGX H100 | 24.12-py3 | Mixed | 10 | LJSpeech 1.1 | H100-SXM5-80GB
PyTorch | 2.4.0a0 | NCF | | . Hit Rate at 10 | 212,174,107 samples/sec | 8x H100 | DGX H100 | 24.12-py3 | TF32 | 131072 | MovieLens 20M | H100-SXM5-80GB
PyTorch | 2.4.0a0 | FastPitch | | . Training Loss | 1,431,758 frames/sec | 8x H100 | DGX H100 | 24.12-py3 | TF32 | 32 | LJSpeech 1.1 | H100-SXM5-80GB
A30 Training Performance
Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
---|---|---|---|---|---|---|---|---|---|---|---|---
PyTorch | 2.4.0a0 | Tacotron2 | 129 | .53 Training Loss | 237,526 total output mels/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 104 | LJSpeech 1.1 | NVIDIA A30
PyTorch | 2.4.0a0 | WaveGlow | 402 | -5.88 Training Loss | 1,047,359 output samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA A30
PyTorch | 2.4.0a0 | GNMT v2 | 49 | 24.23 BLEU Score | 306,590 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 128 | wmt16-en-de | NVIDIA A30
PyTorch | 2.4.0a0 | NCF | 1 | .96 Hit Rate at 10 | 41,902,951 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 131072 | MovieLens 20M | NVIDIA A30
PyTorch | 2.4.0a0 | FastPitch | 153 | .17 Training Loss | 547,338 frames/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 16 | LJSpeech 1.1 | NVIDIA A30
PyTorch | 2.4.0a0 | Transformer XL Base | 196 | 22.82 Perplexity | 168,548 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 32 | WikiText-103 | NVIDIA A30
PyTorch | 2.4.0a0 | EfficientNet-B0 | 785 | 77.15 Top 1 | 11,335 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 128 | Imagenet2012 | NVIDIA A30
PyTorch | 2.4.0a0 | EfficientNet-WideSE-B0 | 800 | 77.08 Top 1 | 11,029 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 128 | Imagenet2012 | NVIDIA A30
PyTorch | 2.4.0a0 | MoFlow | 99 | 86.8 NUV | 12,284 molecules/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 512 | ZINC | NVIDIA A30
A10 Training Performance
Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
---|---|---|---|---|---|---|---|---|---|---|---|---
PyTorch | 2.4.0a0 | Tacotron2 | 145 | .53 Training Loss | 210,315 total output mels/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 104 | LJSpeech 1.1 | NVIDIA A10
PyTorch | 2.4.0a0 | WaveGlow | 543 | -5.8 Training Loss | 776,028 output samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA A10
PyTorch | 2.4.0a0 | GNMT v2 | 57 | 24.29 BLEU Score | 262,936 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 128 | wmt16-en-de | NVIDIA A10
PyTorch | 2.4.0a0 | NCF | 2 | .96 Hit Rate at 10 | 33,005,044 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.09-py3 | TF32 | 131072 | MovieLens 20M | NVIDIA A10
PyTorch | 2.4.0a0 | FastPitch | 180 | .17 Training Loss | 462,052 frames/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 16 | LJSpeech 1.1 | NVIDIA A10
PyTorch | 2.4.0a0 | Transformer XL Base | 262 | 22.82 Perplexity | 126,073 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 32 | WikiText-103 | NVIDIA A10
PyTorch | 2.4.0a0 | EfficientNet-B0 | 1,035 | 77.06 Top 1 | 8,508 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 128 | Imagenet2012 | NVIDIA A10
PyTorch | 2.4.0a0 | EfficientNet-WideSE-B0 | 1,061 | 77.23 Top 1 | 8,301 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 128 | Imagenet2012 | NVIDIA A10
PyTorch | 2.4.0a0 | MoFlow | 100 | 88.14 NUV | 12,237 molecules/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 512 | ZINC | NVIDIA A10
AI Inference
Real-world inference demands high throughput and low latency with maximum efficiency across use cases. An industry-leading solution lets customers quickly deploy AI models into real-world production with the highest performance from data center to edge.
AI Pipeline
NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.