AI Training

Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. Training to convergence is the most meaningful test of whether an AI system is ready to be deployed in the field and deliver useful results.
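
As a rough illustration of what a "time to train" measurement involves, the sketch below trains until a held-out quality metric reaches its target and reports the elapsed wall-clock minutes. This is a minimal sketch, not the MLPerf harness; `model`, `train_loader`, `evaluate`, and `training_step` are assumed stand-ins.

```python
import time

def train_to_target(model, train_loader, evaluate, target, max_epochs=100):
    """Train until the evaluated quality metric reaches `target`;
    return elapsed wall-clock minutes (the 'time to train')."""
    start = time.perf_counter()
    for _ in range(max_epochs):
        for batch in train_loader:
            model.training_step(batch)   # hypothetical per-batch optimizer step
        if evaluate(model) >= target:    # e.g. 0.72 Mask-LM accuracy for BERT
            break
    return (time.perf_counter() - start) / 60.0
```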




NVIDIA Performance on MLPerf 5.0 Training Benchmarks


NVIDIA Performance on MLPerf 5.0’s AI Benchmarks: Single Node, Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| NVIDIA NeMo | Llama2-70B-lora | 11 | 0.925 Eval loss | 8x GB200 | BM.GPU.GB200.4 | 5.0-0020 | Mixed | SCROLLS GovReport | NVIDIA Blackwell GPU (GB200) |
| NVIDIA NeMo | Llama2-70B-lora | 11.2 | 0.925 Eval loss | 8x B200 | SYS-422GA-NBRT-LCC | 5.0-0089 | Mixed | SCROLLS GovReport | NVIDIA Blackwell GPU (B200-SXM-180GB) |
| NVIDIA NeMo | Stable Diffusion | 12.9 | FID<=90 and CLIP>=0.15 | 8x GB200 | Tyche (1x NVIDIA GB200 NVL72) | 5.0-0071 | Mixed | LAION-400M-filtered | NVIDIA Blackwell GPU (GB200) |
| NVIDIA NeMo | Stable Diffusion | 13 | FID<=90 and CLIP>=0.15 | 8x B200 | SYS-422GA-NBRT-LCC | 5.0-0089 | Mixed | LAION-400M-filtered | NVIDIA Blackwell GPU (B200-SXM-180GB) |
| PyTorch | BERT | 3.4 | 0.72 Mask-LM accuracy | 8x GB200 | Tyche (1x NVIDIA GB200 NVL72) | 5.0-0072 | Mixed | Wikipedia 2020/01/01 | NVIDIA Blackwell GPU (GB200) |
| PyTorch | BERT | 3.5 | 0.72 Mask-LM accuracy | 8x B200 | 1xXE9680Lx8B200-SXM-180GB | 5.0-0033 | Mixed | Wikipedia 2020/01/01 | NVIDIA Blackwell GPU (B200-SXM-180GB) |
| PyTorch | RetinaNet | 22.3 | 34.0% mAP | 8x GB200 | Tyche (1x NVIDIA GB200 NVL72) | 5.0-0072 | Mixed | A subset of OpenImages | NVIDIA Blackwell GPU (GB200) |
| PyTorch | RetinaNet | 21.8 | 34.0% mAP | 8x B200 | AS-A126GS-TNBR | 5.0-0085 | Mixed | A subset of OpenImages | NVIDIA Blackwell GPU (B200-SXM-180GB) |
| DGL | R-GAT | 5 | 72.0% classification | 8x GB200 | Tyche (1x NVIDIA GB200 NVL72) | 5.0-0069 | Mixed | IGBH-Full | NVIDIA Blackwell GPU (GB200) |
| DGL | R-GAT | 5.1 | 72.0% classification | 8x B200 | G893-SD1 | 5.0-0046 | Mixed | IGBH-Full | NVIDIA Blackwell GPU (B200-SXM-180GB) |
| NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 2.2 | 0.80275 AUC | 8x GB200 | Tyche (1x NVIDIA GB200 NVL72) | 5.0-0070 | Mixed | Criteo 3.5TB Click Logs (multi-hot variant) | NVIDIA Blackwell GPU (GB200) |
| NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 2.3 | 0.80275 AUC | 8x B200 | Nyx (1x NVIDIA DGX B200) | 5.0-0061 | Mixed | Criteo 3.5TB Click Logs (multi-hot variant) | NVIDIA Blackwell GPU (B200-SXM-180GB) |

NVIDIA Performance on MLPerf 5.0’s AI Benchmarks: Multi Node, Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| NVIDIA NeMo | Llama 3.1 405B | 240.3 | 5.6 log perplexity | 256x GB200 | Tyche (4x NVIDIA GB200 NVL72) | 5.0-0075 | Mixed | c4/en/3.0.1 | NVIDIA Blackwell GPU (GB200) |
| NVIDIA NeMo | Llama 3.1 405B | 121.1 | 5.6 log perplexity | 512x GB200 | Carina (8x NVIDIA GB200 NVL72) | 5.0-0005 | Mixed | c4/en/3.0.1 | NVIDIA Blackwell GPU (GB200) |
| NVIDIA NeMo | Llama 3.1 405B | 62.1 | 5.6 log perplexity | 1,024x GB200 | Carina (16x NVIDIA GB200 NVL72) | 5.0-0001 | Mixed | c4/en/3.0.1 | NVIDIA Blackwell GPU (GB200) |
| NVIDIA NeMo | Llama 3.1 405B | 27.3 | 5.6 log perplexity | 2,496x GB200 | Carina (39x NVIDIA GB200 NVL72) | 5.0-0004 | Mixed | c4/en/3.0.1 | NVIDIA Blackwell GPU (GB200) |
| NVIDIA NeMo | Llama 3.1 405B | 20.8 | 5.6 log perplexity | 8,192x H100 | Eos-dfw (1024x NVIDIA HGX H100) | 5.0-0010 | Mixed | c4/en/3.0.1 | NVIDIA H100-SXM5-80GB |
| NVIDIA NeMo | Llama2-70B-lora | 1.9 | 0.925 Eval loss | 64x GB200 | 16xXE9712x4GB200 | 5.0-0031 | Mixed | SCROLLS GovReport | NVIDIA Blackwell GPU (GB200) |
| NVIDIA NeMo | Llama2-70B-lora | 1.1 | 0.925 Eval loss | 144x GB200 | Tyche (2x NVIDIA GB200 NVL72) | 5.0-0073 | Mixed | SCROLLS GovReport | NVIDIA Blackwell GPU (GB200) |
| NVIDIA NeMo | Llama2-70B-lora | 0.6 | 0.925 Eval loss | 512x GB200 | Tyche (8x NVIDIA GB200 NVL72) | 5.0-0076 | Mixed | SCROLLS GovReport | NVIDIA Blackwell GPU (GB200) |
| NVIDIA NeMo | Llama2-70B-lora | 6.1 | 0.925 Eval loss | 16x B200 | AS-4126GS-NBR-LCC_N2 | 5.0-0083 | Mixed | SCROLLS GovReport | NVIDIA Blackwell GPU (B200-SXM-180GB) |
| NVIDIA NeMo | Llama2-70B-lora | 2 | 0.925 Eval loss | 64x B200 | BM.GPU.B200.8 | 5.0-0018 | Mixed | SCROLLS GovReport | NVIDIA Blackwell GPU (B200-SXM-180GB) |
| NVIDIA NeMo | Stable Diffusion | 7.6 | FID<=90 and CLIP>=0.15 | 16x GB200 | 4xXE9712x4GB200 | 5.0-0040 | Mixed | LAION-400M-filtered | NVIDIA Blackwell GPU (GB200) |
| NVIDIA NeMo | Stable Diffusion | 4.3 | FID<=90 and CLIP>=0.15 | 32x GB200 | 8xXE9712x4GB200 | 5.0-0041 | Mixed | LAION-400M-filtered | NVIDIA Blackwell GPU (GB200) |
| NVIDIA NeMo | Stable Diffusion | 2.7 | FID<=90 and CLIP>=0.15 | 64x GB200 | 16xXE9712x4GB200 | 5.0-0031 | Mixed | LAION-400M-filtered | NVIDIA Blackwell GPU (GB200) |
| NVIDIA NeMo | Stable Diffusion | 1 | FID<=90 and CLIP>=0.15 | 512x GB200 | Tyche (8x NVIDIA GB200 NVL72) | 5.0-0076 | Mixed | LAION-400M-filtered | NVIDIA Blackwell GPU (GB200) |
| NVIDIA NeMo | Stable Diffusion | 2.8 | FID<=90 and CLIP>=0.15 | 64x B200 | BM.GPU.B200.8 | 5.0-0018 | Mixed | LAION-400M-filtered | NVIDIA Blackwell GPU (B200-SXM-180GB) |
| PyTorch | BERT | 2.1 | 0.72 Mask-LM accuracy | 16x GB200 | 4xXE9712x4GB200 | 5.0-0040 | Mixed | Wikipedia 2020/01/01 | NVIDIA Blackwell GPU (GB200) |
| PyTorch | BERT | 1.5 | 0.72 Mask-LM accuracy | 32x GB200 | 8xXE9712x4GB200 | 5.0-0041 | Mixed | Wikipedia 2020/01/01 | NVIDIA Blackwell GPU (GB200) |
| PyTorch | BERT | 0.7 | 0.72 Mask-LM accuracy | 64x GB200 | Tyche (1x NVIDIA GB200 NVL72) | 5.0-0065 | Mixed | Wikipedia 2020/01/01 | NVIDIA Blackwell GPU (GB200) |
| PyTorch | BERT | 0.3 | 0.72 Mask-LM accuracy | 512x GB200 | Tyche (8x NVIDIA GB200 NVL72) | 5.0-0077 | Mixed | Wikipedia 2020/01/01 | NVIDIA Blackwell GPU (GB200) |
| PyTorch | BERT | 2.3 | 0.72 Mask-LM accuracy | 16x B200 | 2xXE9680Lx8B200-SXM-180GB | 5.0-0037 | Mixed | Wikipedia 2020/01/01 | NVIDIA Blackwell GPU (B200-SXM-180GB) |
| PyTorch | RetinaNet | 12.3 | 34.0% mAP | 16x GB200 | 4xXE9712x4GB200 | 5.0-0040 | Mixed | A subset of OpenImages | NVIDIA Blackwell GPU (GB200) |
| PyTorch | RetinaNet | 9 | 34.0% mAP | 32x GB200 | 8xXE9712x4GB200 | 5.0-0041 | Mixed | A subset of OpenImages | NVIDIA Blackwell GPU (GB200) |
| PyTorch | RetinaNet | 4.3 | 34.0% mAP | 64x GB200 | 16xXE9712x4GB200 | 5.0-0031 | Mixed | A subset of OpenImages | NVIDIA Blackwell GPU (GB200) |
| PyTorch | RetinaNet | 1.4 | 34.0% mAP | 512x GB200 | Tyche (8x NVIDIA GB200 NVL72) | 5.0-0077 | Mixed | A subset of OpenImages | NVIDIA Blackwell GPU (GB200) |
| PyTorch | RetinaNet | 14 | 34.0% mAP | 16x B200 | 2xXE9680Lx8B200-SXM-180GB | 5.0-0037 | Mixed | A subset of OpenImages | NVIDIA Blackwell GPU (B200-SXM-180GB) |
| PyTorch | RetinaNet | 4.4 | 34.0% mAP | 64x B200 | BM.GPU.B200.8 | 5.0-0018 | Mixed | A subset of OpenImages | NVIDIA Blackwell GPU (B200-SXM-180GB) |
| DGL | R-GAT | 1.1 | 72.0% classification | 72x GB200 | Tyche (1x NVIDIA GB200 NVL72) | 5.0-0066 | Mixed | IGBH-Full | NVIDIA Blackwell GPU (GB200) |
| DGL | R-GAT | 0.8 | 72.0% classification | 256x GB200 | Tyche (4x NVIDIA GB200 NVL72) | 5.0-0074 | Mixed | IGBH-Full | NVIDIA Blackwell GPU (GB200) |
| NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 0.7 | 0.80275 AUC | 64x GB200 | SRS-GB200-NVL72-M1 (16x ARS-121GL-NBO) | 5.0-0087 | Mixed | Criteo 3.5TB Click Logs (multi-hot variant) | NVIDIA Blackwell GPU (GB200) |

MLPerf™ v5.0 Training Closed: 5.0-0001, 5.0-0004, 5.0-0005, 5.0-0010, 5.0-0018, 5.0-0020, 5.0-0031, 5.0-0033, 5.0-0037, 5.0-0040, 5.0-0041, 5.0-0046, 5.0-0061, 5.0-0065, 5.0-0066, 5.0-0068, 5.0-0069, 5.0-0070, 5.0-0071, 5.0-0072, 5.0-0073, 5.0-0074, 5.0-0075, 5.0-0076, 5.0-0077, 5.0-0083, 5.0-0085, 5.0-0087, 5.0-0089 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.


NVIDIA Performance on MLPerf 3.0’s Training HPC Benchmarks: Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| PyTorch | CosmoFlow | 2.1 | Mean average error 0.124 | 512x H100 | eos | 3.0-8006 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | H100-SXM5-80GB |
| PyTorch | DeepCAM | 0.8 | IOU 0.82 | 2,048x H100 | eos | 3.0-8007 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | H100-SXM5-80GB |
| PyTorch | OpenCatalyst | 10.7 | Forces mean absolute error 0.036 | 640x H100 | eos | 3.0-8008 | Mixed | Open Catalyst 2020 (OC20) S2EF 2M training split, ID validation set | H100-SXM5-80GB |
| PyTorch | OpenFold | 7.5 | Local Distance Difference Test (lDDT-Cα) >= 0.8 | 2,080x H100 | eos | 3.0-8009 | Mixed | OpenProteinSet and Protein Data Bank | H100-SXM5-80GB |

MLPerf™ v3.0 Training HPC Closed: 3.0-8006, 3.0-8007, 3.0-8008, 3.0-8009 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.



LLM Training Performance on NVIDIA Data Center Products


B200 Training Performance



| Framework | Model | Time to Train (days) | Throughput per GPU | GPU | Server | Container Version | Sequence Length | TP | PP | CP | Precision | Global Batch Size | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NVIDIA NeMo | GPT3 175B | 7 | 1,523 tokens/sec | 512x B200 | DGX B200 | nemo:25.04 | 2048 | 4 | 4 | 1 | FP8 | 2048 | NVIDIA B200 |
| NVIDIA NeMo | Llama3 70B | 3 | 3,562 tokens/sec | 64x B200 | DGX B200 | nemo:25.04 | 8192 | 2 | 4 | 2 | FP8 | 128 | NVIDIA B200 |
| NVIDIA NeMo | Llama3 405B | 17 | 651 tokens/sec | 128x B200 | DGX B200 | nemo:25.04 | 8192 | 4 | 8 | 2 | FP8 | 64 | NVIDIA B200 |
| NVIDIA NeMo | Nemotron 15B | 0.7 | 16,222 tokens/sec | 64x B200 | DGX B200 | nemo:25.04 | 4096 | 1 | 1 | 1 | FP8 | 256 | NVIDIA B200 |
| NVIDIA NeMo | Nemotron 340B | 18 | 632 tokens/sec | 128x B200 | DGX B200 | nemo:25.04 | 4096 | 8 | 4 | 1 | FP8 | 32 | NVIDIA B200 |
| NVIDIA NeMo | Mixtral 8x7B | 0.6 | 17,617 tokens/sec | 64x B200 | DGX B200 | nemo:25.04 | 4096 | 1 | 1 | 1 | FP8 | 256 | NVIDIA B200 |
| NVIDIA NeMo | Mixtral 8x22B | 5 | 2,399 tokens/sec | 256x B200 | DGX B200 | nemo:25.04 | 65536 | 2 | 4 | 8 | FP8 | 1 | NVIDIA B200 |

TP: Tensor Parallelism
PP: Pipeline Parallelism
CP: Context Parallelism
Time to Train is the estimated time to train on 1T tokens with 1K (1,024) GPUs
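
Since that estimate is pure arithmetic, it can be checked directly against the throughput column. A minimal sketch, assuming exactly 1,024 GPUs and 1T = 10^12 tokens (both assumptions, matching the note above):

```python
SECONDS_PER_DAY = 86_400

def days_to_train(tokens_per_sec_per_gpu: float,
                  num_gpus: int = 1024,
                  total_tokens: float = 1e12) -> float:
    """Estimated days to train on `total_tokens` at the given per-GPU rate."""
    return total_tokens / (tokens_per_sec_per_gpu * num_gpus) / SECONDS_PER_DAY

print(f"{days_to_train(1_523):.1f}")  # GPT3 175B at 1,523 tokens/sec -> ~7.4 days (table lists 7)
print(f"{days_to_train(651):.1f}")    # Llama3 405B at 651 tokens/sec -> ~17.4 days (table lists 17)
```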


Converged Training Performance on NVIDIA Data Center GPUs


H200 Training Performance



| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.4.0a0 | Tacotron2 | 65 | 0.56 Training Loss | 496,465 total output mels/sec | 8x H200 | DGX H200 | 24.12-py3 | TF32 | 128 | LJSpeech 1.1 | NVIDIA H200 |
| PyTorch | 2.4.0a0 | WaveGlow | 106 | -5.7 Training Loss | 4,124,433 output samples/sec | 8x H200 | DGX H200 | 24.12-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA H200 |
| PyTorch | 2.4.0a0 | NCF | | 0.96 Hit Rate at 10 | 252,318,096 samples/sec | 8x H200 | DGX H200 | 24.12-py3 | Mixed | 131072 | MovieLens 20M | NVIDIA H200 |
| PyTorch | 2.4.0a0 | FastPitch | 66 | 0.17 Training Loss | 1,465,568 frames/sec | 8x H200 | DGX H200 | 24.12-py3 | TF32 | 32 | LJSpeech 1.1 | NVIDIA H200 |
| PyTorch | 2.4.0a0 | Transformer XL Large | 264 | 17.82 Perplexity | 317,663 total tokens/sec | 8x H200 | DGX H200 | 24.12-py3 | Mixed | 16 | WikiText-103 | NVIDIA H200 |
| PyTorch | 2.4.0a0 | Transformer XL Base | 116 | 21.6 Perplexity | 1,163,450 total tokens/sec | 8x H200 | DGX H200 | 24.12-py3 | Mixed | 128 | WikiText-103 | NVIDIA H200 |
| PyTorch | 2.4.0a0 | EfficientDet-D0 | 303 | 0.33 BBOX mAP | 2,793 images/sec | 8x H200 | DGX H200 | 24.12-py3 | Mixed | 150 | COCO 2017 | NVIDIA H200 |
| PyTorch | 2.4.0a0 | HiFiGAN | 915 | 9.75 Training Loss | 120,606 total output mels/sec | 8x H200 | DGX H200 | 24.12-py3 | Mixed | 16 | LJSpeech-1.1 | NVIDIA H200 |

H100 Training Performance



| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.4.0a0 | Tacotron2 | | Training Loss | 477,113 total output mels/sec | 8x H100 | DGX H100 | 24.12-py3 | Mixed | 128 | LJSpeech 1.1 | H100-SXM5-80GB |
| PyTorch | 2.4.0a0 | WaveGlow | | Training Loss | 3,809,464 output samples/sec | 8x H100 | DGX H100 | 24.12-py3 | Mixed | 10 | LJSpeech 1.1 | H100-SXM5-80GB |
| PyTorch | 2.4.0a0 | NCF | | Hit Rate at 10 | 212,174,107 samples/sec | 8x H100 | DGX H100 | 24.12-py3 | TF32 | 131072 | MovieLens 20M | H100-SXM5-80GB |
| PyTorch | 2.4.0a0 | FastPitch | | Training Loss | 1,431,758 frames/sec | 8x H100 | DGX H100 | 24.12-py3 | TF32 | 32 | LJSpeech 1.1 | H100-SXM5-80GB |

A30 Training Performance


| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.4.0a0 | Tacotron2 | 129 | 0.53 Training Loss | 237,526 total output mels/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 104 | LJSpeech 1.1 | NVIDIA A30 |
| PyTorch | 2.4.0a0 | WaveGlow | 402 | -5.88 Training Loss | 1,047,359 output samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA A30 |
| PyTorch | 2.4.0a0 | GNMT v2 | 49 | 24.23 BLEU Score | 306,590 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 128 | wmt16-en-de | NVIDIA A30 |
| PyTorch | 2.4.0a0 | NCF | 1 | 0.96 Hit Rate at 10 | 41,902,951 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 131072 | MovieLens 20M | NVIDIA A30 |
| PyTorch | 2.4.0a0 | FastPitch | 153 | 0.17 Training Loss | 547,338 frames/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 16 | LJSpeech 1.1 | NVIDIA A30 |
| PyTorch | 2.4.0a0 | Transformer XL Base | 196 | 22.82 Perplexity | 168,548 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 32 | WikiText-103 | NVIDIA A30 |
| PyTorch | 2.4.0a0 | EfficientNet-B0 | 785 | 77.15 Top 1 | 11,335 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 128 | Imagenet2012 | NVIDIA A30 |
| PyTorch | 2.4.0a0 | EfficientNet-WideSE-B0 | 800 | 77.08 Top 1 | 11,029 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 128 | Imagenet2012 | NVIDIA A30 |
| PyTorch | 2.4.0a0 | MoFlow | 99 | 86.8 NUV | 12,284 molecules/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 512 | ZINC | NVIDIA A30 |

A10 Training Performance


| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.4.0a0 | Tacotron2 | 145 | 0.53 Training Loss | 210,315 total output mels/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 104 | LJSpeech 1.1 | NVIDIA A10 |
| PyTorch | 2.4.0a0 | WaveGlow | 543 | -5.8 Training Loss | 776,028 output samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA A10 |
| PyTorch | 2.4.0a0 | GNMT v2 | 57 | 24.29 BLEU Score | 262,936 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 128 | wmt16-en-de | NVIDIA A10 |
| PyTorch | 2.4.0a0 | NCF | 2 | 0.96 Hit Rate at 10 | 33,005,044 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.09-py3 | TF32 | 131072 | MovieLens 20M | NVIDIA A10 |
| PyTorch | 2.4.0a0 | FastPitch | 180 | 0.17 Training Loss | 462,052 frames/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 16 | LJSpeech 1.1 | NVIDIA A10 |
| PyTorch | 2.4.0a0 | Transformer XL Base | 262 | 22.82 Perplexity | 126,073 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 32 | WikiText-103 | NVIDIA A10 |
| PyTorch | 2.4.0a0 | EfficientNet-B0 | 1,035 | 77.06 Top 1 | 8,508 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 128 | Imagenet2012 | NVIDIA A10 |
| PyTorch | 2.4.0a0 | EfficientNet-WideSE-B0 | 1,061 | 77.23 Top 1 | 8,301 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 128 | Imagenet2012 | NVIDIA A10 |
| PyTorch | 2.4.0a0 | MoFlow | 100 | 88.14 NUV | 12,237 molecules/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 512 | ZINC | NVIDIA A10 |




AI Inference

Real-world inference demands high throughput and low latency with maximum efficiency across use cases. An industry-leading solution lets customers quickly deploy AI models into real-world production with the highest performance from data center to edge.


AI Pipeline

NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.
