AI Training

Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. Training to convergence is the most reliable test of whether an AI system is ready to be deployed in the field and deliver meaningful results.
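The convergence criterion can be made concrete as a small training loop that stops once an evaluation metric reaches the benchmark's quality target and reports the elapsed time. A minimal sketch in Python; the `train_one_epoch`/`evaluate` helpers are toy stand-ins, not MLPerf reference code:

```python
# Minimal sketch of convergence-based training: run until the model
# reaches a specified quality target, then report the time to train.
# train_one_epoch/evaluate are hypothetical stand-ins for a real
# training loop and validation pass.
import time

def train_to_convergence(train_one_epoch, evaluate, target_accuracy, max_epochs=100):
    """Train until evaluate() meets target_accuracy; return (epochs, minutes)."""
    start = time.perf_counter()
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        if evaluate() >= target_accuracy:
            minutes = (time.perf_counter() - start) / 60
            return epoch, minutes
    raise RuntimeError(f"did not reach {target_accuracy} in {max_epochs} epochs")

# Toy stand-ins: accuracy improves a fixed amount each epoch.
state = {"acc": 0.60}
def train_one_epoch():
    state["acc"] += 0.05
def evaluate():
    return state["acc"]

epochs, minutes = train_to_convergence(train_one_epoch, evaluate, target_accuracy=0.72)
print(epochs)
```

MLPerf Training works the same way in spirit: the clock runs until the quality target in the tables below (e.g. 0.72 Mask-LM accuracy for BERT) is met, not for a fixed number of steps.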




NVIDIA Performance on MLPerf 5.0 Training Benchmarks


NVIDIA Performance on MLPerf 5.0’s AI Benchmarks: Single Node, Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NeMo | Llama2-70B-lora | 11 | 0.925 Eval loss | 8x GB200 | BM.GPU.GB200.4 | 5.0-0020 | Mixed | SCROLLS GovReport | NVIDIA Blackwell GPU (GB200) |
| NeMo | Llama2-70B-lora | 11.2 | 0.925 Eval loss | 8x B200 | SYS-422GA-NBRT-LCC | 5.0-0089 | Mixed | SCROLLS GovReport | NVIDIA Blackwell GPU (B200-SXM-180GB) |
| NeMo | Stable Diffusion | 12.9 | FID <= 90 and CLIP >= 0.15 | 8x GB200 | Tyche (1x NVIDIA GB200 NVL72) | 5.0-0071 | Mixed | LAION-400M-filtered | NVIDIA Blackwell GPU (GB200) |
| NeMo | Stable Diffusion | 13 | FID <= 90 and CLIP >= 0.15 | 8x B200 | SYS-422GA-NBRT-LCC | 5.0-0089 | Mixed | LAION-400M-filtered | NVIDIA Blackwell GPU (B200-SXM-180GB) |
| PyTorch | BERT | 3.4 | 0.72 Mask-LM accuracy | 8x GB200 | Tyche (1x NVIDIA GB200 NVL72) | 5.0-0072 | Mixed | Wikipedia 2020/01/01 | NVIDIA Blackwell GPU (GB200) |
| PyTorch | BERT | 3.5 | 0.72 Mask-LM accuracy | 8x B200 | 1xXE9680Lx8B200-SXM-180GB | 5.0-0033 | Mixed | Wikipedia 2020/01/01 | NVIDIA Blackwell GPU (B200-SXM-180GB) |
| PyTorch | RetinaNet | 22.3 | 34.0% mAP | 8x GB200 | Tyche (1x NVIDIA GB200 NVL72) | 5.0-0072 | Mixed | A subset of OpenImages | NVIDIA Blackwell GPU (GB200) |
| PyTorch | RetinaNet | 21.8 | 34.0% mAP | 8x B200 | AS-A126GS-TNBR | 5.0-0085 | Mixed | A subset of OpenImages | NVIDIA Blackwell GPU (B200-SXM-180GB) |
| DGL | R-GAT | 5 | 72.0% classification | 8x GB200 | Tyche (1x NVIDIA GB200 NVL72) | 5.0-0069 | Mixed | IGBH-Full | NVIDIA Blackwell GPU (GB200) |
| DGL | R-GAT | 5.1 | 72.0% classification | 8x B200 | G893-SD1 | 5.0-0046 | Mixed | IGBH-Full | NVIDIA Blackwell GPU (B200-SXM-180GB) |
| NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 2.2 | 0.80275 AUC | 8x GB200 | Tyche (1x NVIDIA GB200 NVL72) | 5.0-0070 | Mixed | Criteo 3.5TB Click Logs (multi-hot variant) | NVIDIA Blackwell GPU (GB200) |
| NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 2.3 | 0.80275 AUC | 8x B200 | Nyx (1x NVIDIA DGX B200) | 5.0-0061 | Mixed | Criteo 3.5TB Click Logs (multi-hot variant) | NVIDIA Blackwell GPU (B200-SXM-180GB) |

NVIDIA Performance on MLPerf 5.0’s AI Benchmarks: Multi Node, Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NVIDIA NeMo | Llama 3.1 405B | 240.3 | 5.6 log perplexity | 256x GB200 | Tyche (4x NVIDIA GB200 NVL72) | 5.0-0075 | Mixed | c4/en/3.0.1 | NVIDIA Blackwell GPU (GB200) |
| NVIDIA NeMo | Llama 3.1 405B | 121.1 | 5.6 log perplexity | 512x GB200 | Carina (8x NVIDIA GB200 NVL72) | 5.0-0005 | Mixed | c4/en/3.0.1 | NVIDIA Blackwell GPU (GB200) |
| NVIDIA NeMo | Llama 3.1 405B | 62.1 | 5.6 log perplexity | 1,024x GB200 | Carina (16x NVIDIA GB200 NVL72) | 5.0-0001 | Mixed | c4/en/3.0.1 | NVIDIA Blackwell GPU (GB200) |
| NVIDIA NeMo | Llama 3.1 405B | 27.3 | 5.6 log perplexity | 2,496x GB200 | Carina (39x NVIDIA GB200 NVL72) | 5.0-0004 | Mixed | c4/en/3.0.1 | NVIDIA Blackwell GPU (GB200) |
| NVIDIA NeMo | Llama 3.1 405B | 20.8 | 5.6 log perplexity | 8,192x H100 | Eos-dfw (1024x NVIDIA HGX H100) | 5.0-0010 | Mixed | c4/en/3.0.1 | NVIDIA H100-SXM5-80GB |
| NVIDIA NeMo | Llama2-70B-lora | 1.9 | 0.925 Eval loss | 64x GB200 | 16xXE9712x4GB200 | 5.0-0031 | Mixed | SCROLLS GovReport | NVIDIA Blackwell GPU (GB200) |
| NVIDIA NeMo | Llama2-70B-lora | 1.1 | 0.925 Eval loss | 144x GB200 | Tyche (2x NVIDIA GB200 NVL72) | 5.0-0073 | Mixed | SCROLLS GovReport | NVIDIA Blackwell GPU (GB200) |
| NVIDIA NeMo | Llama2-70B-lora | 0.6 | 0.925 Eval loss | 512x GB200 | Tyche (8x NVIDIA GB200 NVL72) | 5.0-0076 | Mixed | SCROLLS GovReport | NVIDIA Blackwell GPU (GB200) |
| NVIDIA NeMo | Llama2-70B-lora | 6.1 | 0.925 Eval loss | 16x B200 | AS-4126GS-NBR-LCC_N2 | 5.0-0083 | Mixed | SCROLLS GovReport | NVIDIA Blackwell GPU (B200-SXM-180GB) |
| NVIDIA NeMo | Llama2-70B-lora | 2 | 0.925 Eval loss | 64x B200 | BM.GPU.B200.8 | 5.0-0018 | Mixed | SCROLLS GovReport | NVIDIA Blackwell GPU (B200-SXM-180GB) |
| NVIDIA NeMo | Stable Diffusion | 7.6 | FID <= 90 and CLIP >= 0.15 | 16x GB200 | 4xXE9712x4GB200 | 5.0-0040 | Mixed | LAION-400M-filtered | NVIDIA Blackwell GPU (GB200) |
| NVIDIA NeMo | Stable Diffusion | 4.3 | FID <= 90 and CLIP >= 0.15 | 32x GB200 | 8xXE9712x4GB200 | 5.0-0041 | Mixed | LAION-400M-filtered | NVIDIA Blackwell GPU (GB200) |
| NVIDIA NeMo | Stable Diffusion | 2.7 | FID <= 90 and CLIP >= 0.15 | 64x GB200 | 16xXE9712x4GB200 | 5.0-0031 | Mixed | LAION-400M-filtered | NVIDIA Blackwell GPU (GB200) |
| NVIDIA NeMo | Stable Diffusion | 1 | FID <= 90 and CLIP >= 0.15 | 512x GB200 | Tyche (8x NVIDIA GB200 NVL72) | 5.0-0076 | Mixed | LAION-400M-filtered | NVIDIA Blackwell GPU (GB200) |
| NVIDIA NeMo | Stable Diffusion | 2.8 | FID <= 90 and CLIP >= 0.15 | 64x B200 | BM.GPU.B200.8 | 5.0-0018 | Mixed | LAION-400M-filtered | NVIDIA Blackwell GPU (B200-SXM-180GB) |
| PyTorch | BERT | 2.1 | 0.72 Mask-LM accuracy | 16x GB200 | 4xXE9712x4GB200 | 5.0-0040 | Mixed | Wikipedia 2020/01/01 | NVIDIA Blackwell GPU (GB200) |
| PyTorch | BERT | 1.5 | 0.72 Mask-LM accuracy | 32x GB200 | 8xXE9712x4GB200 | 5.0-0041 | Mixed | Wikipedia 2020/01/01 | NVIDIA Blackwell GPU (GB200) |
| PyTorch | BERT | 0.7 | 0.72 Mask-LM accuracy | 64x GB200 | Tyche (1x NVIDIA GB200 NVL72) | 5.0-0065 | Mixed | Wikipedia 2020/01/01 | NVIDIA Blackwell GPU (GB200) |
| PyTorch | BERT | 0.3 | 0.72 Mask-LM accuracy | 512x GB200 | Tyche (8x NVIDIA GB200 NVL72) | 5.0-0077 | Mixed | Wikipedia 2020/01/01 | NVIDIA Blackwell GPU (GB200) |
| PyTorch | BERT | 2.3 | 0.72 Mask-LM accuracy | 16x B200 | 2xXE9680Lx8B200-SXM-180GB | 5.0-0037 | Mixed | Wikipedia 2020/01/01 | NVIDIA Blackwell GPU (B200-SXM-180GB) |
| PyTorch | RetinaNet | 12.3 | 34.0% mAP | 16x GB200 | 4xXE9712x4GB200 | 5.0-0040 | Mixed | A subset of OpenImages | NVIDIA Blackwell GPU (GB200) |
| PyTorch | RetinaNet | 9 | 34.0% mAP | 32x GB200 | 8xXE9712x4GB200 | 5.0-0041 | Mixed | A subset of OpenImages | NVIDIA Blackwell GPU (GB200) |
| PyTorch | RetinaNet | 4.3 | 34.0% mAP | 64x GB200 | 16xXE9712x4GB200 | 5.0-0031 | Mixed | A subset of OpenImages | NVIDIA Blackwell GPU (GB200) |
| PyTorch | RetinaNet | 1.4 | 34.0% mAP | 512x GB200 | Tyche (8x NVIDIA GB200 NVL72) | 5.0-0077 | Mixed | A subset of OpenImages | NVIDIA Blackwell GPU (GB200) |
| PyTorch | RetinaNet | 14 | 34.0% mAP | 16x B200 | 2xXE9680Lx8B200-SXM-180GB | 5.0-0037 | Mixed | A subset of OpenImages | NVIDIA Blackwell GPU (B200-SXM-180GB) |
| PyTorch | RetinaNet | 4.4 | 34.0% mAP | 64x B200 | BM.GPU.B200.8 | 5.0-0018 | Mixed | A subset of OpenImages | NVIDIA Blackwell GPU (B200-SXM-180GB) |
| DGL | R-GAT | 1.1 | 72.0% classification | 72x GB200 | Tyche (1x NVIDIA GB200 NVL72) | 5.0-0066 | Mixed | IGBH-Full | NVIDIA Blackwell GPU (GB200) |
| DGL | R-GAT | 0.8 | 72.0% classification | 256x GB200 | Tyche (4x NVIDIA GB200 NVL72) | 5.0-0074 | Mixed | IGBH-Full | NVIDIA Blackwell GPU (GB200) |
| NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 0.7 | 0.80275 AUC | 64x GB200 | SRS-GB200-NVL72-M1 (16x ARS-121GL-NBO) | 5.0-0087 | Mixed | Criteo 3.5TB Click Logs (multi-hot variant) | NVIDIA Blackwell GPU (GB200) |
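One way to read the Llama 3.1 405B rows above: doubling the GPU count should ideally halve the time to train, so dividing the measured speedup by the ideal speedup gives scaling efficiency. A quick check against the GB200 numbers from the table:

```python
# Weak-scaling efficiency for the Llama 3.1 405B GB200 submissions above.
# Each tuple is (GPU count, time to train in minutes) from the table.
runs = [(256, 240.3), (512, 121.1), (1024, 62.1), (2496, 27.3)]

base_gpus, base_minutes = runs[0]
for gpus, minutes in runs[1:]:
    ideal_speedup = gpus / base_gpus          # e.g. 2x GPUs -> ideally 2x faster
    measured_speedup = base_minutes / minutes # what the table actually shows
    efficiency = measured_speedup / ideal_speedup
    print(f"{gpus:>5} GPUs: {efficiency:.1%} scaling efficiency")
```

On these numbers, efficiency stays above 90% even at 2,496 GPUs, which is the point of reporting the same workload at multiple scales.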

MLPerf™ v5.0 Training Closed: 5.0-0001, 5.0-0004, 5.0-0005, 5.0-0010, 5.0-0018, 5.0-0020, 5.0-0031, 5.0-0033, 5.0-0037, 5.0-0040, 5.0-0041, 5.0-0046, 5.0-0061, 5.0-0065, 5.0-0066, 5.0-0068, 5.0-0069, 5.0-0070, 5.0-0071, 5.0-0072, 5.0-0073, 5.0-0074, 5.0-0075, 5.0-0076, 5.0-0077, 5.0-0083, 5.0-0085, 5.0-0087, 5.0-0089 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
For Training rules and guidelines, click here


NVIDIA Performance on MLPerf 3.0’s Training HPC Benchmarks: Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PyTorch | CosmoFlow | 2.1 | Mean average error 0.124 | 512x H100 | eos | 3.0-8006 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | H100-SXM5-80GB |
| PyTorch | DeepCAM | 0.8 | IOU 0.82 | 2,048x H100 | eos | 3.0-8007 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | H100-SXM5-80GB |
| PyTorch | OpenCatalyst | 10.7 | Forces mean absolute error 0.036 | 640x H100 | eos | 3.0-8008 | Mixed | Open Catalyst 2020 (OC20) S2EF 2M training split, ID validation set | H100-SXM5-80GB |
| PyTorch | OpenFold | 7.5 | Local Distance Difference Test (lDDT-Cα) >= 0.8 | 2,080x H100 | eos | 3.0-8009 | Mixed | OpenProteinSet and Protein Data Bank | H100-SXM5-80GB |

MLPerf™ v3.0 Training HPC Closed: 3.0-8006, 3.0-8007, 3.0-8008, 3.0-8009 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
For MLPerf™ v3.0 Training HPC rules and guidelines, click here



LLM Training Performance on NVIDIA Data Center Products


B200 Training Performance



| Framework | Model | Time to Train (days) | Throughput per GPU | GPU | Server | Container Version | Sequence Length | TP | PP | CP | Precision | Global Batch Size | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NeMo | Llama3 8B | 0.4 | 30,131 tokens/sec | 8x B200 | DGX B200 | nemo:25.07 | 8192 | 1 | 1 | 1 | FP8 | 128 | NVIDIA B200 |
| NeMo | Llama3 70B | 3.1 | 3,690 tokens/sec | 64x B200 | DGX B200 | nemo:25.07 | 8192 | 1 | 1 | 1 | FP8 | 128 | NVIDIA B200 |
| NeMo | Llama3 405B | 16.8 | 674 tokens/sec | 128x B200 | DGX B200 | nemo:25.07 | 8192 | 4 | 8 | 2 | FP8 | 64 | NVIDIA B200 |
| NeMo | Llama4 Scout | 1 | 11,260 tokens/sec | 64x B200 | DGX B200 | nemo:25.07 | 8192 | 1 | 2 | 1 | FP8 | 1024 | NVIDIA B200 |
| NeMo | Llama4 Maverick | 1.2 | 9,811 tokens/sec | 128x B200 | DGX B200 | nemo:25.07 | 8192 | 1 | 2 | 1 | FP8 | 1024 | NVIDIA B200 |
| NeMo | Mixtral 8x7B | 0.7 | 16,384 tokens/sec | 64x B200 | DGX B200 | nemo:25.07 | 4096 | 1 | 1 | 1 | FP8 | 256 | NVIDIA B200 |
| NeMo | Mixtral 8x22B | 4.4 | 2,548 tokens/sec | 256x B200 | DGX B200 | nemo:25.07 | 65536 | 2 | 4 | 8 | FP8 | 64 | NVIDIA B200 |
| NeMo | Nemotron5-H 56B | 2.7 | 4,123 tokens/sec | 64x B200 | DGX B200 | nemo:25.07 | 8192 | 4 | 1 | 1 | FP8 | 192 | NVIDIA B200 |
| NeMo | DeepSeek v3 | 6.9 | 1,640 tokens/sec | 256x B200 | DGX B200 | nemo:25.07 | 4096 | 2 | 16 | 1 | FP8 | 2048 | NVIDIA B200 |

TP: Tensor Parallelism
PP: Pipeline Parallelism
CP: Context Parallelism
Time to Train is the estimated time to train on 1T tokens with 1K GPUs.
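The time-to-train estimate follows directly from the per-GPU throughput, and the parallelism columns must factor the GPU count. A short sketch of both relationships (assuming "1K GPUs" means 1,024; the table's own rounding may differ slightly):

```python
# Estimated days to train = 1T tokens / (tokens/sec/GPU x GPU count) / 86,400 sec/day.
TOKENS = 1e12
GPUS = 1024  # assumption: "1K GPUs" read as 1,024

def days_to_train(tokens_per_sec_per_gpu: float) -> float:
    return TOKENS / (tokens_per_sec_per_gpu * GPUS) / 86_400

# Llama3 8B at 30,131 tokens/sec/GPU from the table -> ~0.4 days.
print(round(days_to_train(30_131), 1))

# The parallelism columns factor the job: GPUs = TP x PP x CP x DP,
# where DP (data parallelism) is the remaining dimension.
# Llama3 405B row: 128 GPUs with TP=4, PP=8, CP=2.
tp, pp, cp, total_gpus = 4, 8, 2, 128
dp = total_gpus // (tp * pp * cp)
print(dp)
```

The same arithmetic applied to any row recovers its Time to Train column to within rounding.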




AI Inference

Real-world inference demands high throughput and low latency with maximum efficiency across use cases. An industry-leading solution lets customers quickly deploy AI models into real-world production with the highest performance from data center to edge.


AI Pipeline

NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.
