AI Training

Deploying AI in real-world applications requires training networks to convergence at a specified quality target. Time to train to that target is the most meaningful test of whether an AI system is ready to be deployed in the field and deliver useful results.
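MLPerf Training measures exactly this: the wall-clock time until a model first reaches the benchmark's quality target. Below is a minimal sketch of that train-to-target measurement; the `train_one_epoch` and `evaluate` callables are hypothetical placeholders, not MLPerf reference code.

```python
import time

def time_to_train(model, train_one_epoch, evaluate, quality_target, max_epochs=100):
    """Minutes of wall-clock time until `evaluate` first meets the target.

    For lower-is-better targets (e.g. FID or log perplexity) the
    comparison flips to <=.
    """
    start = time.perf_counter()
    for _ in range(max_epochs):
        train_one_epoch(model)
        if evaluate(model) >= quality_target:   # e.g. 0.72 Mask-LM accuracy for BERT
            return (time.perf_counter() - start) / 60.0
    raise RuntimeError("quality target not reached within max_epochs")
```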




NVIDIA Performance on MLPerf 4.0 Training Benchmarks


NVIDIA Performance on MLPerf 4.0’s AI Benchmarks: Single Node, Closed Division

Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version
NVIDIA NeMo | Llama2-70B-lora | 24.7 | 0.925 cross entropy loss | 8x H200 | NVIDIA H200 | 4.0-0071 | Mixed | SCROLLs GovReport | H200-SXM5-141GB
NVIDIA NeMo | Llama2-70B-lora | 28.2 | 0.925 cross entropy loss | 8x H100 | Eos | 4.0-0050 | Mixed | SCROLLs GovReport | H100-SXM5-80GB
DGL | R-GAT | 7.7 | 72.0% classification | 8x H200 | NVIDIA H200 | 4.0-0068 | Mixed | IGBH-Full | H200-SXM5-141GB
DGL | R-GAT | 11.3 | 72.0% classification | 8x H100 | Eos | 4.0-0047 | Mixed | IGBH-Full | H100-SXM5-80GB
NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 3.5 | 0.80275 AUC | 8x H200 | NVIDIA H200 | 4.0-0070 | Mixed | Criteo 4TB | H200-SXM5-141GB
NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 3.9 | 0.80275 AUC | 8x H100 | Eos | 4.0-0049 | Mixed | Criteo 4TB | H100-SXM5-80GB
NVIDIA NeMo | Stable Diffusion v2.0 | 41.3 | FID<=90 and CLIP>=0.15 | 8x H200 | NVIDIA H200 | 4.0-0071 | Mixed | LAION-400M-filtered | H200-SXM5-141GB
NVIDIA NeMo | Stable Diffusion v2.0 | 42.2 | FID<=90 and CLIP>=0.15 | 8x H100 | Eos | 4.0-0050 | Mixed | LAION-400M-filtered | H100-SXM5-80GB
PyTorch | BERT | 5.2 | 0.72 Mask-LM accuracy | 8x H200 | NVIDIA H200 | 4.0-0072 | Mixed | Wikipedia 2020/01/01 | H200-SXM5-141GB
PyTorch | BERT | 5.5 | 0.72 Mask-LM accuracy | 8x H100 | Eos | 4.0-0052 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB
PyTorch | RetinaNet | 34.3 | 34.0% mAP | 8x H200 | NVIDIA H200 | 4.0-0073 | Mixed | Subset of OpenImages | H200-SXM5-141GB
PyTorch | RetinaNet | 35.5 | 34.0% mAP | 8x H100 | Eos | 4.0-0051 | Mixed | Subset of OpenImages | H100-SXM5-80GB
MXNet | 3D U-Net | 11.5 | 0.908 Mean DICE score | 8x H200 | NVIDIA H200 | 4.0-0069 | Mixed | KiTS19 | H200-SXM5-141GB
MXNet | 3D U-Net | 12.1 | 0.908 Mean DICE score | 8x H100 | Eos | 4.0-0048 | Mixed | KiTS19 | H100-SXM5-80GB
MXNet | ResNet-50 v1.5 | 12.1 | 75.90% classification | 8x H200 | NVIDIA H200 | 4.0-0069 | Mixed | ImageNet | H200-SXM5-141GB
MXNet | ResNet-50 v1.5 | 13.3 | 75.90% classification | 8x H100 | Eos | 4.0-0048 | Mixed | ImageNet | H100-SXM5-80GB

NVIDIA Performance on MLPerf 4.0’s AI Benchmarks: Multi Node, Closed Division

Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version
NVIDIA NeMo | GPT3 | 50.7 | 2.69 log perplexity | 512x H100 | Eos_n64 | 4.0-0059 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB
NVIDIA NeMo | GPT3 | 3.7 | 2.69 log perplexity | 10,752x H100 | Eos-dfw_n1344 | 4.0-0006 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB
NVIDIA NeMo | GPT3 | 3.4 | 2.69 log perplexity | 11,616x H100 | Eos-dfw_n1452 | 4.0-0007 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB
NVIDIA NeMo | Llama2-70B-lora | 5.3 | 0.925 cross entropy loss | 64x H100 | Eos_n8 | 4.0-0063 | Mixed | SCROLLs GovReport | H100-SXM5-80GB
NVIDIA NeMo | Llama2-70B-lora | 1.5 | 0.925 cross entropy loss | 1,024x H100 | Eos_n128 | 4.0-0053 | Mixed | SCROLLs GovReport | H100-SXM5-80GB
DGL | R-GAT | 2.7 | 72.0% classification | 64x H100 | Eos_n8 | 4.0-0060 | Mixed | IGBH-Full | H100-SXM5-80GB
DGL | R-GAT | 1.1 | 72.0% classification | 512x H100 | Eos_n64 | 4.0-0058 | Mixed | IGBH-Full | H100-SXM5-80GB
NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 1.4 | 0.80275 AUC | 64x H100 | Eos_n8 | 4.0-0062 | Mixed | Criteo 4TB | H100-SXM5-80GB
NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 1.0 | 0.80275 AUC | 128x H100 | Eos_n16 | 4.0-0054 | Mixed | Criteo 4TB | H100-SXM5-80GB
NVIDIA NeMo | Stable Diffusion v2.0 | 6.7 | FID<=90 and CLIP>=0.15 | 64x H100 | Eos_n8 | 4.0-0063 | Mixed | LAION-400M-filtered | H100-SXM5-80GB
NVIDIA NeMo | Stable Diffusion v2.0 | 1.8 | FID<=90 and CLIP>=0.15 | 512x H100 | Eos_n64 | 4.0-0059 | Mixed | LAION-400M-filtered | H100-SXM5-80GB
NVIDIA NeMo | Stable Diffusion v2.0 | 1.4 | FID<=90 and CLIP>=0.15 | 1,024x H100 | Eos_n128 | 4.0-0053 | Mixed | LAION-400M-filtered | H100-SXM5-80GB
PyTorch | BERT | 0.9 | 0.72 Mask-LM accuracy | 64x H100 | Eos_n8 | 4.0-0064 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB
PyTorch | BERT | 0.1 | 0.72 Mask-LM accuracy | 3,472x H100 | Eos_n434 | 4.0-0057 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB
PyTorch | RetinaNet | 6.1 | 34.0% mAP | 64x H100 | Eos_n8 | 4.0-0065 | Mixed | Subset of OpenImages | H100-SXM5-80GB
PyTorch | RetinaNet | 0.8 | 34.0% mAP | 2,528x H100 | Eos_n316 | 4.0-0056 | Mixed | Subset of OpenImages | H100-SXM5-80GB
MXNet | 3D U-Net | 1.9 | 0.908 Mean DICE score | 72x H100 | Eos_n9 | 4.0-0066 | Mixed | KiTS19 | H100-SXM5-80GB
MXNet | 3D U-Net | 0.8 | 0.908 Mean DICE score | 768x H100 | Eos_n96 | 4.0-0067 | Mixed | KiTS19 | H100-SXM5-80GB
MXNet | ResNet-50 v1.5 | 2.5 | 75.90% classification | 64x H100 | Eos_n8 | 4.0-0061 | Mixed | ImageNet | H100-SXM5-80GB
MXNet | ResNet-50 v1.5 | 0.2 | 75.90% classification | 3,584x H100 | NVIDIA+CoreWeave Joint Submission | 4.0-0008 | Mixed | ImageNet | H100-SXM5-80GB

MLPerf™ v4.0 Training Closed: 4.0-0006, 4.0-0007, 4.0-0008, 4.0-0047, 4.0-0048, 4.0-0049, 4.0-0050, 4.0-0051, 4.0-0052, 4.0-0053, 4.0-0054, 4.0-0055, 4.0-0056, 4.0-0057, 4.0-0058, 4.0-0059, 4.0-0060, 4.0-0061, 4.0-0062, 4.0-0063, 4.0-0064, 4.0-0065, 4.0-0066, 4.0-0067, 4.0-0068, 4.0-0069, 4.0-0070, 4.0-0071, 4.0-0072, 4.0-0073 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
For Training rules and guidelines, click here
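The multi-node rows show the same benchmark at several scales; a quick way to read them is scaling efficiency, the achieved speedup divided by the ideal (linear) speedup. A small sketch of that arithmetic, using the R-GAT rows above (the helper name is our own, not MLPerf tooling):

```python
def scaling_efficiency(t_small: float, gpus_small: int,
                       t_large: float, gpus_large: int) -> float:
    """Achieved speedup over the ideal linear speedup when scaling out."""
    achieved = t_small / t_large
    ideal = gpus_large / gpus_small
    return achieved / ideal

# R-GAT above: 2.7 min on 64 GPUs vs 1.1 min on 512 GPUs.
print(f"{scaling_efficiency(2.7, 64, 1.1, 512):.0%}")  # ~31%
```

Efficiency well below 100% at extreme scale is expected: as time to train shrinks toward a minute, fixed startup and evaluation costs dominate.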


NVIDIA Performance on MLPerf 3.0’s Training HPC Benchmarks: Closed Division

Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version
PyTorch | CosmoFlow | 2.1 | Mean average error 0.124 | 512x H100 | Eos | 3.0-8006 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | H100-SXM5-80GB
PyTorch | DeepCAM | 0.8 | IOU 0.82 | 2,048x H100 | Eos | 3.0-8007 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | H100-SXM5-80GB
PyTorch | OpenCatalyst | 10.7 | Forces mean absolute error 0.036 | 640x H100 | Eos | 3.0-8008 | Mixed | Open Catalyst 2020 (OC20) S2EF 2M training split, ID validation set | H100-SXM5-80GB
PyTorch | OpenFold | 7.5 | Local Distance Difference Test (lDDT-Cα) >= 0.8 | 2,080x H100 | Eos | 3.0-8009 | Mixed | OpenProteinSet and Protein Data Bank | H100-SXM5-80GB

MLPerf™ v3.0 Training HPC Closed: 3.0-8006, 3.0-8007, 3.0-8008, 3.0-8009 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
For MLPerf™ v3.0 Training HPC rules and guidelines, click here



LLM Training Performance on NVIDIA Data Center Products


H100 Training Performance



Framework | Framework Version | Network | Time to Train (days) | Throughput per GPU | GPU | Server | Container | Sequence Length | TP | PP | Precision | Global Batch Size | GPU Version
NeMo | 1.23 | GPT3 5B | 0.5 | 23,574 tokens/sec | 64x H100 | Eos | nemo:24.03 | 2,048 | 1 | 1 | FP8 | 2,048 | H100 SXM5 80GB
NeMo | 1.23 | GPT3 20B | 2 | 5,528 tokens/sec | 64x H100 | Eos | nemo:24.03 | 2,048 | 2 | 1 | FP8 | 256 | H100 SXM5 80GB
NeMo | 1.23 | Llama2 7B | 0.7 | 16,290 tokens/sec | 8x H100 | Eos | nemo:24.03 | 4,096 | 1 | 1 | FP8 | 128 | H100 SXM5 80GB
NeMo | 1.23 | Llama2 13B | 1.4 | 8,317 tokens/sec | 16x H100 | Eos | nemo:24.03 | 4,096 | 1 | 4 | FP8 | 128 | H100 SXM5 80GB
NeMo | 1.23 | Llama2 70B | 6.6 | 1,725 tokens/sec | 64x H100 | Eos | nemo:24.03 | 4,096 | 4 | 4 | FP8 | 128 | H100 SXM5 80GB

TP: Tensor Parallelism
PP: Pipeline Parallelism
Time to Train is the estimated time to train on 1T tokens with 1K GPUs, extrapolated from the measured per-GPU throughput
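As a sanity check on how these columns fit together: GPUs are partitioned into TP x PP model-parallel groups, with the remaining factor replicated data-parallel, and the estimated training time follows directly from per-GPU throughput. A small sketch of that arithmetic (helper names are our own; linear scaling to 1K GPUs is the assumption stated above):

```python
def data_parallel_size(num_gpus: int, tp: int, pp: int) -> int:
    """Data-parallel replica count implied by TP x PP model parallelism."""
    assert num_gpus % (tp * pp) == 0, "TP * PP must divide the GPU count"
    return num_gpus // (tp * pp)

def estimated_days_to_train(tokens_per_sec_per_gpu: float,
                            num_gpus: int = 1_000,
                            total_tokens: float = 1e12) -> float:
    """Assumes throughput scales linearly to num_gpus, as in the table's estimate."""
    return total_tokens / (tokens_per_sec_per_gpu * num_gpus) / 86_400  # 86,400 s per day

# GPT3 5B row above: 64 GPUs with TP=1, PP=1 -> 64-way data parallel,
# and 23,574 tokens/sec/GPU -> ~0.5 days to 1T tokens on 1K GPUs.
print(data_parallel_size(64, tp=1, pp=1))          # 64
print(round(estimated_days_to_train(23_574), 1))   # 0.5
```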


Converged Training Performance on NVIDIA Data Center GPUs


H100 Training Performance



Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 2.3.0a0 | Tacotron2 | 67 | 0.56 Training Loss | 469,109 total output mels/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 128 | LJSpeech 1.1 | H100 SXM5 80GB
PyTorch | 2.3.0a0 | WaveGlow | 119 | -5.8 Training Loss | 3,645,916 output samples/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 10 | LJSpeech 1.1 | H100 SXM5 80GB
PyTorch | 2.3.0a0 | GNMT v2 | 9 | 24.15 BLEU Score | 1,699,570 total tokens/sec | 8x H100 | DGX H100 | 23.12-py3 | Mixed | 128 | wmt16-en-de | H100 SXM5 80GB
PyTorch | 2.3.0a0 | NCF | 0.2 | 0.96 Hit Rate at 10 | 218,094,053 samples/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 131072 | MovieLens 20M | H100 SXM5 80GB
PyTorch | 2.3.0a0 | FastPitch | 75 | 0.17 Training Loss | 1,331,733 frames/sec | 8x H100 | DGX H100 | 24.02-py3 | TF32 | 32 | LJSpeech 1.1 | H100 SXM5 80GB
PyTorch | 2.3.0a0 | Transformer XL Large | 318 | 17.83 Perplexity | 262,462 total tokens/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 16 | WikiText-103 | H100 SXM5 80GB
PyTorch | 2.3.0a0 | Transformer XL Base | 141 | 21.61 Perplexity | 952,253 total tokens/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 128 | WikiText-103 | H100 SXM5 80GB
PyTorch | 2.3.0a0 | EfficientNet-B4 | 1,667 | 82.02 Top 1 | 5,231 images/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 128 | Imagenet2012 | H100 SXM5 80GB
PyTorch | 2.1.0a0 | EfficientDet-D0 | 325 | 0.33 BBOX mAP | 2,658 images/sec | 8x H100 | DGX H100 | 23.10-py3 | Mixed | 150 | COCO 2017 | H100 SXM5 80GB
PyTorch | 2.3.0a0 | EfficientNet-WideSE-B4 | 1,673 | 82.01 Top 1 | 5,218 images/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 128 | Imagenet2012 | H100 SXM5 80GB
PyTorch | 2.2.0a0 | TFT-Electricity | 2 | 0.03 Test P90 | 145,082 items/sec | 8x H100 | DGX H100 | 23.12-py3 | Mixed | 1024 | Electricity | H100 SXM5 80GB
PyTorch | 2.3.0a0 | HiFiGAN | 948 | 9.42 Training Loss | 115,461 total output mels/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 16 | LJSpeech-1.1 | H100 SXM5 80GB
PyTorch | 2.3.0a0 | GPUNet-0 | 1,052 | 78.91 Top 1 | 9,950 images/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 192 | Imagenet2012 | H100 SXM5 80GB
PyTorch | 2.3.0a0 | GPUNet-1 | 960 | 80.45 Top 1 | 10,946 images/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 192 | Imagenet2012 | H100 SXM5 80GB
PyTorch | 2.3.0a0 | MoFlow | 35 | 89.67 NUV | 46,451 molecules/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 512 | ZINC | H100 SXM5 80GB
TensorFlow | 2.13.0 | U-Net Medical | 1 | 0.89 DICE Score | 2,139 images/sec | 8x H100 | DGX H100 | 23.10-py3 | Mixed | 8 | EM segmentation challenge | H100 SXM5 80GB
TensorFlow | 2.15.0 | Electra Fine Tuning | 2 | 92.59 F1 | 5,062 sequences/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 32 | SQuAD v1.1 | H100 SXM5 80GB
TensorFlow | 2.13.0 | Wide and Deep | 4 | 0.66 MAP at 12 | 12,217,033 samples/sec | 8x H100 | DGX H100 | 23.10-py3 | Mixed | 16384 | Kaggle Outbrain Click Prediction | H100 SXM5 80GB
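In the Precision column, "Mixed" corresponds to automatic mixed precision. For the PyTorch entries this is typically enabled with `torch.autocast` plus a gradient scaler, roughly as sketched below; the model, data, and loop here are toy placeholders, not the benchmark code.

```python
import torch

model = torch.nn.Linear(512, 10).cuda()                  # toy stand-in for a real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()                     # rescales loss to avoid FP16 underflow

for _ in range(10):                                      # toy training loop
    x = torch.randn(128, 512, device="cuda")
    y = torch.randint(0, 10, (128,), device="cuda")
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.cross_entropy(model(x), y)  # forward runs in FP16 where safe
    scaler.scale(loss).backward()                        # backward on the scaled loss
    scaler.step(optimizer)                               # unscale gradients, then update weights
    scaler.update()                                      # adjust the scale factor
```

The TF32 entries typically need no code change in NVIDIA's NGC containers, where TF32 math is enabled by default for FP32 matmuls and convolutions.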

A30 Training Performance


Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 2.3.0a0 | Tacotron2 | 131 | 0.52 Training Loss | 232,954 total output mels/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 104 | LJSpeech 1.1 | A30
PyTorch | 2.3.0a0 | WaveGlow | 403 | Training Loss | 1,042,579 output samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 10 | LJSpeech 1.1 | A30
PyTorch | 2.3.0a0 | GNMT v2 | 49 | 24.21 BLEU Score | 309,310 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 128 | wmt16-en-de | A30
PyTorch | 2.3.0a0 | NCF | 1 | 0.96 Hit Rate at 10 | 41,848,626 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 131072 | MovieLens 20M | A30
PyTorch | 2.3.0a0 | FastPitch | 156 | 0.17 Training Loss | 545,724 frames/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 16 | LJSpeech 1.1 | A30
PyTorch | 2.3.0a0 | Transformer XL Base | 198 | 22.87 Perplexity | 168,704 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 32 | WikiText-103 | A30
PyTorch | 2.3.0a0 | EfficientNet-B0 | 793 | 77.13 Top 1 | 11,235 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 128 | Imagenet2012 | A30
PyTorch | 2.3.0a0 | EfficientNet-WideSE-B0 | 820 | 77.21 Top 1 | 10,863 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 128 | Imagenet2012 | A30
PyTorch | 2.2.0a0 | MoFlow | 100 | 87.86 NUV | 12,351 molecules/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.12-py3 | Mixed | 512 | ZINC | A30
TensorFlow | 2.13.0 | U-Net Medical | 4 | 0.89 DICE Score | 460 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.10-py3 | Mixed | 8 | EM segmentation challenge | A30
TensorFlow | 2.15.0 | Electra Fine Tuning | 5 | 92.63 F1 | 1,024 sequences/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 16 | SQuAD v1.1 | A30
TensorFlow | 2.14.0 | SIM | 1 | 0.81 AUC | 2,481,945 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.12-py3 | Mixed | 16384 | Amazon Reviews | A30

A10 Training Performance


Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 2.3.0a0 | Tacotron2 | 144 | 0.53 Training Loss | 214,246 total output mels/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 104 | LJSpeech 1.1 | A10
PyTorch | 2.3.0a0 | WaveGlow | 541 | -5.73 Training Loss | 776,764 output samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 10 | LJSpeech 1.1 | A10
PyTorch | 2.3.0a0 | GNMT v2 | 53 | 24.2 BLEU Score | 282,447 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 128 | wmt16-en-de | A10
PyTorch | 2.3.0a0 | NCF | 2 | 0.96 Hit Rate at 10 | 32,920,397 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | TF32 | 131072 | MovieLens 20M | A10
PyTorch | 2.3.0a0 | FastPitch | 180 | 0.17 Training Loss | 460,415 frames/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 16 | LJSpeech 1.1 | A10
PyTorch | 2.3.0a0 | EfficientNet-B0 | 1,045 | 77.11 Top 1 | 8,625 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 128 | Imagenet2012 | A10
PyTorch | 2.3.0a0 | EfficientNet-WideSE-B0 | 1,076 | 77.31 Top 1 | 8,487 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 128 | Imagenet2012 | A10
PyTorch | 2.2.0a0 | MoFlow | 93 | 86.86 NUV | 13,184 molecules/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.12-py3 | Mixed | 512 | ZINC | A10
TensorFlow | 2.13.0 | U-Net Medical | 4 | 0.89 DICE Score | 352 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.10-py3 | Mixed | 8 | EM segmentation challenge | A10
TensorFlow | 2.15.0 | Electra Fine Tuning | 5 | 92.52 F1 | 826 sequences/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 16 | SQuAD v1.1 | A10
TensorFlow | 2.14.0 | SIM | 1 | 0.8 AUC | 2,346,013 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.12-py3 | Mixed | 16384 | Amazon Reviews | A10



View More Performance Data

AI Inference

Real-world inference demands high throughput and low latency with maximum efficiency across use cases. An industry-leading solution lets customers quickly deploy AI models into production with the highest performance from data center to edge.

Learn More

AI Pipeline

NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.

Learn More