AI Training

Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. Measuring the wall-clock time it takes a network to reach that target quality is the most reliable way to test whether an AI system is ready to be deployed in the field and deliver meaningful results.
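The "time to train" figures throughout this page follow that methodology: train until a quality target is reached, and report elapsed minutes. A minimal sketch of such a harness (the function names and toy metric stream here are invented for illustration, not taken from any benchmark code):

```python
import time

def time_to_train(train_one_epoch, evaluate, quality_target, max_epochs=100):
    """Train until the evaluation metric reaches the quality target.

    Returns wall-clock minutes to convergence. `train_one_epoch` and
    `evaluate` are placeholders for a real training loop and validation pass.
    """
    start = time.perf_counter()
    for _ in range(max_epochs):
        train_one_epoch()
        if evaluate() >= quality_target:
            return (time.perf_counter() - start) / 60.0
    raise RuntimeError("did not converge within max_epochs")

# Toy stand-ins: the "metric" improves each epoch until it crosses the target.
metric_stream = iter([0.60, 0.70, 0.759, 0.80])
minutes = time_to_train(lambda: None, lambda: next(metric_stream), 0.759)
```

Real MLPerf submissions additionally average over multiple runs and exclude certain setup steps, per the benchmark rules.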




NVIDIA Performance on MLPerf 3.1 Training Benchmarks


NVIDIA Performance on MLPerf 3.1’s AI Benchmarks: Single Node, Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NeMo | Stable Diffusion | 46.8 | FID <= 90 and CLIP >= 0.15 | 8x H100 | XE9680x8H100-SXM-80GB | 3.1-2019 | Mixed | LAION-400M-filtered | H100-SXM5-80GB |
| MXNet | ResNet-50 v1.5 | 13.4 | 75.90% classification | 8x H100 | ESC-N8-E11 | 3.1-2011 | Mixed | ImageNet | H100-SXM5-80GB |
| MXNet | 3D U-Net | 13.1 | 0.908 Mean DICE score | 8x H100 | AS-8125GS-TNHR | 3.1-2068 | Mixed | KiTS19 | H100-SXM5-80GB |
| PyTorch | BERT | 5.4 | 0.72 Mask-LM accuracy | 8x H100 | ESC-N8-E11 | 3.1-2011 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB |
| PyTorch | Mask R-CNN | 19.2 | 0.377 Box min AP and 0.339 Mask min AP | 8x H100 | Eos_n1 | 3.1-2048 | Mixed | COCO2017 | H100-SXM5-80GB |
| PyTorch | RNN-T | 16.2 | 0.058 Word Error Rate | 8x H100 | GIGABYTE G593-ZD2 | 3.1-2028 | Mixed | LibriSpeech | H100-SXM5-80GB |
| PyTorch | RetinaNet | 36.0 | 34.0% mAP | 8x H100 | ESC-N8-E11 | 3.1-2011 | Mixed | A subset of OpenImages | H100-SXM5-80GB |
| NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 3.9 | 0.80275 AUC | 8x H100 | Eos_n1 | 3.1-2047 | Mixed | Criteo 4TB | H100-SXM5-80GB |

NVIDIA Performance on MLPerf 3.1’s AI Benchmarks: Multi Node, Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NVIDIA NeMo | GPT3 | 58.3 | 2.69 log perplexity | 512x H100 | Eos_n64 | 3.1-2057 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 40.6 | 2.69 log perplexity | 768x H100 | Eos_n96 | 3.1-2065 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 8.6 | 2.69 log perplexity | 4,096x H100 | Eos-dfw_n512 | 3.1-2008 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 6.0 | 2.69 log perplexity | 6,144x H100 | Eos-dfw_n768 | 3.1-2009 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 4.9 | 2.69 log perplexity | 8,192x H100 | Eos-dfw_n1024 | 3.1-2005 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 4.1 | 2.69 log perplexity | 10,240x H100 | Eos-dfw_n1280 | 3.1-2006 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 3.9 | 2.69 log perplexity | 10,752x H100 | Eos-dfw_n1344 | 3.1-2007 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | Stable Diffusion | 10.0 | FID <= 90 and CLIP >= 0.15 | 64x H100 | Eos_n8 | 3.1-2060 | Mixed | LAION-400M-filtered | H100-SXM5-80GB |
| NVIDIA NeMo | Stable Diffusion | 2.9 | FID <= 90 and CLIP >= 0.15 | 512x H100 | Eos_n64 | 3.1-2055 | Mixed | LAION-400M-filtered | H100-SXM5-80GB |
| NVIDIA NeMo | Stable Diffusion | 2.5 | FID <= 90 and CLIP >= 0.15 | 1,024x H100 | Eos_n128 | 3.1-2050 | Mixed | LAION-400M-filtered | H100-SXM5-80GB |
| MXNet | ResNet-50 v1.5 | 2.5 | 75.90% classification | 64x H100 | Eos_n8 | 3.1-2058 | Mixed | ImageNet | H100-SXM5-80GB |
| MXNet | ResNet-50 v1.5 | 0.2 | 75.90% classification | 3,584x H100 | coreweave_hgxh100_n448_ngc23.04_mxnet | 3.1-2010 | Mixed | ImageNet | H100-SXM5-80GB |
| MXNet | 3D U-Net | 1.9 | 0.908 Mean DICE score | 72x H100 | Eos_n9 | 3.1-2063 | Mixed | KiTS19 | H100-SXM5-80GB |
| MXNet | 3D U-Net | 0.8 | 0.908 Mean DICE score | 768x H100 | Eos_n96 | 3.1-2064 | Mixed | KiTS19 | H100-SXM5-80GB |
| PyTorch | BERT | 0.9 | 0.72 Mask-LM accuracy | 64x H100 | Eos_n8 | 3.1-2061 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB |
| PyTorch | BERT | 0.1 | 0.72 Mask-LM accuracy | 3,472x H100 | Eos_n434 | 3.1-2053 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB |
| PyTorch | Mask R-CNN | 4.3 | 0.377 Box min AP and 0.339 Mask min AP | 64x H100 | Eos_n8 | 3.1-2061 | Mixed | COCO2017 | H100-SXM5-80GB |
| PyTorch | Mask R-CNN | 1.5 | 0.377 Box min AP and 0.339 Mask min AP | 384x H100 | Eos_n48 | 3.1-2054 | Mixed | COCO2017 | H100-SXM5-80GB |
| PyTorch | RNN-T | 4.2 | 0.058 Word Error Rate | 64x H100 | Eos_n8 | 3.1-2061 | Mixed | LibriSpeech | H100-SXM5-80GB |
| PyTorch | RNN-T | 1.7 | 0.058 Word Error Rate | 512x H100 | Eos_n64 | 3.1-2056 | Mixed | LibriSpeech | H100-SXM5-80GB |
| PyTorch | RetinaNet | 6.1 | 34.0% mAP | 64x H100 | Eos_n8 | 3.1-2062 | Mixed | A subset of OpenImages | H100-SXM5-80GB |
| PyTorch | RetinaNet | 0.9 | 34.0% mAP | 2,048x H100 | Eos_n256 | 3.1-2052 | Mixed | A subset of OpenImages | H100-SXM5-80GB |
| NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 1.4 | 0.80275 AUC | 64x H100 | Eos_n8 | 3.1-2059 | Mixed | Criteo 4TB | H100-SXM5-80GB |
| NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 1.0 | 0.80275 AUC | 128x H100 | Eos_n16 | 3.1-2051 | Mixed | Criteo 4TB | H100-SXM5-80GB |

MLPerf™ v3.1 Training Closed: 3.1-2005, 3.1-2006, 3.1-2007, 3.1-2008, 3.1-2009, 3.1-2010, 3.1-2011, 3.1-2019, 3.1-2028, 3.1-2047, 3.1-2048, 3.1-2050, 3.1-2051, 3.1-2052, 3.1-2053, 3.1-2054, 3.1-2055, 3.1-2056, 3.1-2057, 3.1-2058, 3.1-2059, 3.1-2060, 3.1-2061, 3.1-2062, 3.1-2063, 3.1-2064, 3.1-2065, 3.1-2068 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
Training rules and guidelines are available from MLCommons at https://mlcommons.org/.


NVIDIA Performance on MLPerf 3.0’s Training HPC Benchmarks: Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PyTorch | CosmoFlow | 2.1 | Mean average error 0.124 | 512x H100 | eos | 3.0-8006 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | H100-SXM5-80GB |
| PyTorch | DeepCAM | 0.8 | IOU 0.82 | 2,048x H100 | eos | 3.0-8007 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | H100-SXM5-80GB |
| PyTorch | OpenCatalyst | 10.7 | Forces mean absolute error 0.036 | 640x H100 | eos | 3.0-8008 | Mixed | Open Catalyst 2020 (OC20) S2EF 2M training split, ID validation set | H100-SXM5-80GB |
| PyTorch | OpenFold | 7.5 | Local Distance Difference Test (lDDT-Cα) >= 0.8 | 2,080x H100 | eos | 3.0-8009 | Mixed | OpenProteinSet and Protein Data Bank | H100-SXM5-80GB |

MLPerf™ v3.0 Training HPC Closed: 3.0-8006, 3.0-8007, 3.0-8008, 3.0-8009 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
MLPerf™ v3.0 Training HPC rules and guidelines are available from MLCommons at https://mlcommons.org/.



LLM Training Performance on NVIDIA Data Center Products


H100 Training Performance



| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Sequence Length | TP | PP | Precision | Batch Size | Dataset | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NeMo | 1.19.0 | GPT3 5B | 1,290,000 tokens/sec | 64x H100 | DGX H100 | 23.05-py3 | 2048 | 1 | 1 | FP8 | 16 | PILE | H100 SXM5-80GB |
| NeMo | 1.19.0 | GPT3 20B | 311,000 tokens/sec | 64x H100 | DGX H100 | 23.05-py3 | 2048 | 4 | 1 | FP8 | 4 | PILE | H100 SXM5-80GB |
| NeMo | 1.21.0 | Llama2 7B | 102,200 tokens/sec | 8x H100 | DGX H100 | 23.08-py3 | 4096 | 1 | 1 | FP8 | 16 | PILE | H100 SXM5-80GB |
| NeMo | 1.21.0 | Llama2 13B | 101,803 tokens/sec | 16x H100 | DGX H100 | 23.08-py3 | 4096 | 2 | 1 | FP8 | 8 | PILE | H100 SXM5-80GB |

TP: Tensor Parallelism
PP: Pipeline Parallelism
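The TP and PP columns combine with data parallelism to account for every GPU in a run: GPU count = TP × PP × DP. A small sketch (the helper name is ours, not from any NVIDIA API) that recovers the implied number of data-parallel replicas for the rows above:

```python
def data_parallel_size(num_gpus, tp, pp):
    """Data-parallel replicas implied by a TP x PP x DP decomposition."""
    assert num_gpus % (tp * pp) == 0, "TP*PP must evenly divide the GPU count"
    return num_gpus // (tp * pp)

# Rows from the table above: (GPU count, TP, PP)
dp_gpt3_20b = data_parallel_size(64, 4, 1)    # GPT3 20B: 16 data-parallel replicas
dp_llama2_13b = data_parallel_size(16, 2, 1)  # Llama2 13B: 8 data-parallel replicas
```

Larger models take larger TP so each shard fits in one GPU's memory, at the cost of fewer data-parallel replicas for the same GPU count.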


Converged Training Performance on NVIDIA Data Center GPUs


H100 Training Performance



| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PyTorch | 2.2.0a0 | Tacotron2 | 66 | .56 Training Loss | 476,492 total output mels/sec | 8x H100 | DGX H100 | 23.11-py3 | Mixed | 128 | LJSpeech 1.1 | H100 SXM5-80GB |
| PyTorch | 2.2.0a0 | WaveGlow | 120 | -5.7 Training Loss | 3,651,314 output samples/sec | 8x H100 | DGX H100 | 23.11-py3 | Mixed | 10 | LJSpeech 1.1 | H100 SXM5-80GB |
| PyTorch | 2.2.0a0 | GNMT v2 | 11 | 24.46 BLEU Score | 1,693,303 total tokens/sec | 8x H100 | DGX H100 | 23.11-py3 | Mixed | 128 | wmt16-en-de | H100 SXM5-80GB |
| PyTorch | 2.1.0a0 | Transformer | 107 | 27.77 BLEU Score | 945,566 tokens/sec | 8x H100 | DGX H100 | 23.08-py3 | Mixed | 10240 | wmt14-en-de | H100 SXM5-80GB |
| PyTorch | 2.2.0a0 | EfficientNet-B4 | 1,666 | 81.79 Top 1 | 5,233 images/sec | 8x H100 | DGX H100 | 23.11-py3 | Mixed | 128 | Imagenet2012 | H100 SXM5-80GB |
| PyTorch | 2.1.0a0 | EfficientDet-D0 | 325 | .33 BBOX mAP | 2,658 images/sec | 8x H100 | DGX H100 | 23.10-py3 | Mixed | 150 | COCO 2017 | H100 SXM5-80GB |
| PyTorch | 2.1.0a0 | EfficientNet-WideSE-B4 | 1,676 | 82.14 Top 1 | 5,207 images/sec | 8x H100 | DGX H100 | 23.10-py3 | Mixed | 128 | Imagenet2012 | H100 SXM5-80GB |
| Tensorflow | 2.13.0 | U-Net Medical | 1 | .89 DICE Score | 2,139 images/sec | 8x H100 | DGX H100 | 23.10-py3 | Mixed | 8 | EM segmentation challenge | H100 SXM5-80GB |
| Tensorflow | 2.14.0 | Electra Fine Tuning | 2 | 92.39 F1 | 4,199 sequences/sec | 8x H100 | DGX H100 | 23.11-py3 | Mixed | 32 | SQuaD v1.1 | H100 SXM5-80GB |
| Tensorflow | 2.13.0 | Wide and Deep | 4 | .66 MAP at 12 | 12,217,033 samples/sec | 8x H100 | DGX H100 | 23.10-py3 | Mixed | 16384 | Kaggle Outbrain Click Prediction | H100 SXM5-80GB |
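Throughput and time to train jointly imply how many samples a converged run processed, which makes a useful plausibility check on the table. For example, for the EfficientNet-B4 row above (assuming the standard ImageNet-2012 training-set size of roughly 1.28M images, which the table does not state):

```python
DATASET_SIZE = 1_281_167  # ImageNet-2012 train-set size (our assumption)

def implied_epochs(images_per_sec, minutes, dataset_size=DATASET_SIZE):
    """Epochs implied by sustained throughput over the full training run."""
    return images_per_sec * minutes * 60 / dataset_size

# EfficientNet-B4 row above: 5,233 images/sec for 1,666 minutes,
# i.e. roughly 400+ epochs' worth of images over the run.
epochs = implied_epochs(5_233, 1_666)
```

This is a rough cross-check, not an exact schedule: it ignores validation passes, data-loading stalls, and warm-up, so the true epoch count is somewhat lower than the implied figure.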

A30 Training Performance


| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PyTorch | 2.2.0a0 | Tacotron2 | 130 | .52 Training Loss | 243,862 total output mels/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | Mixed | 104 | LJSpeech 1.1 | A30 |
| PyTorch | 2.2.0a0 | WaveGlow | 398 | -5.73 Training Loss | 1,045,568 output samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | Mixed | 10 | LJSpeech 1.1 | A30 |
| PyTorch | 2.2.0a0 | GNMT v2 | 49 | 24.21 BLEU Score | 311,773 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | Mixed | 128 | wmt16-en-de | A30 |
| PyTorch | 2.2.0a0 | NCF | 1 | .96 Hit Rate at 10 | 41,936,790 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | Mixed | 131072 | MovieLens 20M | A30 |
| PyTorch | 2.1.0a0 | FastPitch | 159 | .17 Training Loss | 531,205 frames/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 16 | LJSpeech 1.1 | A30 |
| PyTorch | 2.2.0a0 | EfficientNet-B0 | 784 | 77.1 Top 1 | 11,345 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | Mixed | 128 | Imagenet2012 | A30 |
| PyTorch | 2.1.0a0 | EfficientDet-D0 | 933 | .34 BBOX mAP | 796 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 30 | COCO 2017 | A30 |
| PyTorch | 2.2.0a0 | EfficientNet-WideSE-B0 | 800 | 77.45 Top 1 | 11,208 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | Mixed | 128 | Imagenet2012 | A30 |
| Tensorflow | 2.13.0 | U-Net Medical | 4 | .89 DICE Score | 460 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.10-py3 | Mixed | 8 | EM segmentation challenge | A30 |
| Tensorflow | 2.14.0 | Electra Fine Tuning | 5 | 92.65 F1 | 927 sequences/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | Mixed | 16 | SQuaD v1.1 | A30 |
| Tensorflow | 2.14.0 | SIM | 1 | .79 AUC | 2,458,154 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | Mixed | 16384 | Amazon Reviews | A30 |

A10 Training Performance


| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PyTorch | 2.2.0a0 | Tacotron2 | 142 | .51 Training Loss | 218,286 total output mels/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.11-py3 | Mixed | 104 | LJSpeech 1.1 | A10 |
| PyTorch | 2.2.0a0 | GNMT v2 | 53 | 24.2 BLEU Score | 283,540 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.11-py3 | Mixed | 128 | wmt16-en-de | A10 |
| PyTorch | 2.2.0a0 | NCF | 1 | .96 Hit Rate at 10 | 35,727,590 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.11-py3 | Mixed | 131072 | MovieLens 20M | A10 |
| PyTorch | 2.1.0a0 | FastPitch | 189 | .17 Training Loss | 439,357 frames/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 16 | LJSpeech 1.1 | A10 |
| PyTorch | 2.2.0a0 | EfficientNet-B0 | 1,035 | 76.81 Top 1 | 8,575 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.11-py3 | Mixed | 128 | Imagenet2012 | A10 |
| PyTorch | 2.2.0a0 | EfficientDet-D0 | 964 | .34 BBOX mAP | 720 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 30 | COCO 2017 | A10 |
| PyTorch | 2.2.0a0 | EfficientNet-WideSE-B0 | 1,090 | 76.98 Top 1 | 8,237 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.11-py3 | Mixed | 128 | Imagenet2012 | A10 |
| Tensorflow | 2.13.0 | U-Net Medical | 4 | .89 DICE Score | 352 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.10-py3 | Mixed | 8 | EM segmentation challenge | A10 |
| Tensorflow | 2.14.0 | Electra Fine Tuning | 6 | 92.63 F1 | 788 sequences/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.11-py3 | Mixed | 16 | SQuaD v1.1 | A10 |
| Tensorflow | 2.14.0 | SIM | 1 | .8 AUC | 2,348,730 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.11-py3 | Mixed | 16384 | Amazon Reviews | A10 |


Converged Training Performance of NVIDIA Data Center GPUs on Cloud

A100 Training Performance


| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MXNet | - | ResNet-50 v1.5 | 75 | 77.04 Top 1 | 26,115 images/sec | 8x A100 | Azure Standard_ND96amsr_A100_v4 | 23.09-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-80GB |
| MXNet | - | ResNet-50 v1.5 | 85 | 77.06 Top 1 | 23,159 images/sec | 8x A100 | GCP A2-HIGHGPU-8G | 23.08-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-40GB |
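For a fixed recipe, time to train scales roughly inversely with sustained throughput, which the two cloud rows above bear out. A sketch of that estimate (the helper name is ours; this is a simplification that ignores epoch-count and convergence variance):

```python
def expected_minutes(ref_minutes, ref_throughput, new_throughput):
    """Estimate time to train on a new system from a reference run,
    assuming time scales inversely with sustained throughput."""
    return ref_minutes * ref_throughput / new_throughput

# The Azure row (75 min at 26,115 images/sec) predicts the GCP row:
gcp_estimate = expected_minutes(75, 26_115, 23_159)  # ~84.6 min vs 85 measured
```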



AI Inference

Real-world inference demands high throughput and low latency with maximum efficiency across use cases. An industry-leading solution lets customers quickly deploy AI models into real-world production with the highest performance from data center to edge.


AI Pipeline

NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.
