AI Training

Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. Training to convergence is the best way to test whether an AI system is ready to be deployed in the field and deliver meaningful results.




NVIDIA Performance on MLPerf 3.1 Training Benchmarks


NVIDIA Performance on MLPerf 3.1’s AI Benchmarks: Single Node, Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| NeMo | Stable Diffusion | 46.8 | FID <= 90 and CLIP >= 0.15 | 8x H100 | XE9680x8H100-SXM-80GB | 3.1-2019 | Mixed | LAION-400M-filtered | H100-SXM5-80GB |
| MXNet | ResNet-50 v1.5 | 13.4 | 75.90% classification | 8x H100 | ESC-N8-E11 | 3.1-2011 | Mixed | ImageNet | H100-SXM5-80GB |
| MXNet | 3D U-Net | 13.1 | 0.908 Mean DICE score | 8x H100 | AS-8125GS-TNHR | 3.1-2068 | Mixed | KiTS19 | H100-SXM5-80GB |
| PyTorch | BERT | 5.4 | 0.72 Mask-LM accuracy | 8x H100 | ESC-N8-E11 | 3.1-2011 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB |
| PyTorch | Mask R-CNN | 19.2 | 0.377 Box min AP and 0.339 Mask min AP | 8x H100 | Eos_n1 | 3.1-2048 | Mixed | COCO2017 | H100-SXM5-80GB |
| PyTorch | RNN-T | 16.2 | 0.058 Word Error Rate | 8x H100 | GIGABYTE G593-ZD2 | 3.1-2028 | Mixed | LibriSpeech | H100-SXM5-80GB |
| PyTorch | RetinaNet | 36.0 | 34.0% mAP | 8x H100 | ESC-N8-E11 | 3.1-2011 | Mixed | A subset of OpenImages | H100-SXM5-80GB |
| NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 3.9 | 0.80275 AUC | 8x H100 | Eos_n1 | 3.1-2047 | Mixed | Criteo 4TB | H100-SXM5-80GB |

NVIDIA Performance on MLPerf 3.1’s AI Benchmarks: Multi Node, Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| NVIDIA NeMo | GPT3 | 58.3 | 2.69 log perplexity | 512x H100 | Eos_n64 | 3.1-2057 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 40.6 | 2.69 log perplexity | 768x H100 | Eos_n96 | 3.1-2065 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 8.6 | 2.69 log perplexity | 4,096x H100 | Eos-dfw_n512 | 3.1-2008 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 6.0 | 2.69 log perplexity | 6,144x H100 | Eos-dfw_n768 | 3.1-2009 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 4.9 | 2.69 log perplexity | 8,192x H100 | Eos-dfw_n1024 | 3.1-2005 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 4.1 | 2.69 log perplexity | 10,240x H100 | Eos-dfw_n1280 | 3.1-2006 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 3.9 | 2.69 log perplexity | 10,752x H100 | Eos-dfw_n1344 | 3.1-2007 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | Stable Diffusion | 10.0 | FID <= 90 and CLIP >= 0.15 | 64x H100 | Eos_n8 | 3.1-2060 | Mixed | LAION-400M-filtered | H100-SXM5-80GB |
| NVIDIA NeMo | Stable Diffusion | 2.9 | FID <= 90 and CLIP >= 0.15 | 512x H100 | Eos_n64 | 3.1-2055 | Mixed | LAION-400M-filtered | H100-SXM5-80GB |
| NVIDIA NeMo | Stable Diffusion | 2.5 | FID <= 90 and CLIP >= 0.15 | 1,024x H100 | Eos_n128 | 3.1-2050 | Mixed | LAION-400M-filtered | H100-SXM5-80GB |
| MXNet | ResNet-50 v1.5 | 2.5 | 75.90% classification | 64x H100 | Eos_n8 | 3.1-2058 | Mixed | ImageNet | H100-SXM5-80GB |
| MXNet | ResNet-50 v1.5 | 0.2 | 75.90% classification | 3,584x H100 | coreweave_hgxh100_n448_ngc23.04_mxnet | 3.1-2010 | Mixed | ImageNet | H100-SXM5-80GB |
| MXNet | 3D U-Net | 1.9 | 0.908 Mean DICE score | 72x H100 | Eos_n9 | 3.1-2063 | Mixed | KiTS19 | H100-SXM5-80GB |
| MXNet | 3D U-Net | 0.8 | 0.908 Mean DICE score | 768x H100 | Eos_n96 | 3.1-2064 | Mixed | KiTS19 | H100-SXM5-80GB |
| PyTorch | BERT | 0.9 | 0.72 Mask-LM accuracy | 64x H100 | Eos_n8 | 3.1-2061 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB |
| PyTorch | BERT | 0.1 | 0.72 Mask-LM accuracy | 3,472x H100 | Eos_n434 | 3.1-2053 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB |
| PyTorch | Mask R-CNN | 4.3 | 0.377 Box min AP and 0.339 Mask min AP | 64x H100 | Eos_n8 | 3.1-2061 | Mixed | COCO2017 | H100-SXM5-80GB |
| PyTorch | Mask R-CNN | 1.5 | 0.377 Box min AP and 0.339 Mask min AP | 384x H100 | Eos_n48 | 3.1-2054 | Mixed | COCO2017 | H100-SXM5-80GB |
| PyTorch | RNN-T | 4.2 | 0.058 Word Error Rate | 64x H100 | Eos_n8 | 3.1-2061 | Mixed | LibriSpeech | H100-SXM5-80GB |
| PyTorch | RNN-T | 1.7 | 0.058 Word Error Rate | 512x H100 | Eos_n64 | 3.1-2056 | Mixed | LibriSpeech | H100-SXM5-80GB |
| PyTorch | RetinaNet | 6.1 | 34.0% mAP | 64x H100 | Eos_n8 | 3.1-2062 | Mixed | A subset of OpenImages | H100-SXM5-80GB |
| PyTorch | RetinaNet | 0.9 | 34.0% mAP | 2,048x H100 | Eos_n256 | 3.1-2052 | Mixed | A subset of OpenImages | H100-SXM5-80GB |
| NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 1.4 | 0.80275 AUC | 64x H100 | Eos_n8 | 3.1-2059 | Mixed | Criteo 4TB | H100-SXM5-80GB |
| NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 1.0 | 0.80275 AUC | 128x H100 | Eos_n16 | 3.1-2051 | Mixed | Criteo 4TB | H100-SXM5-80GB |
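The GPT3 rows above give a sense of scaling efficiency at the extreme: going from 512 to 10,752 GPUs cuts time-to-train from 58.3 to 3.9 minutes. As a back-of-the-envelope check, a minimal sketch in plain Python using only the figures reported in the table (the script itself is illustrative, not part of any benchmark tooling):

```python
# Sketch: strong-scaling efficiency for the GPT3 rows above.
# Speedup is measured against the 512-GPU run; efficiency is
# speedup divided by the increase in GPU count.

base_gpus, base_minutes = 512, 58.3

for gpus, minutes in [(4096, 8.6), (8192, 4.9), (10752, 3.9)]:
    speedup = base_minutes / minutes
    efficiency = speedup / (gpus / base_gpus)
    print(f"{gpus:>6} GPUs: {speedup:4.1f}x speedup, {efficiency:.0%} efficiency")

#   4096 GPUs:  6.8x speedup, 85% efficiency
#   8192 GPUs: 11.9x speedup, 74% efficiency
#  10752 GPUs: 14.9x speedup, 71% efficiency
```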

MLPerf™ v3.1 Training Closed: 3.1-2005, 3.1-2006, 3.1-2007, 3.1-2008, 3.1-2009, 3.1-2010, 3.1-2011, 3.1-2019, 3.1-2028, 3.1-2047, 3.1-2048, 3.1-2050, 3.1-2051, 3.1-2052, 3.1-2053, 3.1-2054, 3.1-2055, 3.1-2056, 3.1-2057, 3.1-2058, 3.1-2059, 3.1-2060, 3.1-2061, 3.1-2062, 3.1-2063, 3.1-2064, 3.1-2065, 3.1-2068 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
For MLPerf Training rules and guidelines, see https://mlcommons.org/.


NVIDIA Performance on MLPerf 3.0’s Training HPC Benchmarks: Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| PyTorch | CosmoFlow | 2.1 | Mean average error 0.124 | 512x H100 | eos | 3.0-8006 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | H100-SXM5-80GB |
| PyTorch | DeepCAM | 0.8 | IOU 0.82 | 2,048x H100 | eos | 3.0-8007 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | H100-SXM5-80GB |
| PyTorch | OpenCatalyst | 10.7 | Forces mean absolute error 0.036 | 640x H100 | eos | 3.0-8008 | Mixed | Open Catalyst 2020 (OC20) S2EF 2M training split, ID validation set | H100-SXM5-80GB |
| PyTorch | OpenFold | 7.5 | Local Distance Difference Test (lDDT-Cα) >= 0.8 | 2,080x H100 | eos | 3.0-8009 | Mixed | OpenProteinSet and Protein Data Bank | H100-SXM5-80GB |

MLPerf™ v3.0 Training HPC Closed: 3.0-8006, 3.0-8007, 3.0-8008, 3.0-8009 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
For MLPerf™ v3.0 Training HPC rules and guidelines, see https://mlcommons.org/.



LLM Training Performance on NVIDIA Data Center Products



H100 Training Performance



| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Sequence Length | TP | PP | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NeMo | 1.19.0 | GPT3 5B | 1,290,000 tokens/sec | 64x H100 | DGX H100 | 23.05-py3 | 2048 | 1 | 1 | FP8 | 16 | PILE | H100 SXM5-80GB |
| NeMo | 1.19.0 | GPT3 20B | 311,000 tokens/sec | 64x H100 | DGX H100 | 23.05-py3 | 2048 | 4 | 1 | FP8 | 4 | PILE | H100 SXM5-80GB |
| NeMo | 1.19.0 | GPT3 175B | 86,100 tokens/sec | 128x H100 | DGX H100 | 23.05-py3 | 2048 | 4 | 8 | FP8 | 2 | PILE | H100 SXM5-80GB |
| NeMo | 1.21.0 | Llama2 7B | 51,100 tokens/sec | 8x H100 | DGX H100 | 23.08-py3 | 4096 | 1 | 1 | FP8 | 16 | PILE | H100 SXM5-80GB |
| NeMo | 1.21.0 | Llama2 13B | 50,900 tokens/sec | 16x H100 | DGX H100 | 23.08-py3 | 4096 | 2 | 1 | FP8 | 8 | PILE | H100 SXM5-80GB |

TP: Tensor Parallelism
PP: Pipeline Parallelism
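The TP and PP factors determine how each model copy is sliced across GPUs; whatever GPUs remain form data-parallel replicas. As a minimal sketch of that arithmetic using the GPT3 175B row above (the `data_parallel_size` helper is hypothetical, not a NeMo API):

```python
# Sketch: derive the data-parallel width implied by a TP/PP layout.

def data_parallel_size(total_gpus: int, tp: int, pp: int) -> int:
    """GPUs are first consumed by tensor x pipeline model parallelism;
    the remaining factor replicates the model for data parallelism."""
    model_parallel = tp * pp
    assert total_gpus % model_parallel == 0, "GPU count must divide evenly"
    return total_gpus // model_parallel

# GPT3 175B row: 128 GPUs with TP=4 and PP=8 -> each model copy spans
# 32 GPUs, giving 4 data-parallel replicas.
print(data_parallel_size(128, tp=4, pp=8))  # 4
```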

H100 Multi-Node Scaling Training Performance



| Framework | Framework Version | Network | Throughput | Number of Nodes | GPU | Server | Container | Sequence Length | TP | PP | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NeMo | 1.19.0 | GPT3 175B | 86,400 tokens/sec | 16 | 128x H100 | DGX H100 | 23.05-py3 | 2048 | 4 | 8 | FP8 | 2 | PILE | H100 SXM5-80GB |
| NeMo | 1.19.0 | GPT3 175B | 173,000 tokens/sec | 32 | 256x H100 | DGX H100 | 23.05-py3 | 2048 | 4 | 8 | FP8 | 2 | PILE | H100 SXM5-80GB |
| NeMo | 1.19.0 | GPT3 175B | 342,000 tokens/sec | 64 | 512x H100 | DGX H100 | 23.05-py3 | 2048 | 4 | 8 | FP8 | 2 | PILE | H100 SXM5-80GB |

TP: Tensor Parallelism
PP: Pipeline Parallelism
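Throughput in the table above scales nearly linearly with node count. A quick check, as a minimal sketch using only the numbers reported in the table:

```python
# Sketch: scaling efficiency relative to the 16-node GPT3 175B baseline.

baseline_nodes, baseline_tps = 16, 86_400          # tokens/sec at 16 nodes

for nodes, tps in [(32, 173_000), (64, 342_000)]:
    ideal = baseline_tps * nodes / baseline_nodes  # perfect linear scaling
    print(f"{nodes} nodes: {tps / ideal:.1%} of linear")

# 32 nodes: 100.1% of linear
# 64 nodes: 99.0% of linear
```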


Converged Training Performance on NVIDIA Data Center GPUs


H100 Training Performance



| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.1.0a0 | Tacotron2 | 61 | 0.56 Training Loss | 494,202 total output mels/sec | 8x H100 | DGX H100 | 23.07-py3 | Mixed | 128 | LJSpeech 1.1 | H100 SXM5-80GB |
| PyTorch | 2.1.0a0 | WaveGlow | 119 | -5.7 Training Loss | 3,631,002 output samples/sec | 8x H100 | DGX H100 | 23.08-py3 | Mixed | 10 | LJSpeech 1.1 | H100 SXM5-80GB |
| PyTorch | 2.1.0a0 | GNMT v2 | 11 | 24.36 BLEU Score | 1,679,675 total tokens/sec | 8x H100 | DGX H100 | 23.08-py3 | Mixed | 128 | wmt16-en-de | H100 SXM5-80GB |
| PyTorch | 2.1.0a0 | Transformer | 107 | 27.77 BLEU Score | 945,566 tokens/sec | 8x H100 | DGX H100 | 23.08-py3 | Mixed | 10240 | wmt14-en-de | H100 SXM5-80GB |
| PyTorch | 2.1.0a0 | EfficientNet-B4 | 1,622 | 81.82 Top 1 | 5,375 images/sec | 8x H100 | DGX H100 | 23.08-py3 | Mixed | 128 | ImageNet2012 | H100 SXM5-80GB |
| PyTorch | 2.1.0a0 | EfficientDet-D0 | 327 | 0.33 BBOX mAP | 2,670 images/sec | 8x H100 | DGX H100 | 23.08-py3 | Mixed | 150 | COCO 2017 | H100 SXM5-80GB |
| PyTorch | 2.1.0a0 | EfficientNet-WideSE-B4 | 1,639 | 82. Top 1 | 5,323 images/sec | 8x H100 | DGX H100 | 23.08-py3 | Mixed | 128 | ImageNet2012 | H100 SXM5-80GB |
| PyTorch | 2.1.0a0 | HiFiGAN | 1,035 | 9.37 Training Loss | 105,894 total output mels/sec | 8x H100 | DGX H100 | 23.07-py3 | Mixed | 16 | LJSpeech-1.1 | H100-SXM5-80GB |
| TensorFlow | 2.12.0 | U-Net Medical | 1 | 0.89 DICE Score | 2,238 images/sec | 8x H100 | DGX H100 | 23.07-py3 | Mixed | 8 | EM segmentation challenge | H100 SXM5-80GB |
| TensorFlow | 2.13.0 | Electra Fine Tuning | 2 | 92.57 F1 | 5,367 sequences/sec | 8x H100 | DGX H100 | 23.08-py3 | Mixed | 32 | SQuAD v1.1 | H100 SXM5-80GB |
| TensorFlow | 2.12.0 | Wide and Deep | 4 | 0.66 MAP at 12 | 12,780,170 samples/sec | 8x H100 | DGX H100 | 23.07-py3 | Mixed | 16384 | Kaggle Outbrain Click Prediction | H100 SXM5-80GB |
| NeMo | - | ViT g/14 | 2,137 | 70.2 Top 1 | 1,610 images/sec | 8x H100 | DGX H100 | 23.06-py3 | BF16 | 8192 | ImageNet2012 | H100-SXM5-80GB |
| NeMo | - | ViT H/14 | 5,450 | 75. Top 1 | 2,359 images/sec | 8x H100 | DGX H100 | 23.06-py3 | BF16 | 8192 | ImageNet2012 | H100-SXM5-80GB |

A40 Training Performance


| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.0.0a0 | NCF | 1 | 0.96 Hit Rate at 10 | 49,380,246 samples/sec | 8x A40 | GIGABYTE G482-Z52-00 | 23.05-py3 | Mixed | 131072 | MovieLens 20M | A40 |
| PyTorch | 2.0.0a0 | Tacotron2 | 112 | 0.56 Training Loss | 271,434 total output mels/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.03-py3 | Mixed | 128 | LJSpeech 1.1 | A40 |
| PyTorch | 2.1.0a0 | WaveGlow | 428 | -5.94 Training Loss | 986,709 output samples/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.07-py3 | Mixed | 10 | LJSpeech 1.1 | A40 |
| PyTorch | 2.1.0a0 | GNMT v2 | 46 | 24.3 BLEU Score | 329,065 total tokens/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.07-py3 | Mixed | 128 | wmt16-en-de | A40 |
| PyTorch | 2.1.0a0 | FastPitch | 141 | 0.17 Training Loss | 617,166 frames/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.07-py3 | Mixed | 32 | LJSpeech 1.1 | A40 |
| PyTorch | 2.1.0a0 | EfficientNet-B4 | 6,390 | 81.93 Top 1 | 1,360 images/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.07-py3 | Mixed | 64 | ImageNet2012 | A40 |
| PyTorch | 2.1.0a0 | EfficientNet-WideSE-B0 | 855 | 77.08 Top 1 | 10,467 images/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.07-py3 | Mixed | 256 | ImageNet2012 | A40 |
| PyTorch | 2.1.0a0 | EfficientDet-D0 | 708 | 0.34 BBOX mAP | 1,112 images/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.07-py3 | Mixed | 60 | COCO 2017 | A40 |
| TensorFlow | 2.12.0 | SIM | 1 | 0.79 AUC | 2,450,911 samples/sec | 8x A40 | GIGABYTE G482-Z52-00 | 23.07-py3 | Mixed | 16384 | Amazon Reviews | A40 |

A30 Training Performance


| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.1.0a0 | Tacotron2 | 131 | 0.53 Training Loss | 236,272 total output mels/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 104 | LJSpeech 1.1 | A30 |
| PyTorch | 2.1.0a0 | WaveGlow | 411 | -5.73 Training Loss | 1,030,817 output samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 10 | LJSpeech 1.1 | A30 |
| PyTorch | 2.1.0a0 | GNMT v2 | 49 | 24.14 BLEU Score | 310,790 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 128 | wmt16-en-de | A30 |
| PyTorch | 2.0.0 | NCF | 1 | 0.96 Hit Rate at 10 | 51,737,678 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.05-py3 | Mixed | 131072 | MovieLens 20M | A30 |
| PyTorch | 2.1.0a0 | FastPitch | 159 | 0.17 Training Loss | 531,205 frames/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 16 | LJSpeech 1.1 | A30 |
| PyTorch | 2.1.0a0 | EfficientDet-D0 | 933 | 0.34 BBOX mAP | 796 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 30 | COCO 2017 | A30 |
| PyTorch | 2.1.0a0 | EfficientNet-WideSE-B0 | 823 | 77.25 Top 1 | 10,782 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 128 | ImageNet2012 | A30 |
| TensorFlow | 2.13.0 | U-Net Medical | 4 | 0.89 DICE Score | 473 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 8 | EM segmentation challenge | A30 |
| TensorFlow | 2.13.0 | Electra Fine Tuning | 5 | 92.8 F1 | 1,019 sequences/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 16 | SQuAD v1.1 | A30 |
| TensorFlow | 2.13.0 | SIM | 1 | 0.81 AUC | 2,573,991 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 16384 | Amazon Reviews | A30 |

A10 Training Performance


| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.1.0a0 | Tacotron2 | 142 | 0.53 Training Loss | 217,183 total output mels/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 104 | LJSpeech 1.1 | A10 |
| PyTorch | 2.1.0a0 | GNMT v2 | 54 | 24.17 BLEU Score | 281,565 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 128 | wmt16-en-de | A10 |
| PyTorch | 2.0.0 | NCF | 1 | 0.96 Hit Rate at 10 | 42,446,575 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.05-py3 | Mixed | 131072 | MovieLens 20M | A10 |
| PyTorch | 2.1.0a0 | FastPitch | 189 | 0.17 Training Loss | 439,357 frames/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 16 | LJSpeech 1.1 | A10 |
| PyTorch | 2.1.0a0 | EfficientDet-D0 | 964 | 0.34 BBOX mAP | 720 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 30 | COCO 2017 | A10 |
| PyTorch | 2.1.0a0 | EfficientNet-WideSE-B0 | 1,080 | 77.19 Top 1 | 8,351 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 128 | ImageNet2012 | A10 |
| TensorFlow | 2.13.0 | U-Net Medical | 3 | 0.89 DICE Score | 364 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 8 | EM segmentation challenge | A10 |
| TensorFlow | 2.13.0 | Electra Fine Tuning | 5 | 92.78 F1 | 790 sequences/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 16 | SQuAD v1.1 | A10 |
| TensorFlow | 2.13.0 | SIM | 1 | 0.79 AUC | 2,313,420 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 16384 | Amazon Reviews | A10 |


Converged Training Performance of NVIDIA Data Center GPUs on Cloud

A100 Training Performance


| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | - | ResNet-50 v1.5 | 78 | 77.18 Top 1 | 25,491 images/sec | 8x A100 | Azure Standard_ND96amsr_A100_v4 | 23.06-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-80GB |
| MXNet | - | ResNet-50 v1.5 | 85 | 77.06 Top 1 | 23,159 images/sec | 8x A100 | GCP A2-HIGHGPU-8G | 23.08-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-40GB |



AI Inference

Real-world inference demands high throughput and low latency, with maximum efficiency across use cases. An industry-leading solution lets customers quickly deploy AI models into production with the highest performance, from data center to edge.


AI Pipeline

NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.
