AI Training

Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. Training to convergence is the most rigorous test of whether an AI system is ready to be deployed in the field and deliver meaningful results.
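
As a minimal sketch of what train-to-convergence benchmarking measures (generic Python pseudocode, not NVIDIA's or MLCommons' harness; the training and evaluation callables are hypothetical placeholders supplied by the caller):

    # Generic train-to-convergence loop: run until the validation metric
    # reaches the quality target, then report elapsed wall-clock minutes.
    import time

    def time_to_train(train_one_epoch, evaluate, quality_target, max_epochs=1000):
        start = time.time()
        for _ in range(max_epochs):
            train_one_epoch()                 # hypothetical training step
            score = evaluate()                # hypothetical validation metric
            if score >= quality_target:       # e.g. 75.90% Top-1 for ResNet-50
                return (time.time() - start) / 60.0
        raise RuntimeError("quality target not reached")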




NVIDIA Performance on MLPerf 4.0 Training Benchmarks


NVIDIA Performance on MLPerf 4.0’s AI Benchmarks: Single Node, Closed Division

Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version
NVIDIA NeMo | Llama2-70B-LoRA | 24.7 | 0.925 cross entropy loss | 8x H200 | NVIDIA H200 | 4.0-0071 | Mixed | SCROLLs GovReport | H200-SXM5-141GB
NVIDIA NeMo | Llama2-70B-LoRA | 28.2 | 0.925 cross entropy loss | 8x H100 | Eos | 4.0-0050 | Mixed | SCROLLs GovReport | H100-SXM5-80GB
DGL | R-GAT | 7.7 | 72.0% classification | 8x H200 | NVIDIA H200 | 4.0-0068 | Mixed | IGBH-Full | H200-SXM5-141GB
DGL | R-GAT | 11.3 | 72.0% classification | 8x H100 | Eos | 4.0-0047 | Mixed | IGBH-Full | H100-SXM5-80GB
NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 3.5 | 0.80275 AUC | 8x H200 | NVIDIA H200 | 4.0-0070 | Mixed | Criteo 4TB | H200-SXM5-141GB
NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 3.9 | 0.80275 AUC | 8x H100 | Eos | 4.0-0049 | Mixed | Criteo 4TB | H100-SXM5-80GB
NVIDIA NeMo | Stable Diffusion v2.0 | 41.3 | FID ≤ 90 and CLIP ≥ 0.15 | 8x H200 | NVIDIA H200 | 4.0-0071 | Mixed | LAION-400M-filtered | H200-SXM5-141GB
NVIDIA NeMo | Stable Diffusion v2.0 | 42.2 | FID ≤ 90 and CLIP ≥ 0.15 | 8x H100 | Eos | 4.0-0050 | Mixed | LAION-400M-filtered | H100-SXM5-80GB
PyTorch | BERT | 5.2 | 0.72 Mask-LM accuracy | 8x H200 | NVIDIA H200 | 4.0-0072 | Mixed | Wikipedia 2020/01/01 | H200-SXM5-141GB
PyTorch | BERT | 5.5 | 0.72 Mask-LM accuracy | 8x H100 | Eos | 4.0-0052 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB
PyTorch | RetinaNet | 34.3 | 34.0% mAP | 8x H200 | NVIDIA H200 | 4.0-0073 | Mixed | Subset of OpenImages | H200-SXM5-141GB
PyTorch | RetinaNet | 35.5 | 34.0% mAP | 8x H100 | Eos | 4.0-0051 | Mixed | Subset of OpenImages | H100-SXM5-80GB
MXNet | 3D U-Net | 11.5 | 0.908 Mean DICE score | 8x H200 | NVIDIA H200 | 4.0-0069 | Mixed | KiTS19 | H200-SXM5-141GB
MXNet | 3D U-Net | 12.1 | 0.908 Mean DICE score | 8x H100 | Eos | 4.0-0048 | Mixed | KiTS19 | H100-SXM5-80GB
MXNet | ResNet-50 v1.5 | 12.1 | 75.90% classification | 8x H200 | NVIDIA H200 | 4.0-0069 | Mixed | ImageNet | H200-SXM5-141GB
MXNet | ResNet-50 v1.5 | 13.3 | 75.90% classification | 8x H100 | Eos | 4.0-0048 | Mixed | ImageNet | H100-SXM5-80GB

NVIDIA Performance on MLPerf 4.0’s AI Benchmarks: Multi Node, Closed Division

Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version
NVIDIA NeMo | GPT3 | 50.7 | 2.69 log perplexity | 512x H100 | Eos_n64 | 4.0-0059 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB
NVIDIA NeMo | GPT3 | 3.7 | 2.69 log perplexity | 10,752x H100 | Eos-dfw_n1344 | 4.0-0006 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB
NVIDIA NeMo | GPT3 | 3.4 | 2.69 log perplexity | 11,616x H100 | Eos-dfw_n1452 | 4.0-0007 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB
NVIDIA NeMo | Llama2-70B-LoRA | 5.3 | 0.925 cross entropy loss | 64x H100 | Eos_n8 | 4.0-0063 | Mixed | SCROLLs GovReport | H100-SXM5-80GB
NVIDIA NeMo | Llama2-70B-LoRA | 1.5 | 0.925 cross entropy loss | 1,024x H100 | Eos_n128 | 4.0-0053 | Mixed | SCROLLs GovReport | H100-SXM5-80GB
DGL | R-GAT | 2.7 | 72.0% classification | 64x H100 | Eos_n8 | 4.0-0060 | Mixed | IGBH-Full | H100-SXM5-80GB
DGL | R-GAT | 1.1 | 72.0% classification | 512x H100 | Eos_n64 | 4.0-0058 | Mixed | IGBH-Full | H100-SXM5-80GB
NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 1.4 | 0.80275 AUC | 64x H100 | Eos_n8 | 4.0-0062 | Mixed | Criteo 4TB | H100-SXM5-80GB
NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 1.0 | 0.80275 AUC | 128x H100 | Eos_n16 | 4.0-0054 | Mixed | Criteo 4TB | H100-SXM5-80GB
NVIDIA NeMo | Stable Diffusion v2.0 | 6.7 | FID ≤ 90 and CLIP ≥ 0.15 | 64x H100 | Eos_n8 | 4.0-0063 | Mixed | LAION-400M-filtered | H100-SXM5-80GB
NVIDIA NeMo | Stable Diffusion v2.0 | 1.8 | FID ≤ 90 and CLIP ≥ 0.15 | 512x H100 | Eos_n64 | 4.0-0059 | Mixed | LAION-400M-filtered | H100-SXM5-80GB
NVIDIA NeMo | Stable Diffusion v2.0 | 1.4 | FID ≤ 90 and CLIP ≥ 0.15 | 1,024x H100 | Eos_n128 | 4.0-0053 | Mixed | LAION-400M-filtered | H100-SXM5-80GB
PyTorch | BERT | 0.9 | 0.72 Mask-LM accuracy | 64x H100 | Eos_n8 | 4.0-0064 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB
PyTorch | BERT | 0.1 | 0.72 Mask-LM accuracy | 3,472x H100 | Eos_n434 | 4.0-0057 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB
PyTorch | RetinaNet | 6.1 | 34.0% mAP | 64x H100 | Eos_n8 | 4.0-0065 | Mixed | Subset of OpenImages | H100-SXM5-80GB
PyTorch | RetinaNet | 0.8 | 34.0% mAP | 2,528x H100 | Eos_n316 | 4.0-0056 | Mixed | Subset of OpenImages | H100-SXM5-80GB
MXNet | 3D U-Net | 1.9 | 0.908 Mean DICE score | 72x H100 | Eos_n9 | 4.0-0066 | Mixed | KiTS19 | H100-SXM5-80GB
MXNet | 3D U-Net | 0.8 | 0.908 Mean DICE score | 768x H100 | Eos_n96 | 4.0-0067 | Mixed | KiTS19 | H100-SXM5-80GB
MXNet | ResNet-50 v1.5 | 2.5 | 75.90% classification | 64x H100 | Eos_n8 | 4.0-0061 | Mixed | ImageNet | H100-SXM5-80GB
MXNet | ResNet-50 v1.5 | 0.2 | 75.90% classification | 3,584x H100 | NVIDIA+CoreWeave Joint Submission | 4.0-0008 | Mixed | ImageNet | H100-SXM5-80GB

MLPerf™ v4.0 Training Closed: 4.0-0006, 4.0-0007, 4.0-0008, 4.0-0047, 4.0-0048, 4.0-0049, 4.0-0050, 4.0-0051, 4.0-0052, 4.0-0053, 4.0-0054, 4.0-0055, 4.0-0056, 4.0-0057, 4.0-0058, 4.0-0059, 4.0-0060, 4.0-0061, 4.0-0062, 4.0-0063, 4.0-0064, 4.0-0065, 4.0-0066, 4.0-0067, 4.0-0068, 4.0-0069, 4.0-0070, 4.0-0071, 4.0-0072, 4.0-0073 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
For MLPerf Training rules and guidelines, see https://mlcommons.org/
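
The multi-node results also give a rough read on scaling behavior. As a back-of-the-envelope illustration (this is not an MLPerf metric; the figures below are taken from the GPT3 rows above):

    # Rough scaling efficiency between two submissions of the same benchmark:
    # the ideal speedup is the GPU ratio; the measured speedup is the time ratio.
    def scaling_efficiency(t_small_min, n_small, t_large_min, n_large):
        measured_speedup = t_small_min / t_large_min
        ideal_speedup = n_large / n_small
        return measured_speedup / ideal_speedup

    # GPT3: 50.7 min on 512 GPUs vs 3.7 min on 10,752 GPUs
    # -> ~13.7x speedup from 21x more GPUs, i.e. ~65% scaling efficiency.
    print(f"{scaling_efficiency(50.7, 512, 3.7, 10_752):.0%}")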


NVIDIA Performance on MLPerf 3.0’s Training HPC Benchmarks: Closed Division

Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version
PyTorch | CosmoFlow | 2.1 | Mean average error 0.124 | 512x H100 | eos | 3.0-8006 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | H100-SXM5-80GB
PyTorch | DeepCAM | 0.8 | IOU 0.82 | 2,048x H100 | eos | 3.0-8007 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | H100-SXM5-80GB
PyTorch | OpenCatalyst | 10.7 | Forces mean absolute error 0.036 | 640x H100 | eos | 3.0-8008 | Mixed | Open Catalyst 2020 (OC20) S2EF 2M training split, ID validation set | H100-SXM5-80GB
PyTorch | OpenFold | 7.5 | Local Distance Difference Test (lDDT-Cα) ≥ 0.8 | 2,080x H100 | eos | 3.0-8009 | Mixed | OpenProteinSet and Protein Data Bank | H100-SXM5-80GB

MLPerf™ v3.0 Training HPC Closed: 3.0-8006, 3.0-8007, 3.0-8008, 3.0-8009 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
For MLPerf™ v3.0 Training HPC rules and guidelines, see https://mlcommons.org/



LLM Training Performance on NVIDIA Data Center Products


H100 Training Performance



Framework | Model | Time to Train (days) | Throughput per GPU | GPU | Server | Container Version | Sequence Length | TP | PP | CP | Precision | Global Batch Size | GPU Version
NeMo | GPT3 5B | 0.5 | 23,117 tokens/sec | 64x H100 | Eos | nemo:24.05 | 2,048 | 1 | 1 | 1 | FP8 | 2,048 | H100 SXM5 80GB
NeMo | GPT3 20B | 2 | 5,611 tokens/sec | 64x H100 | Eos | nemo:24.05 | 2,048 | 2 | 1 | 1 | FP8 | 256 | H100 SXM5 80GB
NeMo | Llama2 7B | 0.7 | 16,154 tokens/sec | 8x H100 | Eos | nemo:24.05 | 4,096 | 1 | 1 | 1 | FP8 | 128 | H100 SXM5 80GB
NeMo | Llama2 13B | 1.4 | 8,344 tokens/sec | 16x H100 | Eos | nemo:24.05 | 4,096 | 1 | 4 | 1 | FP8 | 128 | H100 SXM5 80GB
NeMo | Llama2 70B | 6.8 | 1,659 tokens/sec | 64x H100 | Eos | nemo:24.05 | 4,096 | 4 | 4 | 1 | FP8 | 128 | H100 SXM5 80GB
NeMo | Llama3 8B | 1 | 11,879 tokens/sec | 8x H100 | Eos | nemo:24.05 | 8,192 | 1 | 1 | 2 | FP8 | 128 | H100 SXM5 80GB
NeMo | Llama3 70B | 7.8 | 1,444 tokens/sec | 64x H100 | Eos | nemo:24.05 | 8,192 | 4 | 4 | 2 | FP8 | 128 | H100 SXM5 80GB

TP: Tensor Parallelism
PP: Pipeline Parallelism
CP: Context Parallelism
Time to Train is the estimated time to train on 1T tokens with 1K (1,024) GPUs
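
The Time to Train column can be reproduced from the per-GPU throughput with simple arithmetic. A minimal sketch, assuming 1K means 1,024 GPUs and perfectly linear scaling of the measured per-GPU rate (both are assumptions, though they reproduce the table's figures):

    # Estimated days to train on 1T tokens from measured per-GPU throughput.
    def estimated_days(tokens_per_sec_per_gpu, num_gpus=1024, tokens=1e12):
        cluster_rate = tokens_per_sec_per_gpu * num_gpus  # tokens/sec
        return tokens / cluster_rate / 86_400             # seconds -> days

    print(round(estimated_days(23_117), 1))  # GPT3 5B    -> 0.5 days
    print(round(estimated_days(1_659), 1))   # Llama2 70B -> 6.8 days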


Converged Training Performance on NVIDIA Data Center GPUs


H200 Training Performance



Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 2.4.0a0 | Tacotron2 | 62 | .54 Training Loss | 514,893 total output mels/sec | 8x H200 | DGX H200 | 24.06-py3 | TF32 | 128 | LJSpeech 1.1 | NVIDIA H200
PyTorch | 2.4.0a0 | WaveGlow | 110 | -5.7 Training Loss | 3,974,080 output samples/sec | 8x H200 | DGX H200 | 24.06-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA H200
PyTorch | 2.4.0a0 | GNMT v2 | 9 | 24.26 BLEU Score | 1,870,930 total tokens/sec | 8x H200 | DGX H200 | 24.06-py3 | Mixed | 128 | wmt16-en-de | NVIDIA H200
PyTorch | 2.4.0a0 | NCF | - | .96 Hit Rate at 10 | 244,942,025 samples/sec | 8x H200 | DGX H200 | 24.06-py3 | Mixed | 131072 | MovieLens 20M | NVIDIA H200
PyTorch | 2.4.0a0 | FastPitch | 72 | .17 Training Loss | 1,350,880 frames/sec | 8x H200 | DGX H200 | 24.06-py3 | TF32 | 32 | LJSpeech 1.1 | NVIDIA H200
PyTorch | 2.4.0a0 | Transformer XL Large | 277 | 17.87 Perplexity | 301,765 total tokens/sec | 8x H200 | DGX H200 | 24.06-py3 | Mixed | 16 | WikiText-103 | NVIDIA H200
PyTorch | 2.4.0a0 | Transformer XL Base | 122 | 21.58 Perplexity | 1,100,513 total tokens/sec | 8x H200 | DGX H200 | 24.06-py3 | Mixed | 128 | WikiText-103 | NVIDIA H200
PyTorch | 2.4.0a0 | EfficientNet-B4 | 101 | 82. Top 1 | 6,030 images/sec | 8x H200 | DGX H200 | 24.06-py3 | Mixed | 128 | Imagenet2012 | NVIDIA H200
PyTorch | 2.4.0a0 | EfficientDet-D0 | 307 | .33 BBOX mAP | 2,755 images/sec | 8x H200 | DGX H200 | 24.06-py3 | Mixed | 150 | COCO 2017 | NVIDIA H200
PyTorch | 2.4.0a0 | EfficientNet-WideSE-B4 | 1,451 | 82.21 Top 1 | 6,024 images/sec | 8x H200 | DGX H200 | 24.06-py3 | Mixed | 128 | Imagenet2012 | NVIDIA H200
PyTorch | 2.4.0a0 | TFT-Electricity | 2 | .03 Test P90 | 158,366 items/sec | 8x H200 | DGX H200 | 24.06-py3 | Mixed | 1024 | Electricity | NVIDIA H200
PyTorch | 2.4.0a0 | HiFiGAN | 911 | 9.22 Training Loss | 119,911 total output mels/sec | 8x H200 | DGX H200 | 24.06-py3 | Mixed | 16 | LJSpeech-1.1 | NVIDIA H200
PyTorch | 2.4.0a0 | GPUNet-0 | 1,054 | 78.86 Top 1 | 9,934 images/sec | 8x H200 | DGX H200 | 24.06-py3 | Mixed | 192 | Imagenet2012 | NVIDIA H200
PyTorch | 2.4.0a0 | GPUNet-1 | 963 | 80.33 Top 1 | 10,905 images/sec | 8x H200 | DGX H200 | 24.06-py3 | Mixed | 192 | Imagenet2012 | NVIDIA H200
TensorFlow | 2.16.1 | U-Net Medical | 2 | .89 DICE Score | 2,356 images/sec | 8x H200 | DGX H200 | 24.06-py3 | Mixed | 8 | EM segmentation challenge | NVIDIA H200
TensorFlow | 2.16.1 | Wide and Deep | 4 | .66 MAP at 12 | 11,859,112 samples/sec | 8x H200 | DGX H200 | 24.06-py3 | Mixed | 16384 | Kaggle Outbrain Click Prediction | NVIDIA H200

H100 Training Performance



Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 2.4.0a0 | Tacotron2 | 67 | .56 Training Loss | 473,451 total output mels/sec | 8x H100 | DGX H100 | 24.06-py3 | Mixed | 128 | LJSpeech 1.1 | H100-SXM5-80GB
PyTorch | 2.4.0a0 | WaveGlow | 116 | -5.73 Training Loss | 3,738,190 output samples/sec | 8x H100 | DGX H100 | 24.06-py3 | Mixed | 10 | LJSpeech 1.1 | H100-SXM5-80GB
PyTorch | 2.4.0a0 | GNMT v2 | 9 | 24.11 BLEU Score | 1,710,731 total tokens/sec | 8x H100 | DGX H100 | 24.06-py3 | Mixed | 128 | wmt16-en-de | H100-SXM5-80GB
PyTorch | 2.4.0a0 | NCF | - | .96 Hit Rate at 10 | 219,720,903 samples/sec | 8x H100 | DGX H100 | 24.06-py3 | Mixed | 131072 | MovieLens 20M | H100-SXM5-80GB
PyTorch | 2.4.0a0 | FastPitch | 72 | .17 Training Loss | 1,364,338 frames/sec | 8x H100 | DGX H100 | 24.06-py3 | TF32 | 32 | LJSpeech 1.1 | H100-SXM5-80GB
PyTorch | 2.4.0a0 | Transformer XL Large | 318 | 17.83 Perplexity | 261,789 total tokens/sec | 8x H100 | DGX H100 | 24.06-py3 | Mixed | 16 | WikiText-103 | H100-SXM5-80GB
PyTorch | 2.4.0a0 | Transformer XL Base | 140 | 21.58 Perplexity | 957,832 total tokens/sec | 8x H100 | DGX H100 | 24.06-py3 | Mixed | 128 | WikiText-103 | H100-SXM5-80GB
PyTorch | 2.4.0a0 | EfficientNet-B4 | 1,658 | 81.92 Top 1 | 5,251 images/sec | 8x H100 | DGX H100 | 24.06-py3 | Mixed | 128 | Imagenet2012 | H100-SXM5-80GB
PyTorch | 2.4.0a0 | EfficientDet-D0 | 317 | .33 BBOX mAP | 2,630 images/sec | 8x H100 | DGX H100 | 24.06-py3 | Mixed | 150 | COCO 2017 | H100-SXM5-80GB
PyTorch | 2.4.0a0 | EfficientNet-WideSE-B4 | 1,668 | 82.28 Top 1 | 5,223 images/sec | 8x H100 | DGX H100 | 24.06-py3 | Mixed | 128 | Imagenet2012 | H100-SXM5-80GB
PyTorch | 2.4.0a0 | HiFiGAN | 944 | 9.67 Training Loss | 116,668 total output mels/sec | 8x H100 | DGX H100 | 24.06-py3 | Mixed | 16 | LJSpeech-1.1 | H100-SXM5-80GB
PyTorch | 2.4.0a0 | GPUNet-0 | 1,062 | 78.69 Top 1 | 9,856 images/sec | 8x H100 | DGX H100 | 24.06-py3 | Mixed | 192 | Imagenet2012 | H100-SXM5-80GB
PyTorch | 2.4.0a0 | GPUNet-1 | 956 | 80.29 Top 1 | 10,981 images/sec | 8x H100 | DGX H100 | 24.06-py3 | Mixed | 192 | Imagenet2012 | H100-SXM5-80GB
PyTorch | 2.4.0a0 | MoFlow | 35 | 86.9 NUV | 45,008 molecules/sec | 8x H100 | DGX H100 | 24.06-py3 | Mixed | 512 | ZINC | H100-SXM5-80GB
TensorFlow | 2.16.1 | U-Net Medical | 1 | .89 DICE Score | 2,061 images/sec | 8x H100 | DGX H100 | 24.06-py3 | Mixed | 8 | EM segmentation challenge | H100-SXM5-80GB
TensorFlow | 2.15.0 | Electra Fine Tuning | - | - F1 | - sequences/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 32 | SQuAD v1.1 | H100-SXM5-80GB
TensorFlow | 2.16.1 | Wide and Deep | 4 | .66 MAP at 12 | 10,746,049 samples/sec | 8x H100 | DGX H100 | 24.06-py3 | Mixed | 16384 | Kaggle Outbrain Click Prediction | H100-SXM5-80GB

A30 Training Performance


Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 2.4.0a0 | Tacotron2 | 134 | .51 Training Loss | 223,242 total output mels/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.06-py3 | Mixed | 104 | LJSpeech 1.1 | A30
PyTorch | 2.4.0a0 | WaveGlow | 400 | -5.76 Training Loss | 1,055,135 output samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.06-py3 | Mixed | 10 | LJSpeech 1.1 | A30
PyTorch | 2.4.0a0 | GNMT v2 | 58 | 24.3 BLEU Score | 314,472 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.06-py3 | Mixed | 128 | wmt16-en-de | A30
PyTorch | 2.4.0a0 | NCF | 1 | .96 Hit Rate at 10 | 41,874,445 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.06-py3 | Mixed | 131072 | MovieLens 20M | A30
PyTorch | 2.4.0a0 | FastPitch | 154 | .17 Training Loss | 548,158 frames/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.06-py3 | Mixed | 16 | LJSpeech 1.1 | A30
PyTorch | 2.4.0a0 | Transformer XL Base | 198 | 22.87 Perplexity | 168,143 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.06-py3 | Mixed | 32 | WikiText-103 | A30
PyTorch | 2.4.0a0 | EfficientNet-B0 | 782 | 77.02 Top 1 | 11,319 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.06-py3 | Mixed | 128 | Imagenet2012 | A30
PyTorch | 2.4.0a0 | EfficientNet-WideSE-B0 | 805 | 77.17 Top 1 | 11,038 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.06-py3 | Mixed | 128 | Imagenet2012 | A30
PyTorch | 2.4.0a0 | MoFlow | 102 | 93.96 NUV | 11,986 molecules/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.06-py3 | Mixed | 512 | ZINC | A30
TensorFlow | 2.16.1 | U-Net Medical | 4 | .89 DICE Score | 475 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.06-py3 | Mixed | 8 | EM segmentation challenge | A30
TensorFlow | 2.15.0 | Electra Fine Tuning | 5 | 92.63 F1 | 1,024 sequences/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 16 | SQuAD v1.1 | A30
TensorFlow | - | SIM | - | - AUC | - samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.12-py3 | Mixed | 16384 | Amazon Reviews | A30

A10 Training Performance


Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 2.4.0a0 | Tacotron2 | 147 | .53 Training Loss | 212,211 total output mels/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.06-py3 | Mixed | 104 | LJSpeech 1.1 | A10
PyTorch | 2.4.0a0 | WaveGlow | 506 | -5.8 Training Loss | 834,526 output samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.06-py3 | Mixed | 10 | LJSpeech 1.1 | A10
PyTorch | 2.4.0a0 | GNMT v2 | 69 | 24.4 BLEU Score | 262,565 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.06-py3 | Mixed | 128 | wmt16-en-de | A10
PyTorch | 2.4.0a0 | NCF | 1 | .96 Hit Rate at 10 | 35,888,159 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.06-py3 | Mixed | 131072 | MovieLens 20M | A10
PyTorch | 2.4.0a0 | FastPitch | 181 | .17 Training Loss | 461,920 frames/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.06-py3 | Mixed | 16 | LJSpeech 1.1 | A10
PyTorch | 2.4.0a0 | Transformer XL Base | 282 | 22.8 Perplexity | 117,636 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.06-py3 | Mixed | 32 | WikiText-103 | A10
PyTorch | 2.4.0a0 | EfficientNet-B0 | 1,041 | 77.15 Top 1 | 8,465 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.06-py3 | Mixed | 128 | Imagenet2012 | A10
PyTorch | 2.4.0a0 | EfficientNet-WideSE-B0 | 1,056 | 77.3 Top 1 | 8,350 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.06-py3 | Mixed | 128 | Imagenet2012 | A10
PyTorch | 2.4.0a0 | MoFlow | 100 | 86.84 NUV | 12,270 molecules/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.06-py3 | Mixed | 512 | ZINC | A10
TensorFlow | 2.16.1 | U-Net Medical | 4 | .89 DICE Score | 382 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.06-py3 | Mixed | 8 | EM segmentation challenge | A10
TensorFlow | 2.15.0 | Electra Fine Tuning | 5 | 92.52 F1 | 826 sequences/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 16 | SQuAD v1.1 | A10



View More Performance Data

AI Inference

Real-world inference demands high throughput and low latency with maximum efficiency across use cases. An industry-leading solution lets customers quickly deploy AI models into production with the highest performance from data center to edge.

Learn More

AI Pipeline

NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.

Learn More