AI Training

Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. Measuring the time it takes to reach that accuracy is the best methodology for testing whether AI systems are ready to be deployed in the field and deliver meaningful results.
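Throughout this page, that methodology is reported as time-to-train: wall-clock minutes until the network first reaches its quality target. A minimal sketch of the measurement loop, where `train_one_epoch` and `evaluate` are hypothetical stand-ins for a real training step and validation metric:

```python
import time

def train_one_epoch(model):  # hypothetical: one pass over the training data
    pass

def evaluate(model) -> float:  # hypothetical: quality metric on a validation set
    return 0.73

def time_to_train_minutes(model, quality_target: float, max_epochs: int = 100) -> float:
    """Wall-clock minutes until the model first reaches its quality target,
    e.g. 0.72 Mask-LM accuracy for BERT in the MLPerf tables below."""
    start = time.time()
    for _ in range(max_epochs):
        train_one_epoch(model)
        if evaluate(model) >= quality_target:
            return (time.time() - start) / 60
    raise RuntimeError("quality target not reached")

print(time_to_train_minutes(model=None, quality_target=0.72))
```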




NVIDIA Performance on MLPerf 4.1 Training Benchmarks


NVIDIA Performance on MLPerf 4.1’s AI Benchmarks: Single Node, Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| NVIDIA NeMo | LLama2-70B-lora | 12.9 | 0.925 cross entropy loss | 8x B200 | dgx_b200_preview | 4.1-0027 | Mixed | SCROLLs GovReport | NVIDIA HGX B200 |
| NVIDIA NeMo | LLama2-70B-lora | 24.1 | 0.925 cross entropy loss | 8x H200 | NVIDIA H200 | 4.1-0022 | Mixed | SCROLLs GovReport | H200-SXM5-141GB |
| NVIDIA NeMo | LLama2-70B-lora | 27.9 | 0.925 cross entropy loss | 8x H100 | Eos | 4.1-0002 | Mixed | SCROLLs GovReport | H100-SXM5-80GB |
| NVIDIA DGL | R-GAT | 5.5 | 72.0% classification | 8x B200 | dgx_b200_preview | 4.1-0025 | Mixed | IGBH-Full | NVIDIA HGX B200 |
| NVIDIA DGL | R-GAT | 7.7 | 72.0% classification | 8x H200 | NVIDIA H200 | 4.1-0018 | Mixed | IGBH-Full | H200-SXM5-141GB |
| NVIDIA DGL | R-GAT | 11.2 | 72.0% classification | 8x H100 | Eos | 4.1-0000 | Mixed | IGBH-Full | H100-SXM5-80GB |
| NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 2.4 | 0.80275 AUC | 8x B200 | dgx_b200_preview | 4.1-0026 | Mixed | Criteo 3.5TB Click Logs | NVIDIA HGX B200 |
| NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 3.5 | 0.80275 AUC | 8x H200 | NVIDIA H200 | 4.1-0019 | Mixed | Criteo 3.5TB Click Logs | H200-SXM5-141GB |
| NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 3.9 | 0.80275 AUC | 8x H100 | Eos | 4.1-0001 | Mixed | Criteo 3.5TB Click Logs | H100-SXM5-80GB |
| NVIDIA NeMo | Stable Diffusion v2.0 | 19.5 | FID <= 90 and CLIP >= 0.15 | 8x B200 | dgx_b200_preview | 4.1-0027 | Mixed | LAION-400M-filtered | NVIDIA HGX B200 |
| NVIDIA NeMo | Stable Diffusion v2.0 | 30.5 | FID <= 90 and CLIP >= 0.15 | 8x H200 | NVIDIA H200 | 4.1-0022 | Mixed | LAION-400M-filtered | H200-SXM5-141GB |
| NVIDIA NeMo | Stable Diffusion v2.0 | 33.9 | FID <= 90 and CLIP >= 0.15 | 8x H100 | Eos | 4.1-0002 | Mixed | LAION-400M-filtered | H100-SXM5-80GB |
| PyTorch | BERT | 3.8 | 0.72 Mask-LM accuracy | 8x B200 | dgx_b200_preview | 4.1-0028 | Mixed | Wikipedia 2020/01/01 | NVIDIA HGX B200 |
| PyTorch | BERT | 5.2 | 0.72 Mask-LM accuracy | 8x H200 | NVIDIA H200 | 4.1-0020 | Mixed | Wikipedia 2020/01/01 | H200-SXM5-141GB |
| PyTorch | BERT | 5.5 | 0.72 Mask-LM accuracy | 8x H100 | Eos | 4.1-0004 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB |
| PyTorch | RetinaNet | 22.5 | 34.0% mAP | 8x B200 | dgx_b200_preview | 4.1-0028 | Mixed | Subset of OpenImages | NVIDIA HGX B200 |
| PyTorch | RetinaNet | 34.3 | 34.0% mAP | 8x H200 | NVIDIA H200 | 4.1-0021 | Mixed | Subset of OpenImages | H200-SXM5-141GB |
| PyTorch | RetinaNet | 35.7 | 34.0% mAP | 8x H100 | Eos | 4.1-0003 | Mixed | Subset of OpenImages | H100-SXM5-80GB |

NVIDIA Performance on MLPerf 4.1’s AI Benchmarks: Multi Node, Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| NVIDIA NeMo | GPT3 | 193.7 | 2.69 log perplexity | 64x B200 | dgx_b200_preview_n8 | 4.1-0029 | Mixed | c4/en/3.0.1 | NVIDIA HGX B200 |
| NVIDIA NeMo | GPT3 | 96.7 | 2.69 log perplexity | 256x H100 | Eos_n32 | 4.1-0009 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 49.8 | 2.69 log perplexity | 512x H100 | Eos_n64 | 4.1-0012 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 3.4 | 2.69 log perplexity | 11,616x H100 | Eos-dfw_n1452 | 4.1-0024 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | LLama2-70B-lora | 4.6 | 0.925 cross entropy loss | 64x H100 | Eos_n8 | 4.1-0015 | Mixed | SCROLLs GovReport | H100-SXM5-80GB |
| NVIDIA NeMo | LLama2-70B-lora | 1.2 | 0.925 cross entropy loss | 1,024x H100 | Eos_n128 | 4.1-0006 | Mixed | SCROLLs GovReport | H100-SXM5-80GB |
| DGL | R-GAT | 2.1 | 72.0% classification | 64x H100 | Eos_n8 | 4.1-0013 | Mixed | IGBH-Full | H100-SXM5-80GB |
| DGL | R-GAT | 0.9 | 72.0% classification | 512x H100 | Eos_n64 | 4.1-0011 | Mixed | IGBH-Full | H100-SXM5-80GB |
| NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 1.3 | 0.80275 AUC | 64x H100 | Eos_n8 | 4.1-0014 | Mixed | Criteo 3.5TB Click Logs | H100-SXM5-80GB |
| NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 1 | 0.80275 AUC | 128x H100 | Eos_n16 | 4.1-0007 | Mixed | Criteo 3.5TB Click Logs | H100-SXM5-80GB |
| NVIDIA NeMo | Stable Diffusion v2.0 | 6.1 | FID <= 90 and CLIP >= 0.15 | 64x H100 | Eos_n8 | 4.1-0015 | Mixed | LAION-400M-filtered | H100-SXM5-80GB |
| NVIDIA NeMo | Stable Diffusion v2.0 | 1.7 | FID <= 90 and CLIP >= 0.15 | 512x H100 | Eos_n64 | 4.1-0012 | Mixed | LAION-400M-filtered | H100-SXM5-80GB |
| NVIDIA NeMo | Stable Diffusion v2.0 | 1.4 | FID <= 90 and CLIP >= 0.15 | 1,024x H100 | Eos_n128 | 4.1-0005 | Mixed | LAION-400M-filtered | H100-SXM5-80GB |
| PyTorch | BERT | 0.9 | 0.72 Mask-LM accuracy | 64x H100 | Eos_n8 | 4.1-0016 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB |
| PyTorch | BERT | 0.1 | 0.72 Mask-LM accuracy | 3,472x H100 | Eos_n434 | 4.1-0010 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB |
| PyTorch | RetinaNet | 6 | 34.0% mAP | 64x H100 | Eos_n8 | 4.1-0017 | Mixed | Subset of OpenImages | H100-SXM5-80GB |
| PyTorch | RetinaNet | 0.8 | 34.0% mAP | 2,528x H100 | Eos_n316 | 4.1-0008 | Mixed | Subset of OpenImages | H100-SXM5-80GB |

MLPerf™ v4.1 Training Closed: 4.1-0000, 4.1-0001, 4.1-0002, 4.1-0003, 4.1-0004, 4.1-0005, 4.1-0006, 4.1-0007, 4.1-0008, 4.1-0009, 4.1-0010, 4.1-0011, 4.1-0012, 4.1-0013, 4.1-0014, 4.1-0015, 4.1-0016, 4.1-0017, 4.1-0018, 4.1-0019, 4.1-0020, 4.1-0021, 4.1-0022, 4.1-0024, 4.1-0025, 4.1-0026, 4.1-0027, 4.1-0028, 4.1-0029 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
For Training rules and guidelines, see https://mlcommons.org/
B200 results are preview submissions
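
A quick way to read the multi-node GPT3 rows is strong-scaling efficiency: how close the measured speedup from adding GPUs comes to linear. A small sketch, with the numbers taken from the table above:

```python
def scaling_efficiency(gpus_a: int, mins_a: float, gpus_b: int, mins_b: float) -> float:
    """Strong-scaling efficiency: measured speedup divided by the ideal
    (linear) speedup expected from the increase in GPU count."""
    speedup = mins_a / mins_b
    ideal = gpus_b / gpus_a
    return speedup / ideal

# GPT3, 256x H100 (96.7 min) vs 512x H100 (49.8 min): near-linear scaling
print(f"{scaling_efficiency(256, 96.7, 512, 49.8):.1%}")    # 97.1%
# GPT3, 512x H100 (49.8 min) vs 11,616x H100 (3.4 min)
print(f"{scaling_efficiency(512, 49.8, 11616, 3.4):.1%}")   # 64.6%
```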


NVIDIA Performance on MLPerf 3.0’s Training HPC Benchmarks: Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| PyTorch | CosmoFlow | 2.1 | Mean average error 0.124 | 512x H100 | eos | 3.0-8006 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | H100-SXM5-80GB |
| PyTorch | DeepCAM | 0.8 | IOU 0.82 | 2,048x H100 | eos | 3.0-8007 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | H100-SXM5-80GB |
| PyTorch | OpenCatalyst | 10.7 | Forces mean absolute error 0.036 | 640x H100 | eos | 3.0-8008 | Mixed | Open Catalyst 2020 (OC20) S2EF 2M training split, ID validation set | H100-SXM5-80GB |
| PyTorch | OpenFold | 7.5 | Local Distance Difference Test (lDDT-Cα) >= 0.8 | 2,080x H100 | eos | 3.0-8009 | Mixed | OpenProteinSet and Protein Data Bank | H100-SXM5-80GB |

MLPerf™ v3.0 Training HPC Closed: 3.0-8006, 3.0-8007, 3.0-8008, 3.0-8009 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
For MLPerf™ v3.0 Training HPC rules and guidelines, see https://mlcommons.org/



LLM Training Performance on NVIDIA Data Center Products


H100 Training Performance



| Framework | Model | Time to Train (days) | Throughput per GPU | GPU | Server | Container Version | Sequence Length | TP | PP | CP | Precision | Global Batch Size | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NeMo | Llama3.1 405B | 36 | 314 tokens/sec | 576x H100 | Eos | nemo:24.09 | 8192 | 8 | 9 | 2 | FP8 | 252 | H100 SXM5 80GB |
| NeMo | Llama3 8B | 0.8 | 13,443 tokens/sec | 8x H100 | Eos | nemo:24.09 | 8192 | 1 | 1 | 2 | FP8 | 128 | H100 SXM5 80GB |
| NeMo | Llama3 70B | 7.3 | 1,557 tokens/sec | 64x H100 | Eos | nemo:24.09 | 8192 | 4 | 4 | 2 | FP8 | 128 | H100 SXM5 80GB |
| NeMo | Nemotron 8B | 0.9 | 12,701 tokens/sec | 64x H100 | Eos | nemo:24.09 | 4096 | 2 | 1 | 1 | FP8 | 256 | H100 SXM5 80GB |
| NeMo | Nemotron 15B | 1.5 | 7,516 tokens/sec | 64x H100 | Eos | nemo:24.09 | 4096 | 4 | 1 | 1 | FP8 | 256 | H100 SXM5 80GB |
| NeMo | Nemotron 22B | 2.3 | 4,980 tokens/sec | 64x H100 | Eos | nemo:24.09 | 4096 | 2 | 4 | 1 | FP8 | 256 | H100 SXM5 80GB |
| NeMo | Nemotron 340B | 32.7 | 346 tokens/sec | 128x H100 | Eos | nemo:24.09 | 4096 | 8 | 8 | 1 | FP8 | 32 | H100 SXM5 80GB |

TP: Tensor Parallelism
PP: Pipeline Parallelism
CP: Context Parallelism
Time to Train is estimated time to train on 1T tokens with 1K GPUs
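
Given that definition, the Time to Train column can be reproduced directly from the Throughput per GPU column. A minimal sketch, assuming "1K GPUs" means 1,024 (which is what makes the table's numbers come out):

```python
SECONDS_PER_DAY = 86_400

def days_to_train(tokens_per_sec_per_gpu: float,
                  n_gpus: int = 1024,
                  total_tokens: float = 1e12) -> float:
    """Estimated days to train on 1T tokens with 1K (1,024) GPUs."""
    return total_tokens / (tokens_per_sec_per_gpu * n_gpus * SECONDS_PER_DAY)

print(round(days_to_train(314)))        # Llama3.1 405B -> 36 days
print(round(days_to_train(13_443), 1))  # Llama3 8B     -> 0.8 days
print(round(days_to_train(1_557), 1))   # Llama3 70B    -> 7.3 days
```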


Converged Training Performance on NVIDIA Data Center GPUs


H200 Training Performance



| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.4.0a0 | Tacotron2 | 65 | 0.56 Training Loss | 496,465 total output mels/sec | 8x H200 | DGX H200 | 24.12-py3 | TF32 | 128 | LJSpeech 1.1 | NVIDIA H200 |
| PyTorch | 2.4.0a0 | WaveGlow | 106 | -5.7 Training Loss | 4,124,433 output samples/sec | 8x H200 | DGX H200 | 24.12-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA H200 |
| PyTorch | 2.4.0a0 | NCF | | 0.96 Hit Rate at 10 | 252,318,096 samples/sec | 8x H200 | DGX H200 | 24.12-py3 | Mixed | 131072 | MovieLens 20M | NVIDIA H200 |
| PyTorch | 2.4.0a0 | FastPitch | 66 | 0.17 Training Loss | 1,465,568 frames/sec | 8x H200 | DGX H200 | 24.12-py3 | TF32 | 32 | LJSpeech 1.1 | NVIDIA H200 |
| PyTorch | 2.4.0a0 | Transformer XL Large | 264 | 17.82 Perplexity | 317,663 total tokens/sec | 8x H200 | DGX H200 | 24.12-py3 | Mixed | 16 | WikiText-103 | NVIDIA H200 |
| PyTorch | 2.4.0a0 | Transformer XL Base | 116 | 21.6 Perplexity | 1,163,450 total tokens/sec | 8x H200 | DGX H200 | 24.12-py3 | Mixed | 128 | WikiText-103 | NVIDIA H200 |
| PyTorch | 2.4.0a0 | EfficientDet-D0 | 303 | 0.33 BBOX mAP | 2,793 images/sec | 8x H200 | DGX H200 | 24.12-py3 | Mixed | 150 | COCO 2017 | NVIDIA H200 |
| PyTorch | 2.4.0a0 | HiFiGAN | 915 | 9.75 Training Loss | 120,606 total output mels/sec | 8x H200 | DGX H200 | 24.12-py3 | Mixed | 16 | LJSpeech 1.1 | NVIDIA H200 |
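
The "Mixed" entries in the Precision column refer to automatic mixed precision training. A minimal PyTorch sketch of the technique, using a hypothetical toy model and random data rather than the benchmark implementations:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 10)
).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()  # loss scaling keeps FP16 gradients from underflowing

for _ in range(10):
    x = torch.randn(128, 512, device="cuda")
    y = torch.randint(0, 10, (128,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # eligible ops run in reduced precision
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()    # backward pass on the scaled loss
    scaler.step(optimizer)           # unscales grads, skips the step on inf/NaN
    scaler.update()
```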

H100 Training Performance



| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.4.0a0 | Tacotron2 | | . Training Loss | 477,113 total output mels/sec | 8x H100 | DGX H100 | 24.12-py3 | Mixed | 128 | LJSpeech 1.1 | H100-SXM5-80GB |
| PyTorch | 2.4.0a0 | WaveGlow | | . Training Loss | 3,809,464 output samples/sec | 8x H100 | DGX H100 | 24.12-py3 | Mixed | 10 | LJSpeech 1.1 | H100-SXM5-80GB |
| PyTorch | 2.4.0a0 | NCF | | . Hit Rate at 10 | 212,174,107 samples/sec | 8x H100 | DGX H100 | 24.12-py3 | TF32 | 131072 | MovieLens 20M | H100-SXM5-80GB |
| PyTorch | 2.4.0a0 | FastPitch | | . Training Loss | 1,431,758 frames/sec | 8x H100 | DGX H100 | 24.12-py3 | TF32 | 32 | LJSpeech 1.1 | H100-SXM5-80GB |

A30 Training Performance


| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.4.0a0 | Tacotron2 | 129 | 0.53 Training Loss | 237,526 total output mels/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 104 | LJSpeech 1.1 | NVIDIA A30 |
| PyTorch | 2.4.0a0 | WaveGlow | 402 | -5.88 Training Loss | 1,047,359 output samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA A30 |
| PyTorch | 2.4.0a0 | GNMT v2 | 49 | 24.23 BLEU Score | 306,590 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 128 | wmt16-en-de | NVIDIA A30 |
| PyTorch | 2.4.0a0 | NCF | 1 | 0.96 Hit Rate at 10 | 41,902,951 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 131072 | MovieLens 20M | NVIDIA A30 |
| PyTorch | 2.4.0a0 | FastPitch | 153 | 0.17 Training Loss | 547,338 frames/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 16 | LJSpeech 1.1 | NVIDIA A30 |
| PyTorch | 2.4.0a0 | Transformer XL Base | 196 | 22.82 Perplexity | 168,548 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 32 | WikiText-103 | NVIDIA A30 |
| PyTorch | 2.4.0a0 | EfficientNet-B0 | 785 | 77.15 Top 1 | 11,335 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 128 | Imagenet2012 | NVIDIA A30 |
| PyTorch | 2.4.0a0 | EfficientNet-WideSE-B0 | 800 | 77.08 Top 1 | 11,029 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 128 | Imagenet2012 | NVIDIA A30 |
| PyTorch | 2.4.0a0 | MoFlow | 99 | 86.8 NUV | 12,284 molecules/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 512 | ZINC | NVIDIA A30 |

A10 Training Performance


| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.4.0a0 | Tacotron2 | 145 | 0.53 Training Loss | 210,315 total output mels/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 104 | LJSpeech 1.1 | NVIDIA A10 |
| PyTorch | 2.4.0a0 | WaveGlow | 543 | -5.8 Training Loss | 776,028 output samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA A10 |
| PyTorch | 2.4.0a0 | GNMT v2 | 57 | 24.29 BLEU Score | 262,936 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 128 | wmt16-en-de | NVIDIA A10 |
| PyTorch | 2.4.0a0 | NCF | 2 | 0.96 Hit Rate at 10 | 33,005,044 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.09-py3 | TF32 | 131072 | MovieLens 20M | NVIDIA A10 |
| PyTorch | 2.4.0a0 | FastPitch | 180 | 0.17 Training Loss | 462,052 frames/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 16 | LJSpeech 1.1 | NVIDIA A10 |
| PyTorch | 2.4.0a0 | Transformer XL Base | 262 | 22.82 Perplexity | 126,073 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 32 | WikiText-103 | NVIDIA A10 |
| PyTorch | 2.4.0a0 | EfficientNet-B0 | 1,035 | 77.06 Top 1 | 8,508 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 128 | Imagenet2012 | NVIDIA A10 |
| PyTorch | 2.4.0a0 | EfficientNet-WideSE-B0 | 1,061 | 77.23 Top 1 | 8,301 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 128 | Imagenet2012 | NVIDIA A10 |
| PyTorch | 2.4.0a0 | MoFlow | 100 | 88.14 NUV | 12,237 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.09-py3 | Mixed | 512 | Medical Segmentation Decathlon | NVIDIA A10 |



View More Performance Data

AI Inference

Real-world inference demands high throughput and low latency with maximum efficiency across use cases. An industry-leading solution lets customers quickly deploy AI models into production with the highest performance from data center to edge.


AI Pipeline

NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.
