AI Training
Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. Time to train to that accuracy is the best measure of whether an AI system is ready to be deployed in the field and deliver meaningful results.
NVIDIA Performance on MLPerf 3.1 Training Benchmarks
NVIDIA Performance on MLPerf 3.1’s AI Benchmarks: Single Node, Closed Division
Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version
---|---|---|---|---|---|---|---|---|---
NeMo | Stable Diffusion | 46.8 | FID<=90 and CLIP>=0.15 | 8x H100 | XE9680x8H100-SXM-80GB | 3.1-2019 | Mixed | LAION-400M-filtered | H100-SXM5-80GB
MXNet | ResNet-50 v1.5 | 13.4 | 75.90% classification | 8x H100 | ESC-N8-E11 | 3.1-2011 | Mixed | ImageNet | H100-SXM5-80GB
 | 3D U-Net | 13.1 | 0.908 Mean DICE score | 8x H100 | AS-8125GS-TNHR | 3.1-2068 | Mixed | KiTS19 | H100-SXM5-80GB
PyTorch | BERT | 5.4 | 0.72 Mask-LM accuracy | 8x H100 | ESC-N8-E11 | 3.1-2011 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB
 | Mask R-CNN | 19.2 | 0.377 Box min AP and 0.339 Mask min AP | 8x H100 | Eos_n1 | 3.1-2048 | Mixed | COCO2017 | H100-SXM5-80GB
 | RNN-T | 16.2 | 0.058 Word Error Rate | 8x H100 | GIGABYTE G593-ZD2 | 3.1-2028 | Mixed | LibriSpeech | H100-SXM5-80GB
 | RetinaNet | 36.0 | 34.0% mAP | 8x H100 | ESC-N8-E11 | 3.1-2011 | Mixed | A subset of OpenImages | H100-SXM5-80GB
NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 3.9 | 0.80275 AUC | 8x H100 | Eos_n1 | 3.1-2047 | Mixed | Criteo 4TB | H100-SXM5-80GB
NVIDIA Performance on MLPerf 3.1’s AI Benchmarks: Multi Node, Closed Division
Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version
---|---|---|---|---|---|---|---|---|---
NVIDIA NeMo | GPT3 | 58.3 | 2.69 log perplexity | 512x H100 | Eos_n64 | 3.1-2057 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB
 | | 40.6 | 2.69 log perplexity | 768x H100 | Eos_n96 | 3.1-2065 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB
 | | 8.6 | 2.69 log perplexity | 4,096x H100 | Eos-dfw_n512 | 3.1-2008 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB
 | | 6.0 | 2.69 log perplexity | 6,144x H100 | Eos-dfw_n768 | 3.1-2009 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB
 | | 4.9 | 2.69 log perplexity | 8,192x H100 | Eos-dfw_n1024 | 3.1-2005 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB
 | | 4.1 | 2.69 log perplexity | 10,240x H100 | Eos-dfw_n1280 | 3.1-2006 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB
 | | 3.9 | 2.69 log perplexity | 10,752x H100 | Eos-dfw_n1344 | 3.1-2007 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB
 | Stable Diffusion | 10.0 | FID<=90 and CLIP>=0.15 | 64x H100 | Eos_n8 | 3.1-2060 | Mixed | LAION-400M-filtered | H100-SXM5-80GB
 | | 2.9 | FID<=90 and CLIP>=0.15 | 512x H100 | Eos_n64 | 3.1-2055 | Mixed | LAION-400M-filtered | H100-SXM5-80GB
 | | 2.5 | FID<=90 and CLIP>=0.15 | 1,024x H100 | Eos_n128 | 3.1-2050 | Mixed | LAION-400M-filtered | H100-SXM5-80GB
MXNet | ResNet-50 v1.5 | 2.5 | 75.90% classification | 64x H100 | Eos_n8 | 3.1-2058 | Mixed | ImageNet | H100-SXM5-80GB
 | | 0.2 | 75.90% classification | 3,584x H100 | coreweave_hgxh100_n448_ngc23.04_mxnet | 3.1-2010 | Mixed | ImageNet | H100-SXM5-80GB
 | 3D U-Net | 1.9 | 0.908 Mean DICE score | 72x H100 | Eos_n9 | 3.1-2063 | Mixed | KiTS19 | H100-SXM5-80GB
 | | 0.8 | 0.908 Mean DICE score | 768x H100 | Eos_n96 | 3.1-2064 | Mixed | KiTS19 | H100-SXM5-80GB
PyTorch | BERT | 0.9 | 0.72 Mask-LM accuracy | 64x H100 | Eos_n8 | 3.1-2061 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB
 | | 0.1 | 0.72 Mask-LM accuracy | 3,472x H100 | Eos_n434 | 3.1-2053 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB
 | Mask R-CNN | 4.3 | 0.377 Box min AP and 0.339 Mask min AP | 64x H100 | Eos_n8 | 3.1-2061 | Mixed | COCO2017 | H100-SXM5-80GB
 | | 1.5 | 0.377 Box min AP and 0.339 Mask min AP | 384x H100 | Eos_n48 | 3.1-2054 | Mixed | COCO2017 | H100-SXM5-80GB
 | RNN-T | 4.2 | 0.058 Word Error Rate | 64x H100 | Eos_n8 | 3.1-2061 | Mixed | LibriSpeech | H100-SXM5-80GB
 | | 1.7 | 0.058 Word Error Rate | 512x H100 | Eos_n64 | 3.1-2056 | Mixed | LibriSpeech | H100-SXM5-80GB
 | RetinaNet | 6.1 | 34.0% mAP | 64x H100 | Eos_n8 | 3.1-2062 | Mixed | A subset of OpenImages | H100-SXM5-80GB
 | | 0.9 | 34.0% mAP | 2,048x H100 | Eos_n256 | 3.1-2052 | Mixed | A subset of OpenImages | H100-SXM5-80GB
NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 1.4 | 0.80275 AUC | 64x H100 | Eos_n8 | 3.1-2059 | Mixed | Criteo 4TB | H100-SXM5-80GB
 | | 1.0 | 0.80275 AUC | 128x H100 | Eos_n16 | 3.1-2051 | Mixed | Criteo 4TB | H100-SXM5-80GB
MLPerf™ v3.1 Training Closed: 3.1-2005, 3.1-2006, 3.1-2007, 3.1-2008, 3.1-2009, 3.1-2010, 3.1-2011, 3.1-2019, 3.1-2028, 3.1-2047, 3.1-2048, 3.1-2050, 3.1-2051, 3.1-2052, 3.1-2053, 3.1-2054, 3.1-2055, 3.1-2056, 3.1-2057, 3.1-2058, 3.1-2059, 3.1-2060, 3.1-2061, 3.1-2062, 3.1-2063, 3.1-2064, 3.1-2065, 3.1-2068 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
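The multi-node GPT3 rows also show how time to train scales with cluster size. As a back-of-the-envelope illustration (not an official MLPerf metric), the sketch below computes strong-scaling efficiency relative to the 512-GPU result, using only numbers from the table above:

```python
# Strong-scaling efficiency of the GPT3 time-to-train results,
# relative to the 512-GPU baseline (values taken from the table).
baseline_gpus, baseline_minutes = 512, 58.3

runs = [(768, 40.6), (4_096, 8.6), (6_144, 6.0),
        (8_192, 4.9), (10_240, 4.1), (10_752, 3.9)]

for gpus, minutes in runs:
    ideal_speedup = gpus / baseline_gpus
    actual_speedup = baseline_minutes / minutes
    efficiency = actual_speedup / ideal_speedup
    print(f"{gpus:>6} GPUs: {actual_speedup:5.2f}x vs ideal "
          f"{ideal_speedup:5.2f}x ({efficiency:.0%} efficiency)")
```

For example, growing from 512 to 8,192 GPUs (16x) cuts time to train by about 11.9x, i.e. roughly 74% scaling efficiency.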
NVIDIA Performance on MLPerf 3.0’s Training HPC Benchmarks: Closed Division
Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version
---|---|---|---|---|---|---|---|---|---
PyTorch | CosmoFlow | 2.1 | Mean average error 0.124 | 512x H100 | eos | 3.0-8006 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | H100-SXM5-80GB
 | DeepCAM | 0.8 | IOU 0.82 | 2,048x H100 | eos | 3.0-8007 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | H100-SXM5-80GB
 | OpenCatalyst | 10.7 | Forces mean absolute error 0.036 | 640x H100 | eos | 3.0-8008 | Mixed | Open Catalyst 2020 (OC20) S2EF 2M training split, ID validation set | H100-SXM5-80GB
 | OpenFold | 7.5 | Local Distance Difference Test (lDDT-Cα) >= 0.8 | 2,080x H100 | eos | 3.0-8009 | Mixed | OpenProteinSet and Protein Data Bank | H100-SXM5-80GB
MLPerf™ v3.0 Training HPC Closed: 3.0-8006, 3.0-8007, 3.0-8008, 3.0-8009 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
LLM Training Performance on NVIDIA Data Center Products
H100 Training Performance
Framework | Framework Version | Network | Time to Train (days) | Throughput per GPU | GPU | Server | Container | Sequence Length | TP | PP | Precision | Global Batch Size | GPU Version
---|---|---|---|---|---|---|---|---|---|---|---|---|---
NeMo | 1.23 | GPT3 5B | 0.5 | 23,574 tokens/sec | 64x H100 | Eos | nemo:24.03 | 2,048 | 1 | 1 | FP8 | 2,048 | H100 SXM5 80GB
 | 1.23 | GPT3 20B | 2 | 5,528 tokens/sec | 64x H100 | Eos | nemo:24.03 | 2,048 | 2 | 1 | FP8 | 256 | H100 SXM5 80GB
 | 1.23 | Llama2 7B | 0.7 | 16,290 tokens/sec | 8x H100 | Eos | nemo:24.03 | 4,096 | 1 | 1 | FP8 | 128 | H100 SXM5 80GB
 | 1.23 | Llama2 13B | 1.4 | 8,317 tokens/sec | 16x H100 | Eos | nemo:24.03 | 4,096 | 1 | 4 | FP8 | 128 | H100 SXM5 80GB
 | 1.23 | Llama2 70B | 6.6 | 1,725 tokens/sec | 64x H100 | Eos | nemo:24.03 | 4,096 | 4 | 4 | FP8 | 128 | H100 SXM5 80GB
TP: Tensor Parallelism
PP: Pipeline Parallelism
Time to Train is the estimated time to train on 1T tokens with 1K GPUs.
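The Time to Train estimate appears to follow directly from the per-GPU throughput column, assuming "1K GPUs" means 1,024 and perfectly linear scaling of the per-GPU token rate; a minimal sketch under those assumptions reproduces the table's values:

```python
# Reproduce the estimated "Time to Train (days)" from per-GPU throughput.
# Assumptions: "1K GPUs" = 1,024, and the per-GPU tokens/sec from the
# table scales linearly to the full cluster.
TOKENS = 1e12            # 1T-token training run
GPUS = 1024
SECONDS_PER_DAY = 86_400

per_gpu_throughput = {   # tokens/sec per GPU, from the table
    "GPT3 5B":    23_574,
    "GPT3 20B":    5_528,
    "Llama2 7B":  16_290,
    "Llama2 13B":  8_317,
    "Llama2 70B":  1_725,
}

for name, tps in per_gpu_throughput.items():
    days = TOKENS / (tps * GPUS * SECONDS_PER_DAY)
    print(f"{name:<11} ~{days:.1f} days")
# Matches the table: 0.5, 2.0, 0.7, 1.4, and 6.6 days respectively.
```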
Converged Training Performance on NVIDIA Data Center GPUs
H100 Training Performance
Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
---|---|---|---|---|---|---|---|---|---|---|---|---
PyTorch | 2.3.0a0 | Tacotron2 | 67 | .56 Training Loss | 469,109 total output mels/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 128 | LJSpeech 1.1 | H100 SXM5 80GB
 | 2.3.0a0 | WaveGlow | 119 | -5.8 Training Loss | 3,645,916 output samples/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 10 | LJSpeech 1.1 | H100 SXM5 80GB
 | 2.3.0a0 | GNMT v2 | 9 | 24.15 BLEU Score | 1,699,570 total tokens/sec | 8x H100 | DGX H100 | 23.12-py3 | Mixed | 128 | wmt16-en-de | H100 SXM5 80GB
 | 2.3.0a0 | NCF | 0.27 | .96 Hit Rate at 10 | 218,094,053 samples/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 131072 | MovieLens 20M | H100 SXM5 80GB
 | 2.3.0a0 | FastPitch | 75 | .17 Training Loss | 1,331,733 frames/sec | 8x H100 | DGX H100 | 24.02-py3 | TF32 | 32 | LJSpeech 1.1 | H100 SXM5 80GB
 | 2.3.0a0 | Transformer XL Large | 318 | 17.83 Perplexity | 262,462 total tokens/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 16 | WikiText-103 | H100 SXM5 80GB
 | 2.3.0a0 | Transformer XL Base | 141 | 21.61 Perplexity | 952,253 total tokens/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 128 | WikiText-103 | H100 SXM5 80GB
 | 2.3.0a0 | EfficientNet-B4 | 1,667 | 82.02 Top 1 | 5,231 images/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 128 | Imagenet2012 | H100 SXM5 80GB
 | 2.1.0a0 | EfficientDet-D0 | 325 | .33 BBOX mAP | 2,658 images/sec | 8x H100 | DGX H100 | 23.10-py3 | Mixed | 150 | COCO 2017 | H100 SXM5 80GB
 | 2.3.0a0 | EfficientNet-WideSE-B4 | 1,673 | 82.01 Top 1 | 5,218 images/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 128 | Imagenet2012 | H100 SXM5 80GB
 | 2.2.0a0 | TFT-Electricity | 2 | .03 Test P90 | 145,082 items/sec | 8x H100 | DGX H100 | 23.12-py3 | Mixed | 1024 | Electricity | H100 SXM5 80GB
 | 2.3.0a0 | HiFiGAN | 948 | 9.42 Training Loss | 115,461 total output mels/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 16 | LJSpeech-1.1 | H100 SXM5 80GB
 | 2.3.0a0 | GPUNet-0 | 1,052 | 78.91 Top 1 | 9,950 images/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 192 | Imagenet2012 | H100 SXM5 80GB
 | 2.3.0a0 | GPUNet-1 | 960 | 80.45 Top 1 | 10,946 images/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 192 | Imagenet2012 | H100 SXM5 80GB
 | 2.3.0a0 | MoFlow | 35 | 89.67 NUV | 46,451 molecules/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 512 | ZINC | H100 SXM5 80GB
TensorFlow | 2.13.0 | U-Net Medical | 1 | .89 DICE Score | 2,139 images/sec | 8x H100 | DGX H100 | 23.10-py3 | Mixed | 8 | EM segmentation challenge | H100 SXM5 80GB
 | 2.15.0 | Electra Fine Tuning | 2 | 92.59 F1 | 5,062 sequences/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 32 | SQuAD v1.1 | H100 SXM5 80GB
 | 2.13.0 | Wide and Deep | 4 | .66 MAP at 12 | 12,217,033 samples/sec | 8x H100 | DGX H100 | 23.10-py3 | Mixed | 16384 | Kaggle Outbrain Click Prediction | H100 SXM5 80GB
A30 Training Performance
Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
---|---|---|---|---|---|---|---|---|---|---|---|---
PyTorch | 2.3.0a0 | Tacotron2 | 131 | .52 Training Loss | 232,954 total output mels/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 104 | LJSpeech 1.1 | A30
 | 2.3.0a0 | WaveGlow | 403 | . Training Loss | 1,042,579 output samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 10 | LJSpeech 1.1 | A30
 | 2.3.0a0 | GNMT v2 | 49 | 24.21 BLEU Score | 309,310 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 128 | wmt16-en-de | A30
 | 2.3.0a0 | NCF | 1 | .96 Hit Rate at 10 | 41,848,626 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 131072 | MovieLens 20M | A30
 | 2.3.0a0 | FastPitch | 156 | .17 Training Loss | 545,724 frames/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 16 | LJSpeech 1.1 | A30
 | 2.3.0a0 | Transformer XL Base | 198 | 22.87 Perplexity | 168,704 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 32 | WikiText-103 | A30
 | 2.3.0a0 | EfficientNet-B0 | 793 | 77.13 Top 1 | 11,235 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 128 | Imagenet2012 | A30
 | 2.3.0a0 | EfficientNet-WideSE-B0 | 820 | 77.21 Top 1 | 10,863 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 128 | Imagenet2012 | A30
 | 2.2.0a0 | MoFlow | 100 | 87.86 NUV | 12,351 molecules/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.12-py3 | Mixed | 512 | ZINC | A30
TensorFlow | 2.13.0 | U-Net Medical | 4 | .89 DICE Score | 460 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.10-py3 | Mixed | 8 | EM segmentation challenge | A30
 | 2.15.0 | Electra Fine Tuning | 5 | 92.63 F1 | 1,024 sequences/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 16 | SQuAD v1.1 | A30
 | 2.14.0 | SIM | 1 | .81 AUC | 2,481,945 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.12-py3 | Mixed | 16384 | Amazon Reviews | A30
A10 Training Performance
Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
---|---|---|---|---|---|---|---|---|---|---|---|---
PyTorch | 2.3.0a0 | Tacotron2 | 144 | .53 Training Loss | 214,246 total output mels/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 104 | LJSpeech 1.1 | A10
 | 2.3.0a0 | WaveGlow | 541 | -5.73 Training Loss | 776,764 output samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 10 | LJSpeech 1.1 | A10
 | 2.3.0a0 | GNMT v2 | 53 | 24.2 BLEU Score | 282,447 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 128 | wmt16-en-de | A10
 | 2.3.0a0 | NCF | 2 | .96 Hit Rate at 10 | 32,920,397 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | TF32 | 131072 | MovieLens 20M | A10
 | 2.3.0a0 | FastPitch | 180 | .17 Training Loss | 460,415 frames/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 16 | LJSpeech 1.1 | A10
 | 2.3.0a0 | EfficientNet-B0 | 1,045 | 77.11 Top 1 | 8,625 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 128 | Imagenet2012 | A10
 | 2.3.0a0 | EfficientNet-WideSE-B0 | 1,076 | 77.31 Top 1 | 8,487 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 128 | Imagenet2012 | A10
 | 2.2.0a0 | MoFlow | 93 | 86.86 NUV | 13,184 molecules/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.12-py3 | Mixed | 512 | ZINC | A10
TensorFlow | 2.13.0 | U-Net Medical | 4 | .89 DICE Score | 352 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.10-py3 | Mixed | 8 | EM segmentation challenge | A10
 | 2.15.0 | Electra Fine Tuning | 5 | 92.52 F1 | 826 sequences/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 16 | SQuAD v1.1 | A10
 | 2.14.0 | SIM | 1 | .8 AUC | 2,346,013 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.12-py3 | Mixed | 16384 | Amazon Reviews | A10
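A few networks appear in all three converged-training tables (H100 SXM5 80GB, A30, A10), which allows a rough cross-GPU comparison. The sketch below computes 8-GPU throughput ratios from the tables' numbers; note that batch sizes and container versions are not identical across GPUs, so these ratios are indicative only:

```python
# Relative 8-GPU training throughput for networks present in all three
# converged-training tables above (values copied from the tables).
throughput = {
    "GNMT v2 (tokens/sec)":       {"H100": 1_699_570, "A30": 309_310, "A10": 282_447},
    "NCF (samples/sec)":          {"H100": 218_094_053, "A30": 41_848_626, "A10": 32_920_397},
    "U-Net Medical (images/sec)": {"H100": 2_139, "A30": 460, "A10": 352},
    "Electra FT (sequences/sec)": {"H100": 5_062, "A30": 1_024, "A10": 826},
}

for net, by_gpu in throughput.items():
    h100 = by_gpu["H100"]
    ratios = ", ".join(f"{gpu} = {rate / h100:.2f}x"
                       for gpu, rate in by_gpu.items() if gpu != "H100")
    print(f"{net}: H100 = 1.00x, {ratios}")
```

For GNMT v2, for instance, the 8x H100 result is roughly 5.5x the 8x A30 throughput and 6.0x the 8x A10 throughput.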
AI Inference
Real-world inferencing demands high throughput and low latency with maximum efficiency across use cases. An industry-leading solution lets customers quickly deploy AI models into real-world production with the highest performance from data center to edge.
AI Pipeline
NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.