AI Training
Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. Measuring the time to train to that accuracy is the best methodology for testing whether AI systems are ready to be deployed in the field and deliver meaningful results.
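As a rough illustration of this time-to-train methodology, the sketch below trains a model until a validation quality target is reached and reports the elapsed wall-clock minutes. It is a minimal PyTorch-style outline, not the MLPerf harness; the model, data loaders, `evaluate` helper, and the 75.9% target are placeholder assumptions.

```python
# Minimal sketch (not the MLPerf harness): train until a quality target is met,
# mirroring the "time to train to a specified accuracy" methodology described above.
# `model`, `train_loader`, `val_loader`, and `evaluate` are placeholders.
import time
import torch

def train_to_convergence(model, train_loader, val_loader, evaluate,
                         target_accuracy=0.759, max_epochs=90, lr=0.1):
    """Return wall-clock minutes needed to reach `target_accuracy`, or None."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss()
    start = time.time()
    for epoch in range(max_epochs):
        model.train()
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        accuracy = evaluate(model, val_loader)   # e.g. top-1 on the validation set
        if accuracy >= target_accuracy:          # quality target reached: stop the clock
            return (time.time() - start) / 60.0
    return None                                  # did not converge within max_epochs
```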
NVIDIA Performance on MLPerf 3.1 Training Benchmarks
NVIDIA Performance on MLPerf 3.1’s AI Benchmarks: Single Node, Closed Division
| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| NeMo | Stable Diffusion | 46.8 | FID <= 90 and CLIP >= 0.15 | 8x H100 | XE9680x8H100-SXM-80GB | 3.1-2019 | Mixed | LAION-400M-filtered | H100-SXM5-80GB |
| MXNet | ResNet-50 v1.5 | 13.4 | 75.90% classification | 8x H100 | ESC-N8-E11 | 3.1-2011 | Mixed | ImageNet | H100-SXM5-80GB |
| MXNet | 3D U-Net | 13.1 | 0.908 Mean DICE score | 8x H100 | AS-8125GS-TNHR | 3.1-2068 | Mixed | KiTS19 | H100-SXM5-80GB |
| PyTorch | BERT | 5.4 | 0.72 Mask-LM accuracy | 8x H100 | ESC-N8-E11 | 3.1-2011 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB |
| PyTorch | Mask R-CNN | 19.2 | 0.377 Box min AP and 0.339 Mask min AP | 8x H100 | Eos_n1 | 3.1-2048 | Mixed | COCO2017 | H100-SXM5-80GB |
| PyTorch | RNN-T | 16.2 | 0.058 Word Error Rate | 8x H100 | GIGABYTE G593-ZD2 | 3.1-2028 | Mixed | LibriSpeech | H100-SXM5-80GB |
| PyTorch | RetinaNet | 36.0 | 34.0% mAP | 8x H100 | ESC-N8-E11 | 3.1-2011 | Mixed | A subset of OpenImages | H100-SXM5-80GB |
| NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 3.9 | 0.80275 AUC | 8x H100 | Eos_n1 | 3.1-2047 | Mixed | Criteo 4TB | H100-SXM5-80GB |
NVIDIA Performance on MLPerf 3.1’s AI Benchmarks: Multi Node, Closed Division
| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| NVIDIA NeMo | GPT3 | 58.3 | 2.69 log perplexity | 512x H100 | Eos_n64 | 3.1-2057 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 40.6 | 2.69 log perplexity | 768x H100 | Eos_n96 | 3.1-2065 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 8.6 | 2.69 log perplexity | 4,096x H100 | Eos-dfw_n512 | 3.1-2008 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 6.0 | 2.69 log perplexity | 6,144x H100 | Eos-dfw_n768 | 3.1-2009 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 4.9 | 2.69 log perplexity | 8,192x H100 | Eos-dfw_n1024 | 3.1-2005 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 4.1 | 2.69 log perplexity | 10,240x H100 | Eos-dfw_n1280 | 3.1-2006 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 3.9 | 2.69 log perplexity | 10,752x H100 | Eos-dfw_n1344 | 3.1-2007 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | Stable Diffusion | 10.0 | FID <= 90 and CLIP >= 0.15 | 64x H100 | Eos_n8 | 3.1-2060 | Mixed | LAION-400M-filtered | H100-SXM5-80GB |
| NVIDIA NeMo | Stable Diffusion | 2.9 | FID <= 90 and CLIP >= 0.15 | 512x H100 | Eos_n64 | 3.1-2055 | Mixed | LAION-400M-filtered | H100-SXM5-80GB |
| NVIDIA NeMo | Stable Diffusion | 2.5 | FID <= 90 and CLIP >= 0.15 | 1,024x H100 | Eos_n128 | 3.1-2050 | Mixed | LAION-400M-filtered | H100-SXM5-80GB |
| MXNet | ResNet-50 v1.5 | 2.5 | 75.90% classification | 64x H100 | Eos_n8 | 3.1-2058 | Mixed | ImageNet | H100-SXM5-80GB |
| MXNet | ResNet-50 v1.5 | 0.2 | 75.90% classification | 3,584x H100 | coreweave_hgxh100_n448_ngc23.04_mxnet | 3.1-2010 | Mixed | ImageNet | H100-SXM5-80GB |
| MXNet | 3D U-Net | 1.9 | 0.908 Mean DICE score | 72x H100 | Eos_n9 | 3.1-2063 | Mixed | KiTS19 | H100-SXM5-80GB |
| MXNet | 3D U-Net | 0.8 | 0.908 Mean DICE score | 768x H100 | Eos_n96 | 3.1-2064 | Mixed | KiTS19 | H100-SXM5-80GB |
| PyTorch | BERT | 0.9 | 0.72 Mask-LM accuracy | 64x H100 | Eos_n8 | 3.1-2061 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB |
| PyTorch | BERT | 0.1 | 0.72 Mask-LM accuracy | 3,472x H100 | Eos_n434 | 3.1-2053 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB |
| PyTorch | Mask R-CNN | 4.3 | 0.377 Box min AP and 0.339 Mask min AP | 64x H100 | Eos_n8 | 3.1-2061 | Mixed | COCO2017 | H100-SXM5-80GB |
| PyTorch | Mask R-CNN | 1.5 | 0.377 Box min AP and 0.339 Mask min AP | 384x H100 | Eos_n48 | 3.1-2054 | Mixed | COCO2017 | H100-SXM5-80GB |
| PyTorch | RNN-T | 4.2 | 0.058 Word Error Rate | 64x H100 | Eos_n8 | 3.1-2061 | Mixed | LibriSpeech | H100-SXM5-80GB |
| PyTorch | RNN-T | 1.7 | 0.058 Word Error Rate | 512x H100 | Eos_n64 | 3.1-2056 | Mixed | LibriSpeech | H100-SXM5-80GB |
| PyTorch | RetinaNet | 6.1 | 34.0% mAP | 64x H100 | Eos_n8 | 3.1-2062 | Mixed | A subset of OpenImages | H100-SXM5-80GB |
| PyTorch | RetinaNet | 0.9 | 34.0% mAP | 2,048x H100 | Eos_n256 | 3.1-2052 | Mixed | A subset of OpenImages | H100-SXM5-80GB |
| NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 1.4 | 0.80275 AUC | 64x H100 | Eos_n8 | 3.1-2059 | Mixed | Criteo 4TB | H100-SXM5-80GB |
| NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 1.0 | 0.80275 AUC | 128x H100 | Eos_n16 | 3.1-2051 | Mixed | Criteo 4TB | H100-SXM5-80GB |
MLPerf™ v3.1 Training Closed: 3.1-2005, 3.1-2006, 3.1-2007, 3.1-2008, 3.1-2009, 3.1-2010, 3.1-2011, 3.1-2019, 3.1-2028, 3.1-2047, 3.1-2048, 3.1-2050, 3.1-2051, 3.1-2052, 3.1-2053, 3.1-2054, 3.1-2055, 3.1-2056, 3.1-2057, 3.1-2058, 3.1-2059, 3.1-2060, 3.1-2061, 3.1-2062, 3.1-2063, 3.1-2064, 3.1-2065, 3.1-2068 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
Training rules and guidelines are available from MLCommons (https://mlcommons.org/).
NVIDIA Performance on MLPerf 3.0’s Training HPC Benchmarks: Closed Division
| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| PyTorch | CosmoFlow | 2.1 | Mean average error 0.124 | 512x H100 | eos | 3.0-8006 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | H100-SXM5-80GB |
| PyTorch | DeepCAM | 0.8 | IOU 0.82 | 2,048x H100 | eos | 3.0-8007 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | H100-SXM5-80GB |
| PyTorch | OpenCatalyst | 10.7 | Forces mean absolute error 0.036 | 640x H100 | eos | 3.0-8008 | Mixed | Open Catalyst 2020 (OC20) S2EF 2M training split, ID validation set | H100-SXM5-80GB |
| PyTorch | OpenFold | 7.5 | Local Distance Difference Test (lDDT-Cα) >= 0.8 | 2,080x H100 | eos | 3.0-8009 | Mixed | OpenProteinSet and Protein Data Bank | H100-SXM5-80GB |
MLPerf™ v3.0 Training HPC Closed: 3.0-8006, 3.0-8007, 3.0-8008, 3.0-8009 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
MLPerf™ v3.0 Training HPC rules and guidelines are available from MLCommons (https://mlcommons.org/).
LLM Training Performance on NVIDIA Data Center Products
H100 Training Performance
| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Sequence Length | TP | PP | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NeMo | 1.19.0 | GPT3 5B | 1,290,000 tokens/sec | 64x H100 | DGX H100 | 23.05-py3 | 2048 | 1 | 1 | FP8 | 16 | PILE | H100 SXM5-80GB |
| NeMo | 1.19.0 | GPT3 20B | 311,000 tokens/sec | 64x H100 | DGX H100 | 23.05-py3 | 2048 | 4 | 1 | FP8 | 4 | PILE | H100 SXM5-80GB |
| NeMo | 1.19.0 | GPT3 175B | 86,100 tokens/sec | 128x H100 | DGX H100 | 23.05-py3 | 2048 | 4 | 8 | FP8 | 2 | PILE | H100 SXM5-80GB |
| NeMo | 1.21.0 | Llama2 7B | 51,100 tokens/sec | 8x H100 | DGX H100 | 23.08-py3 | 4096 | 1 | 1 | FP8 | 16 | PILE | H100 SXM5-80GB |
| NeMo | 1.21.0 | Llama2 13B | 50,900 tokens/sec | 16x H100 | DGX H100 | 23.08-py3 | 4096 | 2 | 1 | FP8 | 8 | PILE | H100 SXM5-80GB |
TP: Tensor Parallelism
PP: Pipeline Parallelism
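A quick way to read the TP and PP columns: the GPUs not consumed by tensor and pipeline parallelism form the data-parallel dimension, and the reported throughput can be turned into a rough time-to-train for a chosen token budget. The sketch below uses the GPT3 175B row above (128x H100, TP=4, PP=8, 86,100 tokens/sec); the 300B-token budget is an illustrative assumption, not a figure from the table.

```python
# Back-of-the-envelope sketch for interpreting the table above. The parallelism layout
# and throughput come from the GPT3 175B row (128x H100, TP=4, PP=8, 86,100 tokens/sec);
# the 300B-token budget below is an illustrative assumption.
def data_parallel_size(world_size: int, tp: int, pp: int) -> int:
    """Number of data-parallel replicas once tensor and pipeline parallelism are carved out."""
    assert world_size % (tp * pp) == 0
    return world_size // (tp * pp)

def days_to_train(token_budget: float, tokens_per_sec: float) -> float:
    """Idealized training time in days at a constant measured throughput."""
    return token_budget / tokens_per_sec / 86_400

if __name__ == "__main__":
    print(data_parallel_size(128, tp=4, pp=8))     # -> 4 data-parallel replicas
    print(round(days_to_train(300e9, 86_100), 1))  # -> ~40.3 days at steady throughput
```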
H100 Multi-Node Scaling Training Performance
| Framework | Framework Version | Network | Throughput | Number of Nodes | GPU | Server | Container | Sequence Length | TP | PP | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NeMo | 1.19.0 | GPT3 175B | 86,400 tokens/sec | 16 | 128x H100 | DGX H100 | 23.05-py3 | 2048 | 4 | 8 | FP8 | 2 | PILE | H100 SXM5-80GB |
| NeMo | 1.19.0 | GPT3 175B | 173,000 tokens/sec | 32 | 256x H100 | DGX H100 | 23.05-py3 | 2048 | 4 | 8 | FP8 | 2 | PILE | H100 SXM5-80GB |
| NeMo | 1.19.0 | GPT3 175B | 342,000 tokens/sec | 64 | 512x H100 | DGX H100 | 23.05-py3 | 2048 | 4 | 8 | FP8 | 2 | PILE | H100 SXM5-80GB |
TP: Tensor Parallelism
PP: Pipeline Parallelism
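One way to gauge how well these runs scale is to normalize throughput per GPU and compare each configuration against the smallest one. The sketch below applies that to the GPT3 175B rows above; per-GPU throughput stays near 675 tokens/sec from 128 to 512 GPUs, i.e. roughly linear scaling.

```python
# Hedged helper for reading the multi-node table: per-GPU throughput and
# scaling efficiency relative to the smallest configuration (16 nodes / 128 GPUs).
def scaling_efficiency(runs):
    """`runs` is a list of (gpu_count, tokens_per_sec); the first entry is the baseline."""
    base_gpus, base_tps = runs[0]
    per_gpu_base = base_tps / base_gpus
    for gpus, tps in runs:
        per_gpu = tps / gpus
        print(f"{gpus:>5} GPUs: {per_gpu:7.1f} tokens/sec/GPU, "
              f"{100 * per_gpu / per_gpu_base:5.1f}% of baseline")

# GPT3 175B rows from the table above
scaling_efficiency([(128, 86_400), (256, 173_000), (512, 342_000)])
```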
Converged Training Performance on NVIDIA Data Center GPUs
H100 Training Performance
| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.1.0a0 | Tacotron2 | 61 | 0.56 Training Loss | 494,202 total output mels/sec | 8x H100 | DGX H100 | 23.07-py3 | Mixed | 128 | LJSpeech 1.1 | H100 SXM5-80GB |
| PyTorch | 2.1.0a0 | WaveGlow | 119 | -5.7 Training Loss | 3,631,002 output samples/sec | 8x H100 | DGX H100 | 23.08-py3 | Mixed | 10 | LJSpeech 1.1 | H100 SXM5-80GB |
| PyTorch | 2.1.0a0 | GNMT v2 | 11 | 24.36 BLEU Score | 1,679,675 total tokens/sec | 8x H100 | DGX H100 | 23.08-py3 | Mixed | 128 | wmt16-en-de | H100 SXM5-80GB |
| PyTorch | 2.1.0a0 | Transformer | 107 | 27.77 BLEU Score | 945,566 tokens/sec | 8x H100 | DGX H100 | 23.08-py3 | Mixed | 10240 | wmt14-en-de | H100 SXM5-80GB |
| PyTorch | 2.1.0a0 | EfficientNet-B4 | 1,622 | 81.82 Top 1 | 5,375 images/sec | 8x H100 | DGX H100 | 23.08-py3 | Mixed | 128 | ImageNet2012 | H100 SXM5-80GB |
| PyTorch | 2.1.0a0 | EfficientDet-D0 | 327 | 0.33 BBOX mAP | 2,670 images/sec | 8x H100 | DGX H100 | 23.08-py3 | Mixed | 150 | COCO 2017 | H100 SXM5-80GB |
| PyTorch | 2.1.0a0 | EfficientNet-WideSE-B4 | 1,639 | 82.0 Top 1 | 5,323 images/sec | 8x H100 | DGX H100 | 23.08-py3 | Mixed | 128 | ImageNet2012 | H100 SXM5-80GB |
| PyTorch | 2.1.0a0 | HiFiGAN | 1,035 | 9.37 Training Loss | 105,894 total output mels/sec | 8x H100 | DGX H100 | 23.07-py3 | Mixed | 16 | LJSpeech-1.1 | H100-SXM5-80GB |
| TensorFlow | 2.12.0 | U-Net Medical | 1 | 0.89 DICE Score | 2,238 images/sec | 8x H100 | DGX H100 | 23.07-py3 | Mixed | 8 | EM segmentation challenge | H100 SXM5-80GB |
| TensorFlow | 2.13.0 | Electra Fine Tuning | 2 | 92.57 F1 | 5,367 sequences/sec | 8x H100 | DGX H100 | 23.08-py3 | Mixed | 32 | SQuAD v1.1 | H100 SXM5-80GB |
| TensorFlow | 2.12.0 | Wide and Deep | 4 | 0.66 MAP at 12 | 12,780,170 samples/sec | 8x H100 | DGX H100 | 23.07-py3 | Mixed | 16384 | Kaggle Outbrain Click Prediction | H100 SXM5-80GB |
| NeMo | - | ViT g/14 | 2,137 | 70.2 Top 1 | 1,610 images/sec | 8x H100 | DGX H100 | 23.06-py3 | BF16 | 8192 | ImageNet2012 | H100-SXM5-80GB |
| NeMo | - | ViT H/14 | 5,450 | 75.0 Top 1 | 2,359 images/sec | 8x H100 | DGX H100 | 23.06-py3 | BF16 | 8192 | ImageNet2012 | H100-SXM5-80GB |
A40 Training Performance
| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.0.0a0 | NCF | 1 | 0.96 Hit Rate at 10 | 49,380,246 samples/sec | 8x A40 | GIGABYTE G482-Z52-00 | 23.05-py3 | Mixed | 131072 | MovieLens 20M | A40 |
| PyTorch | 2.0.0a0 | Tacotron2 | 112 | 0.56 Training Loss | 271,434 total output mels/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.03-py3 | Mixed | 128 | LJSpeech 1.1 | A40 |
| PyTorch | 2.1.0a0 | WaveGlow | 428 | -5.94 Training Loss | 986,709 output samples/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.07-py3 | Mixed | 10 | LJSpeech 1.1 | A40 |
| PyTorch | 2.1.0a0 | GNMT v2 | 46 | 24.3 BLEU Score | 329,065 total tokens/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.07-py3 | Mixed | 128 | wmt16-en-de | A40 |
| PyTorch | 2.1.0a0 | FastPitch | 141 | 0.17 Training Loss | 617,166 frames/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.07-py3 | Mixed | 32 | LJSpeech 1.1 | A40 |
| PyTorch | 2.1.0a0 | EfficientNet-B4 | 6,390 | 81.93 Top 1 | 1,360 images/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.07-py3 | Mixed | 64 | ImageNet2012 | A40 |
| PyTorch | 2.1.0a0 | EfficientNet-WideSE-B0 | 855 | 77.08 Top 1 | 10,467 images/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.07-py3 | Mixed | 256 | ImageNet2012 | A40 |
| PyTorch | 2.1.0a0 | EfficientDet-D0 | 708 | 0.34 BBOX mAP | 1,112 images/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.07-py3 | Mixed | 60 | COCO 2017 | A40 |
| TensorFlow | 2.12.0 | SIM | 1 | 0.79 AUC | 2,450,911 samples/sec | 8x A40 | GIGABYTE G482-Z52-00 | 23.07-py3 | Mixed | 16384 | Amazon Reviews | A40 |
A30 Training Performance
| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.1.0a0 | Tacotron2 | 131 | 0.53 Training Loss | 236,272 total output mels/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 104 | LJSpeech 1.1 | A30 |
| PyTorch | 2.1.0a0 | WaveGlow | 411 | -5.73 Training Loss | 1,030,817 output samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 10 | LJSpeech 1.1 | A30 |
| PyTorch | 2.1.0a0 | GNMT v2 | 49 | 24.14 BLEU Score | 310,790 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 128 | wmt16-en-de | A30 |
| PyTorch | 2.0.0 | NCF | 1 | 0.96 Hit Rate at 10 | 51,737,678 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.05-py3 | Mixed | 131072 | MovieLens 20M | A30 |
| PyTorch | 2.1.0a0 | FastPitch | 159 | 0.17 Training Loss | 531,205 frames/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 16 | LJSpeech 1.1 | A30 |
| PyTorch | 2.1.0a0 | EfficientDet-D0 | 933 | 0.34 BBOX mAP | 796 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 30 | COCO 2017 | A30 |
| PyTorch | 2.1.0a0 | EfficientNet-WideSE-B0 | 823 | 77.25 Top 1 | 10,782 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 128 | ImageNet2012 | A30 |
| TensorFlow | 2.13.0 | U-Net Medical | 4 | 0.89 DICE Score | 473 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 8 | EM segmentation challenge | A30 |
| TensorFlow | 2.13.0 | Electra Fine Tuning | 5 | 92.8 F1 | 1,019 sequences/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 16 | SQuAD v1.1 | A30 |
| TensorFlow | 2.13.0 | SIM | 1 | 0.81 AUC | 2,573,991 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 16384 | Amazon Reviews | A30 |
A10 Training Performance
| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.1.0a0 | Tacotron2 | 142 | 0.53 Training Loss | 217,183 total output mels/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 104 | LJSpeech 1.1 | A10 |
| PyTorch | 2.1.0a0 | GNMT v2 | 54 | 24.17 BLEU Score | 281,565 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 128 | wmt16-en-de | A10 |
| PyTorch | 2.0.0 | NCF | 1 | 0.96 Hit Rate at 10 | 42,446,575 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.05-py3 | Mixed | 131072 | MovieLens 20M | A10 |
| PyTorch | 2.1.0a0 | FastPitch | 189 | 0.17 Training Loss | 439,357 frames/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 16 | LJSpeech 1.1 | A10 |
| PyTorch | 2.1.0a0 | EfficientDet-D0 | 964 | 0.34 BBOX mAP | 720 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 30 | COCO 2017 | A10 |
| PyTorch | 2.1.0a0 | EfficientNet-WideSE-B0 | 1,080 | 77.19 Top 1 | 8,351 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 128 | ImageNet2012 | A10 |
| TensorFlow | 2.13.0 | U-Net Medical | 3 | 0.89 DICE Score | 364 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 8 | EM segmentation challenge | A10 |
| TensorFlow | 2.13.0 | Electra Fine Tuning | 5 | 92.78 F1 | 790 sequences/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 16 | SQuAD v1.1 | A10 |
| TensorFlow | 2.13.0 | SIM | 1 | 0.79 AUC | 2,313,420 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 16384 | Amazon Reviews | A10 |
Converged Training Performance of NVIDIA Data Center GPUs on Cloud
A100 Training Performance
| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | - | ResNet-50 v1.5 | 78 | 77.18 Top 1 | 25,491 images/sec | 8x A100 | Azure Standard_ND96amsr_A100_v4 | 23.06-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-80GB |
| MXNet | - | ResNet-50 v1.5 | 85 | 77.06 Top 1 | 23,159 images/sec | 8x A100 | GCP A2-HIGHGPU-8G | 23.08-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-40GB |
AI Inference
Real-world inference demands high throughput and low latency with maximum efficiency across use cases. An industry-leading solution lets customers quickly deploy AI models into real-world production with the highest performance from data center to edge.
AI Pipeline
NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.