Training to Convergence

Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. Measuring the time needed to reach that accuracy is the most reliable way to test whether an AI system is ready to be deployed in the field and deliver meaningful results.
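As a rough illustration of what "time to train" means, the sketch below runs training epochs until an evaluation metric reaches a target and reports the elapsed wall-clock minutes. It is a simplified stand-in, not the MLPerf harness: the callables `train_one_epoch` and `evaluate`, the `max_epochs` cap, and the function name are all hypothetical, and the official benchmark rules define more precisely what is included in the timed region.

```python
import time
from typing import Callable, Tuple


def time_to_train(
    train_one_epoch: Callable[[], None],
    evaluate: Callable[[], float],
    target: float,
    max_epochs: int = 100,
    higher_is_better: bool = True,
) -> Tuple[float, int, float]:
    """Train until the quality target is met.

    Returns (elapsed_minutes, epochs_run, final_quality).
    All names here are illustrative assumptions, not a real benchmark API.
    """
    start = time.perf_counter()
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()          # one pass over the training data
        quality = evaluate()       # e.g. top-1 accuracy, or word error rate
        if higher_is_better:
            reached = quality >= target
        else:
            reached = quality <= target   # for error-style metrics
        if reached:
            minutes = (time.perf_counter() - start) / 60.0
            return minutes, epoch, quality
    raise RuntimeError(f"target {target} not reached in {max_epochs} epochs")
```

For an error metric such as the RNN-T word error rate, the same function would be called with `higher_is_better=False`, so the run stops once the error falls to or below the target.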


NVIDIA Performance on MLPerf 3.0 Training Benchmarks



NVIDIA Performance on MLPerf 3.0’s AI Benchmarks: Single Node, Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| MXNet | ResNet-50 v1.5 | 13.466 | 75.90% classification | 8x H100 | XE9680x8H100-SXM-80GB | 3.0-2053 | Mixed | ImageNet | H100-SXM5-80GB |
| MXNet | 3D U-Net | 11.796 | 0.908 Mean DICE score | 8x H100 | G593-SD0 | 3.0-2054 | Mixed | KiTS19 | H100-SXM5-80GB |
| PyTorch | BERT | 5.349 | 0.72 Mask-LM accuracy | 8x H100 | G593-SD0 | 3.0-2055 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB |
| PyTorch | Mask R-CNN | 19.180 | 0.377 Box min AP and 0.339 Mask min AP | 8x H100 | DGX H100 | 3.0-2064 | Mixed | COCO | H100-SXM5-80GB |
| PyTorch | RNN-T | 16.686 | 0.058 Word Error Rate | 8x H100 | DGX H100 | 3.0-2064 | Mixed | LibriSpeech | H100-SXM5-80GB |
| PyTorch | RetinaNet | 37.009 | 34.0% mAP | 8x H100 | DGX H100 | 3.0-2064 | Mixed | A subset of OpenImages | H100-SXM5-80GB |
| NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 4.184 | 0.80275 AUC | 8x H100 | DGX H100 | 3.0-2063 | Mixed | Criteo 4TB | H100-SXM5-80GB |

NVIDIA Performance on MLPerf 3.0’s AI Benchmarks: Multi Node, Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| MXNet | ResNet-50 v1.5 | 2.664 | 75.90% classification | 64x H100 | DGX H100 | 3.0-2071 | Mixed | ImageNet | H100-SXM5-80GB |
| MXNet | ResNet-50 v1.5 | 0.509 | 75.90% classification | 512x H100 | DGX H100 | 3.0-2068 | Mixed | ImageNet | H100-SXM5-80GB |
| MXNet | ResNet-50 v1.5 | 0.369 | 75.90% classification | 768x H100 | DGX H100 | 3.0-2075 | Mixed | ImageNet | H100-SXM5-80GB |
| MXNet | ResNet-50 v1.5 | 0.183 | 75.90% classification | 3,584x H100 | coreweave_hgxh100_n448 | 3.0-2002 | Mixed | ImageNet | H100-SXM5-80GB |
| MXNet | 3D U-Net | 1.853 | 0.908 Mean DICE score | 72x H100 | DGX H100 | 3.0-2074 | Mixed | KiTS19 | H100-SXM5-80GB |
| MXNet | 3D U-Net | 0.818 | 0.908 Mean DICE score | 432x H100 | DGX H100 | 3.0-2067 | Mixed | KiTS19 | H100-SXM5-80GB |
| PyTorch | BERT | 0.898 | 0.72 Mask-LM accuracy | 64x H100 | DGX H100 | 3.0-2073 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB |
| PyTorch | BERT | 0.344 | 0.72 Mask-LM accuracy | 512x H100 | DGX H100 | 3.0-2070 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB |
| PyTorch | BERT | 0.253 | 0.72 Mask-LM accuracy | 768x H100 | DGX H100 | 3.0-2077 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB |
| PyTorch | BERT | 0.134 | 0.72 Mask-LM accuracy | 3,072x H100 | coreweave_hgxh100_n384 | 3.0-2001 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB |
| PyTorch | Mask R-CNN | 4.264 | 0.377 Box min AP and 0.339 Mask min AP | 64x H100 | DGX H100 | 3.0-2073 | Mixed | COCO | H100-SXM5-80GB |
| PyTorch | Mask R-CNN | 1.466 | 0.377 Box min AP and 0.339 Mask min AP | 384x H100 | DGX H100 | 3.0-2066 | Mixed | COCO | H100-SXM5-80GB |
| PyTorch | RNN-T | 4.231 | 0.058 Word Error Rate | 64x H100 | DGX H100 | 3.0-2073 | Mixed | LibriSpeech | H100-SXM5-80GB |
| PyTorch | RNN-T | 1.649 | 0.058 Word Error Rate | 512x H100 | DGX H100 | 3.0-2070 | Mixed | LibriSpeech | H100-SXM5-80GB |
| PyTorch | RetinaNet | 6.511 | 34.0% mAP | 64x H100 | DGX H100 | 3.0-2073 | Mixed | A subset of OpenImages | H100-SXM5-80GB |
| PyTorch | RetinaNet | 1.906 | 34.0% mAP | 512x H100 | DGX H100 | 3.0-2070 | Mixed | A subset of OpenImages | H100-SXM5-80GB |
| PyTorch | RetinaNet | 1.511 | 34.0% mAP | 768x H100 | DGX H100 | 3.0-2077 | Mixed | A subset of OpenImages | H100-SXM5-80GB |
| NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 1.760 | 0.80275 AUC | 64x H100 | DGX H100 | 3.0-2072 | Mixed | Criteo 4TB | H100-SXM5-80GB |
| NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 1.613 | 0.80275 AUC | 128x H100 | DGX H100 | 3.0-2065 | Mixed | Criteo 4TB | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 64.264 | 2.69 log perplexity | 512x H100 | DGX H100 | 3.0-2069 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 44.816 | 2.69 log perplexity | 768x H100 | DGX H100 | 3.0-2076 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 23.611 | 2.69 log perplexity | 1,536x H100 | coreweave_hgxh100_n192 | 3.0-2000 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 10.940 | 2.69 log perplexity | 3,584x H100 | coreweave_hgxh100_n448 | 3.0-2003 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |

MLPerf™ v3.0 Training Closed: 3.0-2000, 3.0-2001, 3.0-2002, 3.0-2003, 3.0-2053, 3.0-2054, 3.0-2055, 3.0-2063, 3.0-2064, 3.0-2065, 3.0-2066, 3.0-2067, 3.0-2068, 3.0-2069, 3.0-2070, 3.0-2071, 3.0-2072, 3.0-2073, 3.0-2074, 3.0-2075, 3.0-2076, 3.0-2077 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
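One way to read the single-node and multi-node results together is to compute speedup and scaling efficiency relative to the smallest configuration. The sketch below does this for the ResNet-50 v1.5 times listed above. The helper name `scaling_efficiency` is hypothetical, and the comparison is only indicative: the 8-GPU result comes from a different server and submission than the DGX H100 and CoreWeave runs, and batch sizes and hyperparameters may differ between scales.

```python
from typing import Dict, Tuple


def scaling_efficiency(
    base_gpus: int, base_minutes: float, gpus: int, minutes: float
) -> Tuple[float, float]:
    """Return (speedup, efficiency) relative to a baseline run.

    speedup    = baseline time / time at the larger scale
    efficiency = speedup / (GPU count ratio); 1.0 would be perfect linear scaling
    """
    speedup = base_minutes / minutes
    ideal = gpus / base_gpus
    return speedup, speedup / ideal


# ResNet-50 v1.5 time to train (minutes) by GPU count, copied from the tables above.
resnet50_minutes: Dict[int, float] = {
    8: 13.466,
    64: 2.664,
    512: 0.509,
    768: 0.369,
    3584: 0.183,
}

if __name__ == "__main__":
    base_gpus = 8
    base_minutes = resnet50_minutes[base_gpus]
    for gpus, minutes in resnet50_minutes.items():
        speedup, eff = scaling_efficiency(base_gpus, base_minutes, gpus, minutes)
        print(f"{gpus:>5} GPUs: speedup {speedup:6.2f}x, efficiency {eff:5.1%}")
```

Efficiency below 1.0 at large scale is expected: fixed costs such as initialization, evaluation, and communication take up a growing share of a run that lasts well under a minute.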


NVIDIA Performance on MLPerf 2.0’s Training HPC Benchmarks: Strong Scaling, Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| PyTorch | CosmoFlow | 3.79 | Mean average error 0.124 | 512x A100 | DGX A100 | 2.0-8006 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | A100-SXM4-80GB |
| PyTorch | DeepCAM | 1.57 | IOU 0.82 | 2,048x A100 | DGX A100 | 2.0-8005 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | A100-SXM4-80GB |
| PyTorch | OpenCatalyst | 21.93 | Forces mean absolute error 0.036 | 512x A100 | DGX A100 | 2.0-8006 | Mixed | Open Catalyst 2020 (OC20) S2EF 2M training split, ID validation set | A100-SXM4-80GB |

NVIDIA Performance on MLPerf 2.0’s Training HPC Benchmarks: Weak Scaling, Closed Division

| Framework | Network | Throughput | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| PyTorch | CosmoFlow | 4.21 models/min | Mean average error 0.124 | 4,096x A100 | DGX A100 | 2.0-8014 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | A100-SXM4-80GB |
| PyTorch | DeepCAM | 6.40 models/min | IOU 0.82 | 4,096x A100 | DGX A100 | 2.0-8014 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | A100-SXM4-80GB |
| PyTorch | OpenCatalyst | 0.66 models/min | Forces mean absolute error 0.036 | 4,096x A100 | DGX A100 | 2.0-8014 | Mixed | Open Catalyst 2020 (OC20) S2EF 2M training split, ID validation set | A100-SXM4-80GB |

MLPerf™ v2.0 Training HPC Closed: 2.0-8005, 2.0-8006, 2.0-8014 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.

Converged Training Performance on NVIDIA Data Center GPUs


H100 Training Performance



| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.1.0a0 | Tacotron2 | 61 | 0.56 Training Loss | 494,202 total output mels/sec | 8x H100 | DGX H100 | 23.07-py3 | Mixed | 128 | LJSpeech 1.1 | H100 SXM5-80GB |
| PyTorch | 2.1.0a0 | WaveGlow | 119 | -5.7 Training Loss | 3,631,002 output samples/sec | 8x H100 | DGX H100 | 23.08-py3 | Mixed | 10 | LJSpeech 1.1 | H100 SXM5-80GB |
| PyTorch | 2.1.0a0 | GNMT v2 | 11 | 24.36 BLEU Score | 1,679,675 total tokens/sec | 8x H100 | DGX H100 | 23.08-py3 | Mixed | 128 | wmt16-en-de | H100 SXM5-80GB |
| PyTorch | 2.1.0a0 | Transformer | 107 | 27.77 BLEU Score | 945,566 Tokens per Second | 8x H100 | DGX H100 | 23.08-py3 | Mixed | 10240 | wmt14-en-de | H100 SXM5-80GB |
| PyTorch | 2.1.0a0 | EfficientNet-B4 | 1,622 | 81.82 Top 1 | 5,375 images/sec | 8x H100 | DGX H100 | 23.08-py3 | Mixed | 128 | Imagenet2012 | H100 SXM5-80GB |
| PyTorch | 2.1.0a0 | EfficientDet-D0 | 327 | 0.33 BBOX mAP | 2,670 images/sec | 8x H100 | DGX H100 | 23.08-py3 | Mixed | 150 | COCO 2017 | H100 SXM5-80GB |
| PyTorch | 2.1.0a0 | EfficientNet-WideSE-B4 | 1,639 | 82 Top 1 | 5,323 images/sec | 8x H100 | DGX H100 | 23.08-py3 | Mixed | 128 | Imagenet2012 | H100 SXM5-80GB |
| PyTorch | 2.1.0a0 | HiFiGAN | 1,035 | 9.37 Training Loss | 105,894 total output mels/sec | 8x H100 | DGX H100 | 23.07-py3 | Mixed | 16 | LJSpeech-1.1 | H100-SXM5-80GB |
| Tensorflow | 2.12.0 | U-Net Medical | 1 | 0.89 DICE Score | 2,238 images/sec | 8x H100 | DGX H100 | 23.07-py3 | Mixed | 8 | EM segmentation challenge | H100 SXM5-80GB |
| Tensorflow | 2.13.0 | Electra Fine Tuning | 2 | 92.57 F1 | 5,367 sequences/sec | 8x H100 | DGX H100 | 23.08-py3 | Mixed | 32 | SQuaD v1.1 | H100 SXM5-80GB |
| Tensorflow | 2.12.0 | Wide and Deep | 4 | 0.66 MAP at 12 | 12,780,170 samples/sec | 8x H100 | DGX H100 | 23.07-py3 | Mixed | 16384 | Kaggle Outbrain Click Prediction | H100 SXM5-80GB |
| Nemo | - | ViT g/14 | 2,137 | 70.2 Top 1 | 1,610 images/sec | 8x H100 | DGX H100 | 23.06-py3 | BF16 | 8192 | ImageNet2012 | H100-SXM5-80GB |
| Nemo | - | ViT H/14 | 5,450 | 75 Top 1 | 2,359 images/sec | 8x H100 | DGX H100 | 23.06-py3 | BF16 | 8192 | ImageNet2012 | H100-SXM5-80GB |

A40 Training Performance


| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.0.0a0 | NCF | 1 | 0.96 Hit Rate at 10 | 49,380,246 samples/sec | 8x A40 | GIGABYTE G482-Z52-00 | 23.05-py3 | Mixed | 131072 | MovieLens 20M | A40 |
| PyTorch | 2.0.0a0 | Tacotron2 | 112 | 0.56 Training Loss | 271,434 total output mels/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.03-py3 | Mixed | 128 | LJSpeech 1.1 | A40 |
| PyTorch | 2.1.0a0 | WaveGlow | 428 | -5.94 Training Loss | 986,709 output samples/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.07-py3 | Mixed | 10 | LJSpeech 1.1 | A40 |
| PyTorch | 2.1.0a0 | GNMT v2 | 46 | 24.3 BLEU Score | 329,065 total tokens/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.07-py3 | Mixed | 128 | wmt16-en-de | A40 |
| PyTorch | 2.1.0a0 | FastPitch | 141 | 0.17 Training Loss | 617,166 frames/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.07-py3 | Mixed | 32 | LJSpeech 1.1 | A40 |
| PyTorch | 2.1.0a0 | EfficientNet-B4 | 6,390 | 81.93 Top 1 | 1,360 images/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.07-py3 | Mixed | 64 | Imagenet2012 | A40 |
| PyTorch | 2.1.0a0 | EfficientNet-WideSE-B0 | 855 | 77.08 Top 1 | 10,467 images/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.07-py3 | Mixed | 256 | Imagenet2012 | A40 |
| PyTorch | 2.1.0a0 | EfficientDet-D0 | 708 | 0.34 BBOX mAP | 1,112 images/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.07-py3 | Mixed | 60 | COCO 2017 | A40 |
| Tensorflow | 2.12.0 | SIM | 1 | 0.79 AUC | 2,450,911 samples/sec | 8x A40 | GIGABYTE G482-Z52-00 | 23.07-py3 | Mixed | 16384 | Amazon Reviews | A40 |

A30 Training Performance


| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.1.0a0 | Tacotron2 | 131 | 0.53 Training Loss | 236,272 total output mels/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 104 | LJSpeech 1.1 | A30 |
| PyTorch | 2.1.0a0 | WaveGlow | 411 | -5.73 Training Loss | 1,030,817 output samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 10 | LJSpeech 1.1 | A30 |
| PyTorch | 2.1.0a0 | GNMT v2 | 49 | 24.14 BLEU Score | 310,790 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 128 | wmt16-en-de | A30 |
| PyTorch | 2.0.0 | NCF | 1 | 0.96 Hit Rate at 10 | 51,737,678 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.05-py3 | Mixed | 131072 | MovieLens 20M | A30 |
| PyTorch | 2.1.0a0 | FastPitch | 159 | 0.17 Training Loss | 531,205 frames/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 16 | LJSpeech 1.1 | A30 |
| PyTorch | 2.1.0a0 | EfficientDet-D0 | 933 | 0.34 BBOX mAP | 796 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 30 | COCO 2017 | A30 |
| PyTorch | 2.1.0a0 | EfficientNet-WideSE-B0 | 823 | 77.25 Top 1 | 10,782 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 128 | Imagenet2012 | A30 |
| Tensorflow | 2.13.0 | U-Net Medical | 4 | 0.89 DICE Score | 473 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 8 | EM segmentation challenge | A30 |
| Tensorflow | 2.13.0 | Electra Fine Tuning | 5 | 92.8 F1 | 1,019 sequences/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 16 | SQuaD v1.1 | A30 |
| Tensorflow | 2.13.0 | SIM | 1 | 0.81 AUC | 2,573,991 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 16384 | Amazon Reviews | A30 |

A10 Training Performance


| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.1.0a0 | Tacotron2 | 142 | 0.53 Training Loss | 217,183 total output mels/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 104 | LJSpeech 1.1 | A10 |
| PyTorch | 2.1.0a0 | GNMT v2 | 54 | 24.17 BLEU Score | 281,565 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 128 | wmt16-en-de | A10 |
| PyTorch | 2.0.0 | NCF | 1 | 0.96 Hit Rate at 10 | 42,446,575 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.05-py3 | Mixed | 131072 | MovieLens 20M | A10 |
| PyTorch | 2.1.0a0 | FastPitch | 189 | 0.17 Training Loss | 439,357 frames/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 16 | LJSpeech 1.1 | A10 |
| PyTorch | 2.1.0a0 | EfficientDet-D0 | 964 | 0.34 BBOX mAP | 720 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 30 | COCO 2017 | A10 |
| PyTorch | 2.1.0a0 | EfficientNet-WideSE-B0 | 1,080 | 77.19 Top 1 | 8,351 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 128 | Imagenet2012 | A10 |
| Tensorflow | 2.13.0 | U-Net Medical | 3 | 0.89 DICE Score | 364 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 8 | EM segmentation challenge | A10 |
| Tensorflow | 2.13.0 | Electra Fine Tuning | 5 | 92.78 F1 | 790 sequences/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 16 | SQuaD v1.1 | A10 |
| Tensorflow | 2.13.0 | SIM | 1 | 0.79 AUC | 2,313,420 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 16384 | Amazon Reviews | A10 |


Converged Training Performance of NVIDIA Data Center GPUs on Cloud

A100 Training Performance


| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | - | ResNet-50 v1.5 | 78 | 77.18 Top 1 | 25,491 images/sec | 8x A100 | Azure Standard_ND96amsr_A100_v4 | 23.06-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-80GB |
| MXNet | - | ResNet-50 v1.5 | 85 | 77.06 Top 1 | 23,159 images/sec | 8x A100 | GCP A2-HIGHGPU-8G | 23.08-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-40GB |

Converged Multi-Node Training Performance of NVIDIA Data Center GPUs

A100 Multi-Node Training Performance


| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | Number of Nodes | Number of GPUs | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P1 | 1606 | 1.32 Final Loss | 4,754.06 sequences/sec | 1 | 8 | Selene | 23.01-py3 | Mixed | 256 | SQuaD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P2 | 494 | 1.28 Final Loss | 1,780.29 sequences/sec | 1 | 8 | Selene | 23.01-py3 | Mixed | 32 | SQuaD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training E2E | 1235 | 1.28 Final Loss | - | 1 | 8 | Selene | 23.01-py3 | Mixed | 32 | SQuaD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P1 | 804 | 1.37 Final Loss | 9,457.6 sequences/sec | 2 | 16 | Selene | 23.01-py3 | Mixed | 256 | SQuaD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P2 | 245 | 1.27 Final Loss | 3,479.51 sequences/sec | 2 | 16 | Selene | 23.01-py3 | Mixed | 32 | SQuaD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training E2E | 618 | 1.27 Final Loss | - | 2 | 16 | Selene | 23.01-py3 | Mixed | 32 | SQuaD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P1 | 209 | 1.42 Final Loss | 37,052.87 sequences/sec | 8 | 64 | Selene | 23.01-py3 | Mixed | 256 | SQuaD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P2 | 64 | 1.28 Final Loss | 13,805.48 sequences/sec | 8 | 64 | Selene | 23.01-py3 | Mixed | 32 | SQuaD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training E2E | 160 | 1.28 Final Loss | - | 8 | 64 | Selene | 23.01-py3 | Mixed | 32 | SQuaD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P1 | 119 | 1.41 Final Loss | 71,775.3 sequences/sec | 16 | 128 | Selene | 23.01-py3 | Mixed | 256 | SQuaD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P2 | 35 | 1.28 Final Loss | 25,005.24 sequences/sec | 16 | 128 | Selene | 23.01-py3 | Mixed | 32 | SQuaD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training E2E | 91 | 1.28 Final Loss | - | 16 | 128 | Selene | 23.01-py3 | Mixed | 32 | SQuaD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P1 | 56 | 1.41 Final Loss | 138,802.3 sequences/sec | 32 | 256 | Selene | 23.01-py3 | Mixed | 256 | SQuaD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P2 | 18 | 1.27 Final Loss | 51,148.79 sequences/sec | 32 | 256 | Selene | 23.01-py3 | Mixed | 32 | SQuaD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training E2E | 44 | 1.27 Final Loss | - | 32 | 256 | Selene | 23.01-py3 | Mixed | 32 | SQuaD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P1 | 32 | 1.4 Final Loss | 245,286.91 sequences/sec | 64 | 512 | Selene | 23.01-py3 | Mixed | 128 | SQuaD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P2 | 11 | 1.26 Final Loss | 89,069.47 sequences/sec | 64 | 512 | Selene | 23.01-py3 | Mixed | 32 | SQuaD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training E2E | 25 | 1.26 Final Loss | - | 64 | 512 | Selene | 23.01-py3 | Mixed | 32 | SQuaD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P1 | 23 | 1.4 Final Loss | 343,139.17 sequences/sec | 128 | 1024 | Selene | 23.01-py3 | Mixed | 64 | SQuaD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P2 | 8 | 1.24 Final Loss | 143,583.49 sequences/sec | 128 | 1024 | Selene | 23.01-py3 | Mixed | 32 | SQuaD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training E2E | 18 | 1.24 Final Loss | - | 128 | 1024 | Selene | 23.01-py3 | Mixed | 32 | SQuaD v1.1 | A100-SXM4-80GB |

BERT-Large Pre-Training Phase 1 Sequence Length = 128
BERT-Large Pre-Training Phase 2 Sequence Length = 512
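Because the two phases use different sequence lengths, their sequences/sec figures are not directly comparable. A rough way to put them on a common footing is to convert to tokens per second, as in the sketch below. This assumes every sequence is processed at the full phase length (128 or 512 tokens), which is an assumption of this example rather than something the tables state; the helper name `tokens_per_sec` is hypothetical.

```python
# Sequence lengths per phase, as stated in the notes above.
SEQ_LEN = {"P1": 128, "P2": 512}


def tokens_per_sec(sequences_per_sec: float, seq_len: int) -> float:
    """Convert sequence throughput to token throughput.

    Assumes each sequence occupies the full seq_len tokens (an assumption
    made for this illustration).
    """
    return sequences_per_sec * seq_len


# Single-node (8x A100) throughput taken from the multi-node table above.
single_node = {"P1": 4754.06, "P2": 1780.29}

if __name__ == "__main__":
    for phase, seq_per_sec in single_node.items():
        tps = tokens_per_sec(seq_per_sec, SEQ_LEN[phase])
        print(f"{phase}: {seq_per_sec:,.2f} sequences/sec = {tps:,.0f} tokens/sec")
```

Under that assumption, Phase 2 processes fewer sequences per second than Phase 1 but more tokens per second, since each sequence is four times longer.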



