Training to Convergence
Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. This is the best methodology to test whether AI systems are ready to be deployed in the field to deliver meaningful results.
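As a rough illustration of what "training to convergence at a specified accuracy" means in practice, the sketch below runs epochs until an evaluation metric reaches a target and reports the elapsed time. It is a minimal sketch, not the harness used to produce the numbers on this page: `train_to_convergence`, `ToyModel`, and the 0.9 target are hypothetical names and values chosen for illustration, and the toy model simply halves its remaining error each epoch.

```python
import time
from typing import Callable, Optional, Tuple


def train_to_convergence(
    train_one_epoch: Callable[[], None],
    evaluate: Callable[[], float],
    target_accuracy: float,
    max_epochs: int = 100,
) -> Tuple[Optional[int], float, float]:
    """Train until evaluate() reaches target_accuracy or max_epochs is hit.

    Returns (epochs_used, final_accuracy, minutes_elapsed); epochs_used is
    None when the target was never reached.
    """
    start = time.perf_counter()
    accuracy = float("-inf")
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        accuracy = evaluate()
        if accuracy >= target_accuracy:
            # Converged: stop the clock as soon as the target is met.
            return epoch, accuracy, (time.perf_counter() - start) / 60.0
    return None, accuracy, (time.perf_counter() - start) / 60.0


class ToyModel:
    """Hypothetical stand-in for a real network (assumption for this sketch)."""

    def __init__(self) -> None:
        self.accuracy = 0.0

    def train_one_epoch(self) -> None:
        # Each epoch closes half of the remaining gap to perfect accuracy.
        self.accuracy += 0.5 * (1.0 - self.accuracy)

    def evaluate(self) -> float:
        return self.accuracy


if __name__ == "__main__":
    model = ToyModel()
    epochs, accuracy, minutes = train_to_convergence(
        model.train_one_epoch, model.evaluate, target_accuracy=0.9
    )
    print(f"epochs={epochs} accuracy={accuracy:.4f} minutes={minutes:.6f}")
```

The comparison assumes a higher-is-better metric such as Top 1 or BLEU; for a lower-is-better metric such as training loss, the stopping test would be reversed.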
NVIDIA Performance on MLPerf 3.0 Training Benchmarks
NVIDIA Performance on MLPerf 3.0’s AI Benchmarks: Single Node, Closed Division
NVIDIA Performance on MLPerf 3.0’s AI Benchmarks: Multi Node, Closed Division
MLPerf™ v3.0 Training Closed: 3.0-2000, 3.0-2001, 3.0-2002, 3.0-2003, 3.0-2053, 3.0-2054, 3.0-2055, 3.0-2063, 3.0-2064, 3.0-2065, 3.0-2066, 3.0-2067, 3.0-2068, 3.0-2069, 3.0-2070, 3.0-2071, 3.0-2072, 3.0-2073, 3.0-2074, 3.0-2075, 3.0-2076, 3.0-2077 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
NVIDIA Performance on MLPerf 2.0’s Training HPC Benchmarks: Strong Scaling, Closed Division
NVIDIA Performance on MLPerf 2.0’s Training HPC Benchmarks: Weak Scaling, Closed Division
| Framework | Network | Throughput | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| PyTorch | CosmoFlow | 4.21 models/min | Mean average error 0.124 | 4,096x A100 | DGX A100 | 2.0-8014 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | A100-SXM4-80GB |
| PyTorch | DeepCAM | 6.40 models/min | IOU 0.82 | 4,096x A100 | DGX A100 | 2.0-8014 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | A100-SXM4-80GB |
| PyTorch | OpenCatalyst | 0.66 models/min | Forces mean absolute error 0.036 | 4,096x A100 | DGX A100 | 2.0-8014 | Mixed | Open Catalyst 2020 (OC20) S2EF 2M training split, ID validation set | A100-SXM4-80GB |
MLPerf™ v2.0 Training HPC Closed: 2.0-8005, 2.0-8006, 2.0-8014 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
Converged Training Performance on NVIDIA Data Center GPUs
H100 Training Performance
| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.1.0a0 | Tacotron2 | 61 | 0.56 Training Loss | 494,202 total output mels/sec | 8x H100 | DGX H100 | 23.07-py3 | Mixed | 128 | LJSpeech 1.1 | H100-SXM5-80GB |
| PyTorch | 2.1.0a0 | WaveGlow | 119 | -5.7 Training Loss | 3,631,002 output samples/sec | 8x H100 | DGX H100 | 23.08-py3 | Mixed | 10 | LJSpeech 1.1 | H100-SXM5-80GB |
| PyTorch | 2.1.0a0 | GNMT v2 | 11 | 24.36 BLEU Score | 1,679,675 total tokens/sec | 8x H100 | DGX H100 | 23.08-py3 | Mixed | 128 | wmt16-en-de | H100-SXM5-80GB |
| PyTorch | 2.1.0a0 | Transformer | 107 | 27.77 BLEU Score | 945,566 tokens/sec | 8x H100 | DGX H100 | 23.08-py3 | Mixed | 10240 | wmt14-en-de | H100-SXM5-80GB |
| PyTorch | 2.1.0a0 | EfficientNet-B4 | 1,622 | 81.82 Top 1 | 5,375 images/sec | 8x H100 | DGX H100 | 23.08-py3 | Mixed | 128 | ImageNet2012 | H100-SXM5-80GB |
| PyTorch | 2.1.0a0 | EfficientDet-D0 | 327 | 0.33 BBOX mAP | 2,670 images/sec | 8x H100 | DGX H100 | 23.08-py3 | Mixed | 150 | COCO 2017 | H100-SXM5-80GB |
| PyTorch | 2.1.0a0 | EfficientNet-WideSE-B4 | 1,639 | 82 Top 1 | 5,323 images/sec | 8x H100 | DGX H100 | 23.08-py3 | Mixed | 128 | ImageNet2012 | H100-SXM5-80GB |
| PyTorch | 2.1.0a0 | HiFiGAN | 1,035 | 9.37 Training Loss | 105,894 total output mels/sec | 8x H100 | DGX H100 | 23.07-py3 | Mixed | 16 | LJSpeech 1.1 | H100-SXM5-80GB |
| TensorFlow | 2.12.0 | U-Net Medical | 1 | 0.89 DICE Score | 2,238 images/sec | 8x H100 | DGX H100 | 23.07-py3 | Mixed | 8 | EM segmentation challenge | H100-SXM5-80GB |
| TensorFlow | 2.13.0 | Electra Fine Tuning | 2 | 92.57 F1 | 5,367 sequences/sec | 8x H100 | DGX H100 | 23.08-py3 | Mixed | 32 | SQuAD v1.1 | H100-SXM5-80GB |
| TensorFlow | 2.12.0 | Wide and Deep | 4 | 0.66 MAP at 12 | 12,780,170 samples/sec | 8x H100 | DGX H100 | 23.07-py3 | Mixed | 16384 | Kaggle Outbrain Click Prediction | H100-SXM5-80GB |
| NeMo | - | ViT g/14 | 2,137 | 70.2 Top 1 | 1,610 images/sec | 8x H100 | DGX H100 | 23.06-py3 | BF16 | 8192 | ImageNet2012 | H100-SXM5-80GB |
| NeMo | - | ViT H/14 | 5,450 | 75 Top 1 | 2,359 images/sec | 8x H100 | DGX H100 | 23.06-py3 | BF16 | 8192 | ImageNet2012 | H100-SXM5-80GB |
A40 Training Performance
| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.0.0a0 | NCF | 1 | 0.96 Hit Rate at 10 | 49,380,246 samples/sec | 8x A40 | GIGABYTE G482-Z52-00 | 23.05-py3 | Mixed | 131072 | MovieLens 20M | A40 |
| PyTorch | 2.0.0a0 | Tacotron2 | 112 | 0.56 Training Loss | 271,434 total output mels/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.03-py3 | Mixed | 128 | LJSpeech 1.1 | A40 |
| PyTorch | 2.1.0a0 | WaveGlow | 428 | -5.94 Training Loss | 986,709 output samples/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.07-py3 | Mixed | 10 | LJSpeech 1.1 | A40 |
| PyTorch | 2.1.0a0 | GNMT v2 | 46 | 24.3 BLEU Score | 329,065 total tokens/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.07-py3 | Mixed | 128 | wmt16-en-de | A40 |
| PyTorch | 2.1.0a0 | FastPitch | 141 | 0.17 Training Loss | 617,166 frames/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.07-py3 | Mixed | 32 | LJSpeech 1.1 | A40 |
| PyTorch | 2.1.0a0 | EfficientNet-B4 | 6,390 | 81.93 Top 1 | 1,360 images/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.07-py3 | Mixed | 64 | ImageNet2012 | A40 |
| PyTorch | 2.1.0a0 | EfficientNet-WideSE-B0 | 855 | 77.08 Top 1 | 10,467 images/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.07-py3 | Mixed | 256 | ImageNet2012 | A40 |
| PyTorch | 2.1.0a0 | EfficientDet-D0 | 708 | 0.34 BBOX mAP | 1,112 images/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.07-py3 | Mixed | 60 | COCO 2017 | A40 |
| TensorFlow | 2.12.0 | SIM | 1 | 0.79 AUC | 2,450,911 samples/sec | 8x A40 | GIGABYTE G482-Z52-00 | 23.07-py3 | Mixed | 16384 | Amazon Reviews | A40 |
A30 Training Performance
| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.1.0a0 | Tacotron2 | 131 | 0.53 Training Loss | 236,272 total output mels/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 104 | LJSpeech 1.1 | A30 |
| PyTorch | 2.1.0a0 | WaveGlow | 411 | -5.73 Training Loss | 1,030,817 output samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 10 | LJSpeech 1.1 | A30 |
| PyTorch | 2.1.0a0 | GNMT v2 | 49 | 24.14 BLEU Score | 310,790 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 128 | wmt16-en-de | A30 |
| PyTorch | 2.0.0 | NCF | 1 | 0.96 Hit Rate at 10 | 51,737,678 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.05-py3 | Mixed | 131072 | MovieLens 20M | A30 |
| PyTorch | 2.1.0a0 | FastPitch | 159 | 0.17 Training Loss | 531,205 frames/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 16 | LJSpeech 1.1 | A30 |
| PyTorch | 2.1.0a0 | EfficientDet-D0 | 933 | 0.34 BBOX mAP | 796 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 30 | COCO 2017 | A30 |
| PyTorch | 2.1.0a0 | EfficientNet-WideSE-B0 | 823 | 77.25 Top 1 | 10,782 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 128 | ImageNet2012 | A30 |
| TensorFlow | 2.13.0 | U-Net Medical | 4 | 0.89 DICE Score | 473 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 8 | EM segmentation challenge | A30 |
| TensorFlow | 2.13.0 | Electra Fine Tuning | 5 | 92.8 F1 | 1,019 sequences/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 16 | SQuAD v1.1 | A30 |
| TensorFlow | 2.13.0 | SIM | 1 | 0.81 AUC | 2,573,991 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 16384 | Amazon Reviews | A30 |
A10 Training Performance
| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.1.0a0 | Tacotron2 | 142 | 0.53 Training Loss | 217,183 total output mels/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 104 | LJSpeech 1.1 | A10 |
| PyTorch | 2.1.0a0 | GNMT v2 | 54 | 24.17 BLEU Score | 281,565 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 128 | wmt16-en-de | A10 |
| PyTorch | 2.0.0 | NCF | 1 | 0.96 Hit Rate at 10 | 42,446,575 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.05-py3 | Mixed | 131072 | MovieLens 20M | A10 |
| PyTorch | 2.1.0a0 | FastPitch | 189 | 0.17 Training Loss | 439,357 frames/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 16 | LJSpeech 1.1 | A10 |
| PyTorch | 2.1.0a0 | EfficientDet-D0 | 964 | 0.34 BBOX mAP | 720 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 30 | COCO 2017 | A10 |
| PyTorch | 2.1.0a0 | EfficientNet-WideSE-B0 | 1,080 | 77.19 Top 1 | 8,351 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 128 | ImageNet2012 | A10 |
| TensorFlow | 2.13.0 | U-Net Medical | 3 | 0.89 DICE Score | 364 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 8 | EM segmentation challenge | A10 |
| TensorFlow | 2.13.0 | Electra Fine Tuning | 5 | 92.78 F1 | 790 sequences/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 16 | SQuAD v1.1 | A10 |
| TensorFlow | 2.13.0 | SIM | 1 | 0.79 AUC | 2,313,420 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | 16384 | Amazon Reviews | A10 |
Converged Training Performance of NVIDIA Data Center GPUs on Cloud
A100 Training Performance
| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | - | ResNet-50 v1.5 | 78 | 77.18 Top 1 | 25,491 images/sec | 8x A100 | Azure Standard_ND96amsr_A100_v4 | 23.06-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-80GB |
| MXNet | - | ResNet-50 v1.5 | 85 | 77.06 Top 1 | 23,159 images/sec | 8x A100 | GCP A2-HIGHGPU-8G | 23.08-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-40GB |
Converged Multi-Node Training Performance of NVIDIA Data Center GPUs
A100 Multi-Node Training Performance
| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | Number of Nodes | Number of GPUs | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P1 | 1,606 | 1.32 Final Loss | 4,754.06 sequences/sec | 1 | 8 | Selene | 23.01-py3 | Mixed | 256 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P2 | 494 | 1.28 Final Loss | 1,780.29 sequences/sec | 1 | 8 | Selene | 23.01-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training E2E | 1,235 | 1.28 Final Loss | - | 1 | 8 | Selene | 23.01-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P1 | 804 | 1.37 Final Loss | 9,457.6 sequences/sec | 2 | 16 | Selene | 23.01-py3 | Mixed | 256 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P2 | 245 | 1.27 Final Loss | 3,479.51 sequences/sec | 2 | 16 | Selene | 23.01-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training E2E | 618 | 1.27 Final Loss | - | 2 | 16 | Selene | 23.01-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P1 | 209 | 1.42 Final Loss | 37,052.87 sequences/sec | 8 | 64 | Selene | 23.01-py3 | Mixed | 256 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P2 | 64 | 1.28 Final Loss | 13,805.48 sequences/sec | 8 | 64 | Selene | 23.01-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training E2E | 160 | 1.28 Final Loss | - | 8 | 64 | Selene | 23.01-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P1 | 119 | 1.41 Final Loss | 71,775.3 sequences/sec | 16 | 128 | Selene | 23.01-py3 | Mixed | 256 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P2 | 35 | 1.28 Final Loss | 25,005.24 sequences/sec | 16 | 128 | Selene | 23.01-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training E2E | 91 | 1.28 Final Loss | - | 16 | 128 | Selene | 23.01-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P1 | 56 | 1.41 Final Loss | 138,802.3 sequences/sec | 32 | 256 | Selene | 23.01-py3 | Mixed | 256 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P2 | 18 | 1.27 Final Loss | 51,148.79 sequences/sec | 32 | 256 | Selene | 23.01-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training E2E | 44 | 1.27 Final Loss | - | 32 | 256 | Selene | 23.01-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P1 | 32 | 1.4 Final Loss | 245,286.91 sequences/sec | 64 | 512 | Selene | 23.01-py3 | Mixed | 128 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P2 | 11 | 1.26 Final Loss | 89,069.47 sequences/sec | 64 | 512 | Selene | 23.01-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training E2E | 25 | 1.26 Final Loss | - | 64 | 512 | Selene | 23.01-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P1 | 23 | 1.4 Final Loss | 343,139.17 sequences/sec | 128 | 1024 | Selene | 23.01-py3 | Mixed | 64 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P2 | 8 | 1.24 Final Loss | 143,583.49 sequences/sec | 128 | 1024 | Selene | 23.01-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training E2E | 18 | 1.24 Final Loss | - | 128 | 1024 | Selene | 23.01-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
BERT-Large Pre-Training Phase 1 Sequence Length = 128
BERT-Large Pre-Training Phase 2 Sequence Length = 512
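One way to read the multi-node figures is as scaling efficiency: the throughput speedup over the smallest configuration divided by the increase in GPU count. The sketch below applies that common definition to the Phase 1 (P1) throughput values listed above. The function name `scaling_efficiency` and the choice of the 8-GPU run as the baseline are assumptions made for illustration, and the batch size in the P1 rows drops from 256 to 128 and 64 at 512 and 1024 GPUs, so the largest runs are not a like-for-like comparison.

```python
from typing import Dict

# Number of GPUs -> sequences/sec, copied from the BERT-Large Pre-Training P1 rows.
P1_THROUGHPUT: Dict[int, float] = {
    8: 4754.06,
    16: 9457.6,
    64: 37052.87,
    128: 71775.3,
    256: 138802.3,
    512: 245286.91,
    1024: 343139.17,
}


def scaling_efficiency(throughput: Dict[int, float]) -> Dict[int, float]:
    """Return throughput speedup divided by GPU-count ratio, relative to the smallest run."""
    base_gpus = min(throughput)
    base_rate = throughput[base_gpus]
    return {
        gpus: (rate / base_rate) / (gpus / base_gpus)
        for gpus, rate in sorted(throughput.items())
    }


if __name__ == "__main__":
    for gpus, efficiency in scaling_efficiency(P1_THROUGHPUT).items():
        print(f"{gpus:>5} GPUs: {efficiency:.1%} of linear scaling")
```

A value of 1.0 means throughput grew in exact proportion to the GPU count; values below 1.0 indicate the fraction of linear scaling that was retained.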
AI Inference
Real-world inferencing demands high throughput and low latency with maximum efficiency across use cases. An industry-leading solution lets customers quickly deploy AI models into real-world production with the highest performance from data center to edge.
AI Pipeline
NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.