Review the latest GPU-accelerated performance data for popular deep learning applications.

Please refer to the Measuring Training and Inferencing Performance on NVIDIA AI Platforms Reviewer’s Guide for instructions on how to reproduce these performance claims.


Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. This is the best methodology for testing whether AI systems are ready to be deployed in the field, as the networks can then deliver meaningful results (for example, correctly performing image recognition on video streams). Training that does not converge measures the hardware’s throughput capabilities on the specified AI network, but it is not representative of real-world applications.
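The distinction can be sketched as a toy loop. This is illustrative only; `train_one_epoch` and `evaluate` are hypothetical stand-ins, not part of any benchmark harness, and the convergence criterion mirrors the accuracy targets reported in the tables below:

```python
import time

def time_to_train(train_one_epoch, evaluate, target_accuracy, max_epochs=100):
    """Train until a convergence criterion is met and report wall-clock time.

    Returns (minutes, epochs) on convergence, or None if the target accuracy
    is never reached -- a run that only measures raw throughput.
    """
    start = time.perf_counter()
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        if evaluate() >= target_accuracy:  # convergence criterion, e.g. 76.96% Top-1
            minutes = (time.perf_counter() - start) / 60
            return minutes, epoch
    return None  # never converged: not a valid time-to-train result

# Toy stand-ins so the sketch runs end to end: accuracy rises 0.2 per "epoch".
acc = [0.0]
result = time_to_train(lambda: acc.__setitem__(0, acc[0] + 0.2),
                       lambda: acc[0], target_accuracy=0.75)
```

A throughput-only run corresponds to the `None` branch: the hardware was exercised, but no accuracy target was met.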

NVIDIA’s complete solution stack, from GPUs to libraries to containers on NVIDIA GPU Cloud (NGC), allows data scientists to quickly get up and running with deep learning. NVIDIA® A100 Tensor Core GPUs provide unprecedented acceleration at every scale, setting records in MLPerf, the AI industry’s leading benchmark and a testament to our accelerated platform approach.

NVIDIA Performance on MLPerf 0.7 AI Benchmarks

BERT Time to Train on A100

PyTorch | Precision: Mixed | Dataset: Wikipedia 2020/01/01 | Convergence criteria - refer to MLPerf requirements

MLPerf Training Performance

NVIDIA A100 Performance on MLPerf 0.7 AI Benchmarks

Framework Network Time to Train (mins) GPU Server MLPerf-ID Precision Dataset GPU Version
MXNet ResNet-50 v1.5 39.78 8x A100 DGX-A100 0.7-18 Mixed ImageNet2012 A100-SXM4-40GB
23.75 16x A100 DGX-A100 0.7-21 Mixed ImageNet2012 A100-SXM4-40GB
1.06 768x A100 DGX-A100 0.7-32 Mixed ImageNet2012 A100-SXM4-40GB
0.83 1536x A100 DGX-A100 0.7-35 Mixed ImageNet2012 A100-SXM4-40GB
0.76 1840x A100 DGX-A100 0.7-37 Mixed ImageNet2012 A100-SXM4-40GB
SSD 2.25 64x A100 DGX-A100 0.7-25 Mixed COCO2017 A100-SXM4-40GB
0.89 512x A100 DGX-A100 0.7-31 Mixed COCO2017 A100-SXM4-40GB
0.82 1024x A100 DGX-A100 0.7-33 Mixed COCO2017 A100-SXM4-40GB
PyTorch BERT 49.01 8x A100 DGX-A100 0.7-19 Mixed Wikipedia 2020/01/01 A100-SXM4-40GB
30.63 16x A100 DGX-A100 0.7-22 Mixed Wikipedia 2020/01/01 A100-SXM4-40GB
3.36 256x A100 DGX-A100 0.7-28 Mixed Wikipedia 2020/01/01 A100-SXM4-40GB
1.48 1024x A100 DGX-A100 0.7-34 Mixed Wikipedia 2020/01/01 A100-SXM4-40GB
0.81 2048x A100 DGX-A100 0.7-38 Mixed Wikipedia 2020/01/01 A100-SXM4-40GB
DLRM 4.43 8x A100 DGX-A100 0.7-19 Mixed Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) A100-SXM4-40GB
GNMT 7.81 8x A100 DGX-A100 0.7-19 Mixed WMT16 English-German A100-SXM4-40GB
4.94 16x A100 DGX-A100 0.7-22 Mixed WMT16 English-German A100-SXM4-40GB
0.98 256x A100 DGX-A100 0.7-28 Mixed WMT16 English-German A100-SXM4-40GB
0.71 1024x A100 DGX-A100 0.7-34 Mixed WMT16 English-German A100-SXM4-40GB
Mask R-CNN 82.16 8x A100 DGX-A100 0.7-19 Mixed COCO2017 A100-SXM4-40GB
44.21 16x A100 DGX-A100 0.7-22 Mixed COCO2017 A100-SXM4-40GB
28.46 32x A100 DGX-A100 0.7-24 Mixed COCO2017 A100-SXM4-40GB
10.46 256x A100 DGX-A100 0.7-28 Mixed COCO2017 A100-SXM4-40GB
SSD 10.21 8x A100 DGX-A100 0.7-19 Mixed COCO2017 A100-SXM4-40GB
5.68 16x A100 DGX-A100 0.7-22 Mixed COCO2017 A100-SXM4-40GB
Transformer 7.84 8x A100 DGX-A100 0.7-19 Mixed WMT17 English-German A100-SXM4-40GB
4.35 16x A100 DGX-A100 0.7-22 Mixed WMT17 English-German A100-SXM4-40GB
1.8 80x A100 DGX-A100 0.7-26 Mixed WMT17 English-German A100-SXM4-40GB
1.02 160x A100 DGX-A100 0.7-27 Mixed WMT17 English-German A100-SXM4-40GB
0.62 480x A100 DGX-A100 0.7-30 Mixed WMT17 English-German A100-SXM4-40GB
TensorFlow MiniGo 299.73 8x A100 DGX-A100 0.7-20 Mixed N/A A100-SXM4-40GB
165.72 16x A100 DGX-A100 0.7-23 Mixed N/A A100-SXM4-40GB
29.7 256x A100 DGX-A100 0.7-29 Mixed N/A A100-SXM4-40GB
17.07 1792x A100 DGX-A100 0.7-36 Mixed N/A A100-SXM4-40GB
NVIDIA Merlin HugeCTR DLRM 3.33 8x A100 DGX-A100 0.7-17 Mixed Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) A100-SXM4-40GB

Training Natural Language Processing

BERT Pre-Training Throughput

DGX-A100 server w/ 8x NVIDIA A100 on PyTorch | DGX-1 server w/ 8x NVIDIA V100 on PyTorch (2/3) Phase 1 and (1/3) Phase 2 | Precision: FP16 for A100 and Mixed for V100 | Sequence Length for Phase 1 = 128 and Phase 2 = 512

NVIDIA A100 BERT Training Benchmarks

Framework Network Throughput GPU Server Container Precision Batch Size Dataset GPU Version
PyTorch BERT Pre-Training 2,274 sequences/sec 8x A100 DGX-A100 - FP16 - Wikipedia+BookCorpus A100 SXM4-40GB

DGX-A100 server w/ 8x NVIDIA A100 on PyTorch (2/3) Phase 1 and (1/3) Phase 2 | Sequence Length for Phase 1 = 128 and Phase 2 = 512

Converged Training Performance

A100 Training Performance

Framework Network Time to Train (mins) Accuracy Throughput GPU Server Container Precision Batch Size Dataset GPU Version
PyTorch Mask R-CNN 189.33 0.38 AP BBox 155 images/sec 8x A100 DGX-A100 20.08-py3 TF32 8 COCO 2014 A100-SXM4-40GB
ResNet-50 V1.5 197.67 76.96 Top1 9,942 images/sec 8x A100 DGX-A100 20.08-py3 Mixed 256 ImageNet2012 A100-SXM4-40GB
ResNeXt101 425.30 78.93 Top1 4,596 images/sec 8x A100 DGX-A100 20.08-py3 Mixed 256 Imagenet2012 A100-SXM4-40GB
SE-ResNeXt101 510.77 79.31 Top1 3,817 images/sec 8x A100 DGX-A100 20.07-py3 Mixed 256 Imagenet2012 A100-SXM4-40GB
SSD v1.1 44.25 0.25 mAP 3,085 images/sec 8x A100 DGX-A100 20.08-py3 Mixed 128 COCO 2017 A100-SXM4-40GB
Tacotron2 122.78 0.56 Training Loss 248,899 total output mels/sec 8x A100 DGX-A100 20.08-py3 TF32 128 LJSpeech 1.1 A100-SXM4-40GB
Transformer 123.28 27.7 BLEU Score 576,148 words/sec 8x A100 DGX-A100 20.06-py3 Mixed 10240 wmt14-en-de A100-SXM4-40GB
FastPitch 108.17 0.21 Training Loss 856,465 frames/sec 8x A100 DGX-A100 20.08-py3 Mixed 32 LJSpeech 1.1 A100-SXM4-40GB
GNMT V2 18 24.23 BLEU Score 1,065,072 total tokens/sec 8x A100 DGX-A100 20.06-py3 Mixed 256 wmt16-en-de A100-SXM4-40GB
BERT-LARGE 3.37 91.03 F1 876 sequences/sec 8x A100 DGX-A100 20.08-py3 Mixed 32 SQuAD v1.1 A100-SXM4-40GB
TensorFlow ResNet-50 V1.5 112.45 76.91 Top1 17,281 images/sec 8x A100 DGX-A100 20.08-py3 Mixed 256 ImageNet2012 A100-SXM4-40GB
Mask R-CNN 184.88 0.37 AP BBox 146 samples/sec 8x A100 DGX-A100 20.08-py3 Mixed 4 COCO 2017 A100-SXM4-40GB
U-Net Medical 4.37 0.89 DICE Score 810 images/sec 8x A100 DGX-A100 20.07-py3 Mixed 8 EM segmentation challenge A100-SXM4-40GB
BERT-LARGE 11.40 90.85 F1 719 sequences/sec 8x A100 DGX-A100 20.07-py3 Mixed 24 SQuAD v1.1 A100-SXM4-40GB

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec

V100 Training Performance

Framework Network Time to Train (mins) Accuracy Throughput GPU Server Container Precision Batch Size Dataset GPU Version
MXNet ResNet-50 v1.5 161.92 76.96 Top1 12,192 images/sec 8x V100 DGX-2H 20.08-py3 Mixed 256 ImageNet2012 V100-SXM3-32GB-H
PyTorch Mask R-CNN 276.75 0.38 AP BBox 106 images/sec 8x V100 DGX-2 20.08-py3 Mixed 8 COCO 2014 V100-SXM3-32GB
ResNet-50 V1.5 300.25 77.18 Top1 6,549 images/sec 8x V100 DGX-2 20.08-py3 Mixed 256 ImageNet2012 V100-SXM3-32GB
ResNeXt101 720.80 79.07 Top1 2,724 images/sec 8x V100 DGX-2 20.08-py3 Mixed 128 Imagenet2012 V100-SXM3-32GB
SE-ResNeXt101 876.85 79.28 Top1 2,231 images/sec 8x V100 DGX-2 20.07-py3 Mixed 128 Imagenet2012 V100-SXM3-32GB
Tacotron2 239.72 0.54 Training Loss 133,014 total output mels/sec 8x V100 DGX-2 20.08-py3 Mixed 104 LJSpeech 1.1 V100-SXM3-32GB
WaveGlow 618.57 -5.74 Training Loss 677,363 output samples/sec 8x V100 DGX-2H 20.08-py3 Mixed 10 LJSpeech 1.1 V100-SXM3-32GB-H
Transformer 462.87 27.52 BLEU Score 214,856 words/sec 8x V100 DGX-2 20.08-py3 Mixed 5120 wmt14-en-de V100-SXM3-32GB
FastPitch 170.18 0.21 Training Loss 538,963 frames/sec 8x V100 DGX-2 20.08-py3 Mixed 32 LJSpeech 1.1 V100-SXM3-32GB
GNMT V2 31 24.3 BLEU Score 483,630 total tokens/sec 8x V100 DGX-2 20.08-py3 Mixed 128 wmt16-en-de V100-SXM3-32GB
NCF 0.52 0.96 Hit Rate at 10 107,510,258 samples/sec 8x V100 DGX-2H 20.08-py3 Mixed 131072 MovieLens 20M V100-SXM3-32GB-H
BERT-LARGE 8.02 91.18 F1 368 sequences/sec 8x V100 DGX-2 20.08-py3 Mixed 10 SQuAD v1.1 V100-SXM3-32GB
TensorFlow ResNet-50 V1.5 202.37 76.63 Top1 9,550 images/sec 8x V100 DGX-2 20.08-py3 Mixed 256 ImageNet2012 V100-SXM3-32GB
SSD v1.2 58.53 0.29 mAP 1,169 images/sec 8x V100 DGX-2H 20.08-py3 Mixed 32 COCO 2017 V100-SXM3-32GB-H
Mask R-CNN 295.75 0.38 AP BBox 87 samples/sec 8x V100 DGX-2 20.08-py3 Mixed 4 COCO 2017 V100-SXM3-32GB
U-Net Industrial 1.13 0.99 IoU Threshold 0.95 576 images/sec 8x V100 DGX-2H 20.08-py3 Mixed 2 DAGM2007 V100-SXM3-32GB-H
U-Net Medical 4.72 0.89 DICE Score 442 images/sec 8x V100 DGX-2H 20.08-py3 Mixed 8 EM segmentation challenge V100-SXM3-32GB-H
GNMT V2 157.75 24.3 BLEU Score 171,381 total tokens/sec 8x V100 DGX-2H 20.08-py3 Mixed 128 wmt16-en-de V100-SXM3-32GB-H
BERT-LARGE 19.63 90.86 F1 319 sequences/sec 8x V100 DGX-2 20.08-py3 Mixed 10 SQuAD v1.1 V100-SXM3-32GB

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec

T4 Training Performance

Framework Network Time to Train (mins) Accuracy Throughput GPU Server Container Precision Batch Size Dataset GPU Version
MXNet ResNet-50 v1.5 508.07 77.12 Top1 3,841 images/sec 8x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 192 ImageNet2012 NVIDIA T4
PyTorch Mask R-CNN 551.18 0.38 AP BBox 44 images/sec 8x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 4 COCO 2014 NVIDIA T4
ResNeXt101 1758.60 79.07 Top1 1,110 images/sec 8x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 128 Imagenet2012 NVIDIA T4
Tacotron2 273.12 0.54 Training Loss 113,999 total output mels/sec 8x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 104 LJSpeech 1.1 NVIDIA T4
WaveGlow 1140.48 -5.6 Training Loss 366,998 output samples/sec 8x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 10 LJSpeech 1.1 NVIDIA T4
Transformer 806.95 27.68 BLEU Score 83,760 words/sec 8x T4 Supermicro SYS-4029GP-TRT 20.07-py3 Mixed 5120 wmt14-en-de NVIDIA T4
FastPitch 325.15 0.21 Training Loss 278,184 frames/sec 8x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 32 LJSpeech 1.1 NVIDIA T4
GNMT V2 91 24.28 BLEU Score 158,450 total tokens/sec 8x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 128 wmt16-en-de NVIDIA T4
NCF 2.05 0.96 Hit Rate at 10 25,264,247 samples/sec 8x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 131072 MovieLens 20M NVIDIA T4
BERT-LARGE 23.25 91.25 F1 127 sequences/sec 8x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 10 SQuAD v1.1 NVIDIA T4
TensorFlow ResNet-50 V1.5 639.37 77.08 Top1 3,010 images/sec 8x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 256 ImageNet2012 NVIDIA T4
SSD v1.2 109.07 0.28 mAP 578 images/sec 8x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 32 COCO 2017 NVIDIA T4
U-Net Industrial 2.40 0.99 IoU Threshold 0.95 274 images/sec 8x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 2 DAGM2007 NVIDIA T4
U-Net Medical 13.57 0.89 DICE Score 151 images/sec 8x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 8 EM segmentation challenge NVIDIA T4
VAE-CF 10.20 0.43 NDCG@100 75,380 users processed/sec 8x T4 Supermicro SYS-4029GP-TRT 20.08-py3 FP32 3072 MovieLens 20M NVIDIA T4
GNMT V2 308.77 24.26 BLEU Score 64,567 total tokens/sec 8x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 128 wmt16-en-de NVIDIA T4
BERT-LARGE 58.53 90.96 F1 58 sequences/sec 8x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 3 SQuAD v1.1 NVIDIA T4

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec


Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. This is the best methodology for testing AI systems, and it is typically done on multi-accelerator systems (see the ‘Training-Convergence’ tab) to shorten training-to-convergence times, especially for the recurring monthly container builds.

Scenarios that are not typically used in real-world training, such as single-GPU throughput, are illustrated in the table below and are provided for reference as an indication of the platform’s single-chip throughput.

NVIDIA’s complete solution stack, from hardware to software, allows data scientists to deliver unprecedented acceleration at every scale. Visit NVIDIA GPU Cloud (NGC) to pull containers and quickly get up and running with deep learning.
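Comparing the single-GPU rows below against the 8-GPU results above gives a rough sense of scaling. As an illustration (the helper `scaling_efficiency` is hypothetical; the throughputs are the PyTorch ResNet-50 v1.5 A100 figures reported on this page):

```python
def scaling_efficiency(multi_gpu_throughput, single_gpu_throughput, num_gpus):
    """Speedup over one GPU divided by the GPU count (1.0 = perfect linear scaling)."""
    return multi_gpu_throughput / (single_gpu_throughput * num_gpus)

# PyTorch ResNet-50 v1.5 on A100: 9,942 images/sec on 8 GPUs vs 1,280 images/sec on 1.
eff = scaling_efficiency(9942, 1280, 8)
print(f"{eff:.0%}")  # prints 97%
```

Near-linear scaling like this is why converged training is normally run on multi-accelerator systems.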

Single GPU Training Performance

A100 Training Performance

Framework Network Throughput GPU Server Container Precision Batch Size Dataset GPU Version
PyTorch Mask R-CNN 25 images/sec 1x A100 DGX-A100 20.08-py3 TF32 8 COCO 2014 A100-SXM4-40GB
ResNet-50 V1.5 1,280 images/sec 1x A100 DGX-A100 20.08-py3 Mixed 256 ImageNet2012 A100-SXM4-40GB
ResNeXt101 582 images/sec 1x A100 DGX-A100 20.08-py3 Mixed 256 Imagenet2012 A100-SXM4-40GB
SE-ResNeXt101 490 images/sec 1x A100 DGX-A100 20.08-py3 Mixed 256 Imagenet2012 A100-SXM4-40GB
SSD v1.1 401 images/sec 1x A100 DGX-A100 20.08-py3 Mixed 128 COCO 2017 A100-SXM4-40GB
Tacotron2 32,287 total output mels/sec 1x A100 DGX-A100 20.08-py3 TF32 128 LJSpeech 1.1 A100-SXM4-40GB
Transformer 81,674 words/sec 1x A100 DGX-A100 20.06-py3 Mixed 10240 wmt14-en-de A100-SXM4-40GB
FastPitch 168,763 frames/sec 1x A100 DGX-A100 20.08-py3 Mixed 128 LJSpeech 1.1 A100-SXM4-40GB
GNMT V2 146,872 total tokens/sec 1x A100 DGX-A100 20.08-py3 Mixed 128 wmt16-en-de A100-SXM4-40GB
NCF 32,493,781 samples/sec 1x A100 DGX-A100 20.08-py3 Mixed 1048576 MovieLens 20M A100-SXM4-40GB
BERT-LARGE 116 sequences/sec 1x A100 DGX-A100 20.08-py3 Mixed 32 SQuAD v1.1 A100-SXM4-40GB
TensorFlow U-Net Industrial 270 images/sec 1x A100 DGX-A100 20.08-py3 Mixed 16 DAGM2007 A100-SXM4-40GB
U-Net Medical 132 images/sec 1x A100 DGX-A100 20.08-py3 Mixed 8 EM segmentation challenge A100-SXM4-40GB
VAE-CF 365,343 users processed/sec 1x A100 DGX-A100 20.08-py3 Mixed 24576 MovieLens 20M A100-SXM4-40GB
BERT-LARGE 96 sequences/sec 1x A100 DGX-A100 20.08-py3 Mixed 24 SQuAD v1.1 A100-SXM4-40GB

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec

V100 Training Performance

Framework Network Throughput GPU Server Container Precision Batch Size Dataset GPU Version
MXNet ResNet-50 v1.5 1,625 images/sec 1x V100 DGX-2H 20.08-py3 Mixed 256 ImageNet2012 V100-SXM3-32GB-H
PyTorch Mask R-CNN 17 images/sec 1x V100 DGX-2 20.08-py3 Mixed 8 COCO 2014 V100-SXM3-32GB
ResNet-50 V1.5 879 images/sec 1x V100 DGX-2 20.08-py3 Mixed 256 ImageNet2012 V100-SXM3-32GB
ResNeXt101 368 images/sec 1x V100 DGX-2 20.08-py3 Mixed 128 Imagenet2012 V100-SXM3-32GB
SE-ResNeXt101 307 images/sec 1x V100 DGX-2 20.08-py3 Mixed 128 Imagenet2012 V100-SXM3-32GB
SSD v1.1 244 images/sec 1x V100 DGX-2 20.08-py3 Mixed 64 COCO 2017 V100-SXM3-32GB
Tacotron2 18,281 total output mels/sec 1x V100 DGX-2 20.08-py3 Mixed 104 LJSpeech 1.1 V100-SXM3-32GB
WaveGlow 121,976 output samples/sec 1x V100 DGX-2H 20.08-py3 Mixed 10 LJSpeech 1.1 V100-SXM3-32GB-H
Transformer 38,119 words/sec 1x V100 DGX-2 20.06-py3 Mixed 5120 wmt14-en-de V100-SXM3-32GB
FastPitch 110,786 frames/sec 1x V100 DGX-2 20.08-py3 Mixed 64 LJSpeech 1.1 V100-SXM3-32GB
GNMT V2 83,166 total tokens/sec 1x V100 DGX-2 20.08-py3 Mixed 128 wmt16-en-de V100-SXM3-32GB
NCF 21,941,715 samples/sec 1x V100 DGX-2 20.08-py3 Mixed 1048576 MovieLens 20M V100-SXM3-32GB
BERT-LARGE 53 sequences/sec 1x V100 DGX-2 20.08-py3 Mixed 10 SQuAD v1.1 V100-SXM3-32GB
TensorFlow Mask R-CNN 15 samples/sec 1x V100 DGX-2H 20.08-py3 Mixed 4 COCO 2017 V100-SXM3-32GB-H
ResNet-50 V1.5 1,401 images/sec 1x V100 DGX-2H 20.08-py3 Mixed 256 ImageNet2012 V100-SXM3-32GB-H
SSD v1.2 259 images/sec 1x V100 DGX-2H 20.08-py3 Mixed 32 COCO 2017 V100-SXM3-32GB-H
U-Net Industrial 105 images/sec 1x V100 DGX-2 20.08-py3 Mixed 16 DAGM2007 V100-SXM3-32GB
U-Net Medical 64 images/sec 1x V100 DGX-2 20.08-py3 Mixed 8 EM segmentation challenge V100-SXM3-32GB
VAE-CF 225,056 users processed/sec 1x V100 DGX-2 20.08-py3 Mixed 24576 MovieLens 20M V100-SXM3-32GB
Wide and Deep 312,776 samples/sec 1x V100 DGX-2H 20.08-py3 Mixed 131072 Kaggle Outbrain Click Prediction V100-SXM3-32GB-H
GNMT V2 25,399 total tokens/sec 1x V100 DGX-2H 20.08-py3 Mixed 128 wmt16-en-de V100-SXM3-32GB-H
NCF 25,694,022 samples/sec 1x V100 DGX-2 20.08-py3 Mixed 1048576 MovieLens 20M V100-SXM3-32GB
BERT-LARGE 44 sequences/sec 1x V100 DGX-2 20.08-py3 Mixed 10 SQuAD v1.1 V100-SXM3-32GB

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec

T4 Training Performance

Framework Network Throughput GPU Server Container Precision Batch Size Dataset GPU Version
MXNet ResNet-50 v1.5 511 images/sec 1x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 192 ImageNet2012 NVIDIA T4
PyTorch Mask R-CNN 8 images/sec 1x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 4 COCO 2014 NVIDIA T4
ResNet-50 V1.5 284 images/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed 256 ImageNet2012 NVIDIA T4
ResNeXt101 144 images/sec 1x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 128 Imagenet2012 NVIDIA T4
SE-ResNeXt101 123 images/sec 1x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 128 Imagenet2012 NVIDIA T4
SSD v1.1 83 images/sec 1x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 64 COCO 2017 NVIDIA T4
Tacotron2 15,376 total output mels/sec 1x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 104 LJSpeech 1.1 NVIDIA T4
WaveGlow 51,491 output samples/sec 1x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 10 LJSpeech 1.1 NVIDIA T4
Transformer 13,394 words/sec 1x T4 Supermicro SYS-4029GP-TRT 20.06-py3 Mixed 5120 wmt14-en-de NVIDIA T4
FastPitch 42,844 frames/sec 1x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 64 LJSpeech 1.1 NVIDIA T4
GNMT V2 31,566 total tokens/sec 1x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 128 wmt16-en-de NVIDIA T4
NCF 8,013,792 samples/sec 1x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 1048576 MovieLens 20M NVIDIA T4
BERT-LARGE 18 sequences/sec 1x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 10 SQuAD v1.1 NVIDIA T4
TensorFlow ResNet-50 V1.5 388 images/sec 1x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 256 ImageNet2012 NVIDIA T4
SSD v1.2 95 images/sec 1x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 32 COCO 2017 NVIDIA T4
U-Net Industrial 40 images/sec 1x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 16 DAGM2007 NVIDIA T4
U-Net Medical 21 images/sec 1x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 8 EM segmentation challenge NVIDIA T4
VAE-CF 81,247 users processed/sec 1x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 24576 MovieLens 20M NVIDIA T4
GNMT V2 9,929 total tokens/sec 1x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 128 wmt16-en-de NVIDIA T4
NCF 10,302,138 samples/sec 1x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 1048576 MovieLens 20M NVIDIA T4
BERT-LARGE 11 sequences/sec 1x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 3 SQuAD v1.1 NVIDIA T4

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec

 

Real-world AI inferencing demands high throughput and low latency with maximum efficiency across use cases. An industry-leading solution enables customers to quickly deploy AI models into real-world production with the highest performance, from data centers to the edge.

NVIDIA landed top performance spots on all MLPerf Inference 0.7 tests, the AI industry’s leading benchmark. NVIDIA® TensorRT™ running on NVIDIA Tensor Core GPUs enables the most efficient deep learning inference performance across multiple application areas and models. This versatility gives data scientists wide latitude to create the optimal low-latency solution. Visit NVIDIA GPU Cloud (NGC) to download any of these containers and immediately race into production. The inference whitepaper provides an overview of inference platforms.

Measuring inference performance involves balancing a lot of variables. PLASTER is an acronym that describes the key elements for measuring deep learning performance. Each letter identifies a factor (Programmability, Latency, Accuracy, Size of Model, Throughput, Energy Efficiency, Rate of Learning) that must be considered to arrive at the right set of tradeoffs and to produce a successful deep learning implementation. Refer to the PLASTER whitepaper for more details.
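Two of the PLASTER factors, Throughput and Energy Efficiency, relate the same way the efficiency columns in the tables below are computed: throughput divided by board power. A minimal illustration (the `perf_per_watt` helper is hypothetical, and the 70 W figure assumes the T4’s nominal board power; measured power varies slightly, so results will not match the tables to the last digit):

```python
def perf_per_watt(throughput, board_power_watts):
    """Energy efficiency as reported in the tables: throughput per watt of board power."""
    return throughput / board_power_watts

# e.g. a T4 sustaining 5,305 images/sec at an assumed 70 W board power:
print(round(perf_per_watt(5305, 70), 2))  # prints 75.79
```

This is close to the 75.83 images/sec/watt reported for that row, the small gap reflecting the actual measured board power.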


MLPerf Inference v0.7 Performance Benchmarks

Offline Scenario

Network Throughput GPU Server Dataset GPU Version
ResNet-50 v1.5 298,647 samples/sec 8x A100 NVIDIA DGX-A100 ImageNet A100-SXM4-40GB
48,898 samples/sec 8x T4 Supermicro 4029GP-TRT-OTO-28 ImageNet NVIDIA T4
SSD ResNet-34 7,788 samples/sec 8x A100 NVIDIA DGX-A100 COCO A100-SXM4-40GB
1,112 samples/sec 8x T4 Supermicro 4029GP-TRT-OTO-28 COCO NVIDIA T4
3D-UNet 328 samples/sec 8x A100 NVIDIA DGX-A100 BraTS 2019 A100-SXM4-40GB
58 samples/sec 8x T4 Supermicro 4029GP-TRT-OTO-28 BraTS 2019 NVIDIA T4
RNN-T 82,401 samples/sec 8x A100 NVIDIA DGX-A100 LibriSpeech A100-SXM4-40GB
11,963 samples/sec 8x T4 Supermicro 4029GP-TRT-OTO-28 LibriSpeech NVIDIA T4
BERT 26,625 samples/sec 8x A100 NVIDIA DGX-A100 SQuAD v1.1 A100-SXM4-40GB
3,495 samples/sec 8x T4 Supermicro 4029GP-TRT-OTO-28 SQuAD v1.1 NVIDIA T4
DLRM 2,113,510 samples/sec 8x A100 NVIDIA DGX-A100 Criteo 1TB Click Logs A100-SXM4-40GB
272,416 samples/sec 8x T4 Supermicro 4029GP-TRT-OTO-28 Criteo 1TB Click Logs NVIDIA T4

MLPerf v0.7 Inference Closed: ResNet-50 v1.5, SSD ResNet-34, 3D U-Net 99.9% accuracy target, RNN-T, BERT 99% accuracy target, DLRM 99.9% accuracy target: 0.7-111, 0.7-113. MLPerf name and logo are trademarks. See www.mlperf.org for more information.

 

Inference Performance of NVIDIA A100, V100 and T4

Inference Natural Language Processing

BERT Inference Throughput

DGX-A100 server w/ 1x NVIDIA A100 with 7 MIG instances of 1g.5gb | Batch Size = 94 | Precision: INT8 | Sequence Length = 128
DGX-1 server w/ 1x NVIDIA V100 | TensorRT 7.1 | Batch Size = 256 | Precision: Mixed | Sequence Length = 128

 

NVIDIA A100 BERT Inference Benchmarks

Network Network Type Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
BERT-Large with Sparsity Attention 94 6,188 sequences/sec - - 1x A100 DGX-A100 - INT8 SQuAD v1.1 - A100 SXM4-40GB

A100 with 7 MIG instances of 1g.5gb | Sequence length=128 | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container

Inference Image Classification on CNNs with TensorRT

ResNet-50 v1.5 Throughput

DGX-A100: EPYC 7742@2.25GHz w/ 1x NVIDIA A100-SXM4-40GB | TensorRT 7.1.3 | Batch Size = 128 | 20.08-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 7.1.3 | Batch Size = 128 | 20.08-py3 | Precision: Mixed | Dataset: Synthetic
Supermicro SYS-1029GQ-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 7.1.3 | Batch Size = 128 | 20.08-py3 | Precision: INT8 | Dataset: Synthetic

 
 

ResNet-50 v1.5 Latency

DGX-A100: EPYC 7742@2.25GHz w/ 1x NVIDIA A100-SXM4-40GB | TensorRT 7.1.3 | Batch Size = 1 | 20.08-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 7.1.3 | Batch Size = 1 | 20.08-py3 | Precision: Mixed | Dataset: Synthetic
Supermicro SYS-1029GQ-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 7.1.3 | Batch Size = 1 | 20.08-py3 | Precision: INT8 | Dataset: Synthetic

 
 

ResNet-50 v1.5 Power Efficiency

DGX-A100: EPYC 7742@2.25GHz w/ 1x NVIDIA A100-SXM4-40GB | TensorRT 7.1.3 | Batch Size = 128 | 20.08-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 7.1.3 | Batch Size = 128 | 20.08-py3 | Precision: Mixed | Dataset: Synthetic
Supermicro SYS-1029GQ-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 7.1.3 | Batch Size = 128 | 20.08-py3 | Precision: INT8 | Dataset: Synthetic

 

A100 Inference Performance

Network Batch Size 1/7 MIG Throughput 7 MIG Throughput Full Chip Throughput GPU Server Container Precision Dataset Framework GPU Version
BERT-Large 1 240 sequences/sec 1,680 sequences/sec 625 sequences/sec 1x A100 DGX-A100 - INT8 SQuAD v1.1 TensorRT 7.1 A100-SXM4-40GB
256 574 sequences/sec 4,018 sequences/sec 4,125 sequences/sec 1x A100 DGX-A100 - INT8 SQuAD v1.1 TensorRT 7.1 A100-SXM4-40GB
Jasper 1 115 inferences/sec 804 inferences/sec 227 inferences/sec 1x A100 DGX-A100 - FP16 LibriSpeech TensorRT 7.1 A100-SXM4-40GB
64 176 inferences/sec 1,230 inferences/sec 1,225 inferences/sec 1x A100 DGX-A100 - FP16 LibriSpeech TensorRT 7.1 A100-SXM4-40GB
ResNet-50v1.5 128 - - 23,617 images/sec 1x A100 DGX-A100 20.08-py3 INT8 Synthetic TensorRT 7.1.3 A100-SXM4-40GB

A hyphen in the Container column indicates a pre-release container
Sequence Length = 5.12s for Jasper
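The MIG columns above scale almost linearly: seven 1g.5gb instances deliver close to 7x the single-instance throughput. A quick check against the BERT-Large batch-1 row (the `mig_scaling` helper is illustrative; the throughputs are taken from the table above):

```python
def mig_scaling(single_instance, seven_instance, n=7):
    """Fraction of perfect linear scaling achieved across n MIG instances."""
    return seven_instance / (single_instance * n)

# BERT-Large, batch 1: 240 sequences/sec per instance -> 1,680 across 7 instances.
print(f"{mig_scaling(240, 1680):.0%}")  # prints 100%
```

This isolation is what makes MIG attractive for small-batch inference: partitioning the chip costs little aggregate throughput.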

 

V100 Inference Performance

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
ResNet-50 1 1,216 images/sec 7.34 images/sec/watt 0.82 1x V100 DGX-2 20.08-py3 INT8 Synthetic TensorRT 7.1.3 V100-SXM3-32GB
2 1,857 images/sec 10.51 images/sec/watt 1.08 1x V100 DGX-2 20.08-py3 Mixed Synthetic TensorRT 7.1.3 V100-SXM3-32GB
8 4,169 images/sec 17.4 images/sec/watt 1.92 1x V100 DGX-2 20.08-py3 Mixed Synthetic TensorRT 7.1.3 V100-SXM3-32GB
52 7,772 images/sec 23 images/sec/watt 6.7 1x V100 DGX-2 20.08-py3 Mixed Synthetic TensorRT 7.1.3 V100-SXM3-32GB
128 7,819 images/sec 22.8 images/sec/watt 16.37 1x V100 DGX-2 20.08-py3 Mixed Synthetic TensorRT 7.1.3 V100-SXM3-32GB
ResNet-50v1.5 1 1,065 images/sec 7.21 images/sec/watt 0.94 1x V100 DGX-2 20.08-py3 Mixed Synthetic TensorRT 7.1.3 V100-SXM3-32GB
2 1,846 images/sec 10.41 images/sec/watt 1.08 1x V100 DGX-2 20.08-py3 Mixed Synthetic TensorRT 7.1.3 V100-SXM3-32GB
8 4,086 images/sec 16.27 images/sec/watt 1.96 1x V100 DGX-2 20.08-py3 Mixed Synthetic TensorRT 7.1.3 V100-SXM3-32GB
51 7,319 images/sec 21 images/sec/watt 7 1x V100 DGX-2 20.07-py3 Mixed Synthetic TensorRT 7.1.3 V100-SXM3-32GB
128 7,464 images/sec 21.89 images/sec/watt 17.15 1x V100 DGX-2 20.08-py3 Mixed Synthetic TensorRT 7.1.3 V100-SXM3-32GB
NCF 16384 30,774,508 samples/sec - samples/sec/watt - 1x V100 DGX-2 20.06-py3 Mixed MovieLens 20M PyTorch 1.6.0a0+9907a3e V100-SXM3-32GB
BERT-BASE 1 770 sequences/sec - sequences/sec/watt 1.3 1x V100 DGX-2 20.03-py3 Mixed Sample Text TensorRT 7.0.0 V100-SXM3-32GB
2 1,307 sequences/sec - sequences/sec/watt 1.5 1x V100 DGX-2 20.03-py3 Mixed Sample Text TensorRT 7.0.0 V100-SXM3-32GB
8 2,389 sequences/sec - sequences/sec/watt 3.4 1x V100 DGX-2 20.03-py3 Mixed Sample Text TensorRT 7.0.0 V100-SXM3-32GB
26 3,002 sequences/sec 13.36 sequences/sec/watt 8.66 1x V100 DGX-2 20.03-py3 Mixed Sample Text TensorRT 7.0.0 V100-SXM3-32GB
128 3,033 sequences/sec 11.82 sequences/sec/watt 42.2 1x V100 DGX-2 20.03-py3 Mixed Sample Text TensorRT 7.0.0 V100-SXM3-32GB
BERT-LARGE 1 309 sequences/sec 2.8 sequences/sec/watt 3.2 1x V100 DGX-2 20.03-py3 Mixed Sample Text TensorRT 7.0.0 V100-SXM3-32GB
2 507 sequences/sec 3.68 sequences/sec/watt 3.95 1x V100 DGX-2 20.03-py3 Mixed Sample Text TensorRT 7.0.0 V100-SXM3-32GB
8 773 sequences/sec 3.8 sequences/sec/watt 10 1x V100 DGX-2 20.03-py3 Mixed Sample Text TensorRT 7.0.0 V100-SXM3-32GB
128 1,043 sequences/sec 3.1 sequences/sec/watt 123 1x V100 DGX-2H 20.03-py3 Mixed Sample Text TensorRT 7.0.0 V100-SXM3-32GB-H
Mask R-CNN 1 17 images/sec 0.08 images/sec/watt - 1x V100 DGX-2 20.08-py3 FP32 COCO 2014 PyTorch 1.7.0a0+8deb4fe V100-SXM3-32GB
2 19 images/sec 0.11 images/sec/watt - 1x V100 DGX-2 20.08-py3 Mixed COCO 2014 PyTorch 1.7.0a0+8deb4fe V100-SXM3-32GB
8 21 images/sec 0.13 images/sec/watt - 1x V100 DGX-2 20.08-py3 Mixed COCO 2014 PyTorch 1.7.0a0+8deb4fe V100-SXM3-32GB
Tacotron2 1 1,187 total output mels/sec - total output mels/sec/watt 1.68 1x V100 DGX-2 20.08-py3 FP32 LJSpeech 1.1 PyTorch 1.7.0a0+8deb4fe V100-SXM3-32GB
4 4,968 total output mels/sec - total output mels/sec/watt 1.6 1x V100 DGX-2 20.08-py3 FP32 LJSpeech 1.1 PyTorch 1.7.0a0+8deb4fe V100-SXM3-32GB
WaveGlow 1 612,782 output samples/sec - output samples/sec/watt 0.37 1x V100 DGX-2 20.08-py3 Mixed LJSpeech 1.1 PyTorch 1.7.0a0+8deb4fe V100-SXM3-32GB

NGC: TensorRT Container
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container

 

T4 Inference Performance

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
ResNet-50 1 1,161 images/sec 16.72 images/sec/watt 0.86 1x T4 Supermicro SYS-1029GQ-TRT 20.08-py3 INT8 Synthetic TensorRT 7.1.3 NVIDIA T4
2 1,760 images/sec 25.47 images/sec/watt 1.14 1x T4 Supermicro SYS-1029GQ-TRT 20.08-py3 INT8 Synthetic TensorRT 7.1.3 NVIDIA T4
8 3,899 images/sec 56.02 images/sec/watt 2.05 1x T4 Supermicro SYS-1029GQ-TRT 20.08-py3 INT8 Synthetic TensorRT 7.1.3 NVIDIA T4
33 4,831 images/sec 69 images/sec/watt 6.8 1x T4 Supermicro SYS-1029GQ-TRT 20.08-py3 INT8 Synthetic TensorRT 7.1.3 NVIDIA T4
128 5,305 images/sec 75.83 images/sec/watt 24.13 1x T4 Supermicro SYS-1029GQ-TRT 20.08-py3 INT8 Synthetic TensorRT 7.1.3 NVIDIA T4
ResNet-50v1.5 1 1,108 images/sec 15.77 images/sec/watt 0.9 1x T4 Supermicro SYS-1029GQ-TRT 20.08-py3 INT8 Synthetic TensorRT 7.1.3 NVIDIA T4
2 1,671 images/sec 24 images/sec/watt 1.2 1x T4 Supermicro SYS-1029GQ-TRT 20.08-py3 INT8 Synthetic TensorRT 7.1.3 NVIDIA T4
8 3,797 images/sec 54.5 images/sec/watt 2.11 1x T4 Supermicro SYS-1029GQ-TRT 20.08-py3 INT8 Synthetic TensorRT 7.1.3 NVIDIA T4
31 4,652 images/sec 67 images/sec/watt 6.7 1x T4 Supermicro SYS-1029GQ-TRT 20.08-py3 INT8 Synthetic TensorRT 7.1.3 NVIDIA T4
128 4,979 images/sec 71.19 images/sec/watt 25.71 1x T4 Supermicro SYS-1029GQ-TRT 20.08-py3 INT8 Synthetic TensorRT 7.1.3 NVIDIA T4
NCF 16384 21,196,631 samples/sec 430,872.65 samples/sec/watt 0 1x T4 Supermicro SYS-1029GQ-TRT 20.06-py3 Mixed MovieLens 20M PyTorch 1.6.0a0+9907a3e NVIDIA T4
BERT-BASE 1 521 sequences/sec - sequences/sec/watt 1.92 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed Sample Text TensorRT 7.0.0 NVIDIA T4
2 691 sequences/sec - sequences/sec/watt 2.9 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed Sample Text TensorRT 7.0.0 NVIDIA T4
8 820 sequences/sec 11.88 sequences/sec/watt 9.75 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed Sample Text TensorRT 7.0.0 NVIDIA T4
128 852 sequences/sec 12.27 sequences/sec/watt 150.31 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed Sample Text TensorRT 7.0.0 NVIDIA T4
BERT-LARGE 1 186 sequences/sec 2.97 sequences/sec/watt 5.37 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed Sample Text TensorRT 7.0.0 NVIDIA T4
2 225 sequences/sec 3.41 sequences/sec/watt 8.88 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed Sample Text TensorRT 7.0.0 NVIDIA T4
8 237 sequences/sec 3.51 sequences/sec/watt 33.75 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed Sample Text TensorRT 7.0.0 NVIDIA T4
128 265 sequences/sec 3.8 sequences/sec/watt 483 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed Sample Text TensorRT 7.0.0 NVIDIA T4
Mask R-CNN 1 11 images/sec 0.16 images/sec/watt - 1x T4 Supermicro SYS-1029GQ-TRT 20.08-py3 Mixed COCO 2014 PyTorch 1.7.0a0+8deb4fe NVIDIA T4
Tacotron2 1 1,478 total output mels/sec - total output mels/sec/watt 1.35 1x T4 Supermicro SYS-1029GQ-TRT 20.08-py3 FP32 LJSpeech 1.1 PyTorch 1.7.0a0+8deb4fe NVIDIA T4
4 5,815 total output mels/sec - total output mels/sec/watt 1.4 1x T4 Supermicro SYS-1029GQ-TRT 20.08-py3 FP32 LJSpeech 1.1 PyTorch 1.7.0a0+8deb4fe NVIDIA T4
WaveGlow 1 189,566 output samples/sec - output samples/sec/watt 1.21 1x T4 Supermicro SYS-1029GQ-TRT 20.08-py3 Mixed LJSpeech 1.1 PyTorch 1.7.0a0+8deb4fe NVIDIA T4

NGC: TensorRT Container
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container

 

Last updated: October 12th, 2020