NVIDIA Data Center Deep Learning Product Performance
Review the latest GPU acceleration factors of popular HPC applications.
Please refer to the Measuring Training and Inferencing Performance on NVIDIA AI Platforms Reviewer’s Guide for instructions on how to reproduce these performance claims.
Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. This is the best methodology for testing whether AI systems are ready to be deployed in the field, as the networks can then deliver meaningful results (for example, correctly performing image recognition on video streams). Read our blog on convergence for more details. Training that does not converge measures the hardware’s throughput on the specified AI network, but it is not representative of real-world applications.
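The distinction above can be sketched in a few lines: a convergence benchmark times training until a quality target is met, rather than timing a fixed number of iterations. The `train_epoch`/`evaluate` stand-ins below are toy placeholders for illustration, not NVIDIA's benchmark harness:

```python
import time

def train_to_convergence(train_epoch, evaluate, target_accuracy, max_epochs=100):
    """Run training until the model reaches the target accuracy.

    Returns (time_to_train_minutes, epochs_used). A run that never
    reaches the target says nothing about real-world readiness, so it
    raises instead of reporting a time.
    """
    start = time.perf_counter()
    for epoch in range(1, max_epochs + 1):
        train_epoch()
        if evaluate() >= target_accuracy:
            return (time.perf_counter() - start) / 60.0, epoch
    raise RuntimeError(f"did not reach {target_accuracy:.2%} in {max_epochs} epochs")

# Toy stand-ins: validation accuracy improves by 10 points per epoch.
accuracy = 0.0
def train_epoch():
    global accuracy
    accuracy += 0.1
def evaluate():
    return accuracy

# 0.759 mirrors the 75.9% ResNet-50 quality target used in the tables below.
minutes, epochs = train_to_convergence(train_epoch, evaluate, target_accuracy=0.759)
print(epochs)  # epochs needed to cross the quality target
```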
NVIDIA’s complete solution stack, from GPUs to libraries to containers on NVIDIA GPU Cloud (NGC), allows data scientists to quickly get up and running with deep learning. NVIDIA® A100 Tensor Core GPUs provide unprecedented acceleration at every scale, setting records in MLPerf, the AI industry’s leading benchmark, and a testament to our accelerated platform approach.
NVIDIA Performance on MLPerf 0.7 AI Benchmarks
BERT Time to Train on A100
PyTorch | Precision: Mixed | Dataset: Wikipedia 2020/01/01 | Convergence criteria - refer to MLPerf requirements
MLPerf Training Performance
NVIDIA A100 Performance on MLPerf 0.7 AI Benchmarks
Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|
MXNet | ResNet-50 v1.5 | 39.78 | 75.90% classification | 8x A100 | DGX-A100 | 0.7-18 | Mixed | ImageNet2012 | A100-SXM4-40GB |
| | 23.75 | 75.90% classification | 16x A100 | DGX-A100 | 0.7-21 | Mixed | ImageNet2012 | A100-SXM4-40GB |
| | 1.06 | 75.90% classification | 768x A100 | DGX-A100 | 0.7-32 | Mixed | ImageNet2012 | A100-SXM4-40GB |
| | 0.83 | 75.90% classification | 1536x A100 | DGX-A100 | 0.7-35 | Mixed | ImageNet2012 | A100-SXM4-40GB |
| | 0.76 | 75.90% classification | 1840x A100 | DGX-A100 | 0.7-37 | Mixed | ImageNet2012 | A100-SXM4-40GB |
| SSD | 2.25 | 23.0% mAP | 64x A100 | DGX-A100 | 0.7-25 | Mixed | COCO2017 | A100-SXM4-40GB |
| | 0.89 | 23.0% mAP | 512x A100 | DGX-A100 | 0.7-31 | Mixed | COCO2017 | A100-SXM4-40GB |
| | 0.82 | 23.0% mAP | 1024x A100 | DGX-A100 | 0.7-33 | Mixed | COCO2017 | A100-SXM4-40GB |
PyTorch | BERT | 49.01 | 0.712 Mask-LM accuracy | 8x A100 | DGX-A100 | 0.7-19 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-40GB |
| | 30.63 | 0.712 Mask-LM accuracy | 16x A100 | DGX-A100 | 0.7-22 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-40GB |
| | 3.36 | 0.712 Mask-LM accuracy | 256x A100 | DGX-A100 | 0.7-28 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-40GB |
| | 1.48 | 0.712 Mask-LM accuracy | 1024x A100 | DGX-A100 | 0.7-34 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-40GB |
| | 0.81 | 0.712 Mask-LM accuracy | 2048x A100 | DGX-A100 | 0.7-38 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-40GB |
| DLRM | 4.43 | 0.8025 AUC | 8x A100 | DGX-A100 | 0.7-19 | Mixed | Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) | A100-SXM4-40GB |
| GNMT | 7.81 | 24.0 Sacre BLEU | 8x A100 | DGX-A100 | 0.7-19 | Mixed | WMT16 English-German | A100-SXM4-40GB |
| | 4.94 | 24.0 Sacre BLEU | 16x A100 | DGX-A100 | 0.7-22 | Mixed | WMT16 English-German | A100-SXM4-40GB |
| | 0.98 | 24.0 Sacre BLEU | 256x A100 | DGX-A100 | 0.7-28 | Mixed | WMT16 English-German | A100-SXM4-40GB |
| | 0.71 | 24.0 Sacre BLEU | 1024x A100 | DGX-A100 | 0.7-34 | Mixed | WMT16 English-German | A100-SXM4-40GB |
| Mask R-CNN | 82.16 | 0.377 Box min AP and 0.339 Mask min AP | 8x A100 | DGX-A100 | 0.7-19 | Mixed | COCO2017 | A100-SXM4-40GB |
| | 44.21 | 0.377 Box min AP and 0.339 Mask min AP | 16x A100 | DGX-A100 | 0.7-22 | Mixed | COCO2017 | A100-SXM4-40GB |
| | 28.46 | 0.377 Box min AP and 0.339 Mask min AP | 32x A100 | DGX-A100 | 0.7-24 | Mixed | COCO2017 | A100-SXM4-40GB |
| | 10.46 | 0.377 Box min AP and 0.339 Mask min AP | 256x A100 | DGX-A100 | 0.7-28 | Mixed | COCO2017 | A100-SXM4-40GB |
| SSD | 10.21 | 23.0% mAP | 8x A100 | DGX-A100 | 0.7-19 | Mixed | COCO2017 | A100-SXM4-40GB |
| | 5.68 | 23.0% mAP | 16x A100 | DGX-A100 | 0.7-22 | Mixed | COCO2017 | A100-SXM4-40GB |
| Transformer | 7.84 | 25.00 BLEU | 8x A100 | DGX-A100 | 0.7-19 | Mixed | WMT17 English-German | A100-SXM4-40GB |
| | 4.35 | 25.00 BLEU | 16x A100 | DGX-A100 | 0.7-22 | Mixed | WMT17 English-German | A100-SXM4-40GB |
| | 1.8 | 25.00 BLEU | 80x A100 | DGX-A100 | 0.7-26 | Mixed | WMT17 English-German | A100-SXM4-40GB |
| | 1.02 | 25.00 BLEU | 160x A100 | DGX-A100 | 0.7-27 | Mixed | WMT17 English-German | A100-SXM4-40GB |
| | 0.62 | 25.00 BLEU | 480x A100 | DGX-A100 | 0.7-30 | Mixed | WMT17 English-German | A100-SXM4-40GB |
TensorFlow | MiniGo | 299.73 | 50% win rate vs. checkpoint | 8x A100 | DGX-A100 | 0.7-20 | Mixed | N/A | A100-SXM4-40GB |
| | 165.72 | 50% win rate vs. checkpoint | 16x A100 | DGX-A100 | 0.7-23 | Mixed | N/A | A100-SXM4-40GB |
| | 29.7 | 50% win rate vs. checkpoint | 256x A100 | DGX-A100 | 0.7-29 | Mixed | N/A | A100-SXM4-40GB |
| | 17.07 | 50% win rate vs. checkpoint | 1792x A100 | DGX-A100 | 0.7-36 | Mixed | N/A | A100-SXM4-40GB |
NVIDIA Merlin HugeCTR | DLRM | 3.33 | 0.8025 AUC | 8x A100 | DGX-A100 | 0.7-17 | Mixed | Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) | A100-SXM4-40GB |
Training Natural Language Processing
BERT Pre-Training Throughput
DGX-A100 server w/ 8x NVIDIA A100 on PyTorch | DGX-1 server w/ 8x NVIDIA V100 on PyTorch | (2/3) Phase 1 and (1/3) Phase 2 | Precision: FP16 for A100 and Mixed for V100 | Sequence Length for Phase 1 = 128 and Phase 2 = 512
NVIDIA A100 BERT Training Benchmarks
Framework | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|
PyTorch | BERT Pre-Training | 2,274 sequences/sec | 8x A100 | DGX-A100 | - | FP16 | - | Wikipedia+BookCorpus | A100 SXM4-40GB |
DGX-A100 server w/ 8x NVIDIA A100 on PyTorch | (2/3) Phase 1 and (1/3) Phase 2 | Sequence Length for Phase 1 = 128 and Phase 2 = 512
Converged Training Performance
Benchmarks are reproducible by following links to NGC scripts
A100 Training Performance
Framework | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
MXNet | ResNet-50 v1.5 | 40 | 75.9 Top 1 Accuracy | 22,008 images/sec | 8x A100 | DGX-A100 | 20.06-py3 | Mixed | 408 | ImageNet2012 | A100-SXM4-40GB |
PyTorch | Mask R-CNN | 191 | 0.34 AP Segm | 159 images/sec | 8x A100 | DGX-A100 | 20.10-py3 | TF32 | 8 | COCO 2014 | A100-SXM4-40GB |
| ResNeXt101 | 300 | 79.37 Top 1 Accuracy | 6,888 images/sec | 8x A100 | DGX-A100 | - | Mixed | 128 | Imagenet2012 | A100-SXM4-40GB |
| SE-ResNeXt101 | 420 | 79.95 Top 1 Accuracy | 4,758 images/sec | 8x A100 | DGX-A100 | - | Mixed | 128 | Imagenet2012 | A100-SXM4-40GB |
| SSD v1.1 | 43 | 0.25 mAP | 3,048 images/sec | 8x A100 | DGX-A100 | 20.10-py3 | Mixed | 128 | COCO 2017 | A100-SXM4-40GB |
| Tacotron2 | 123 | 0.6 Training Loss | 250,632 total output mels/sec | 8x A100 | DGX-A100 | 20.10-py3 | TF32 | 128 | LJSpeech 1.1 | A100-SXM4-40GB |
| WaveGlow | 420 | -5.68 Training Loss | 1,004,778 output samples/sec | 8x A100 | DGX-A100 | 20.10-py3 | Mixed | 10 | LJSpeech 1.1 | A100-SXM4-40GB |
| Transformer | 128 | 27.71 BLEU Score | 531,662 words/sec | 8x A100 | DGX-A100 | 20.07-py3 | Mixed | 10240 | wmt14-en-de | A100-SXM4-40GB |
| FastPitch | 112 | 0.21 Training Loss | 820,539 frames/sec | 8x A100 | DGX-A100 | 20.10-py3 | Mixed | 32 | LJSpeech 1.1 | A100-SXM4-40GB |
| GNMT V2 | 18 | 24.01 BLEU Score | 829,778 total tokens/sec | 8x A100 | DGX-A100 | 20.10-py3 | Mixed | 128 | wmt16-en-de | A100-SXM4-40GB |
| NCF | 0.42 | 0.96 Hit Rate at 10 | 138,413,232 samples/sec | 8x A100 | - | 20.10-py3 | Mixed | 131072 | MovieLens 20M | A100-SXM4-80GB |
| BERT-LARGE Pre-Training | 3,732 | Final Loss 1.34 | 2,284 sequences/sec | 8x A100 | DGX-A100 | 20.06-py3 | Mixed | - | Wikipedia+BookCorpus | A100-SXM4-40GB |
| BERT-LARGE Fine Tuning | 4 | 91.03 F1 | 851 sequences/sec | 8x A100 | - | 20.10-py3 | Mixed | 32 | SQuaD v1.1 | A100-SXM4-80GB |
TensorFlow | ResNet-50 V1.5 | 112 | 76.98 Top 1 Accuracy | 17,343 images/sec | 8x A100 | DGX-A100 | 20.09-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-40GB |
| SSD v1.2 | 45 | 0.28 mAP | 1,619 images/sec | 8x A100 | DGX-A100 | 20.09-py3 | Mixed | 32 | COCO 2017 | A100-SXM4-40GB |
| Mask R-CNN | 180 | 0.34 AP Segm | 149 samples/sec | 8x A100 | DGX-A100 | 20.10-py3 | Mixed | 4 | COCO 2014 | A100-SXM4-40GB |
| U-Net Industrial | 0.92 | 0.99 IoU Threshold 0.95 | 779 images/sec | 8x A100 | DGX-A100 | 20.10-py3 | Mixed | 2 | DAGM2007 | A100-SXM4-40GB |
| U-Net Medical | 3 | 0.89 DICE Score | 901 images/sec | 8x A100 | DGX-A100 | 20.10-py3 | Mixed | 8 | EM segmentation challenge | A100-SXM4-40GB |
| VAE-CF | 2 | 0.43 NDCG@100 | 1,326,534 users processed/sec | 8x A100 | DGX-A100 | 20.10-py3 | Mixed | 3072 | MovieLens 20M | A100-SXM4-40GB |
| GNMT V2 | 123 | 24.29 BLEU Score | 223,452 total tokens/sec | 8x A100 | DGX-A100 | 20.08-py3 | TF32 | 128 | wmt16-en-de | A100-SXM4-40GB |
| BERT-LARGE Pre-Training | 4,170 | Final Loss 1.56 | 2,045 sequences/sec | 8x A100 | DGX-A100 | 20.06-py3 | Mixed | - | Wikipedia+BookCorpus | A100-SXM4-40GB |
| BERT-LARGE Fine Tuning | 12 | 90.69 F1 | 755 sequences/sec | 8x A100 | DGX-A100 | 20.10-py3 | Mixed | 24 | SQuaD v1.1 | A100-SXM4-40GB |
| ResNeXt101 | 208 | 79.20 Top 1 Accuracy | 8,500 images/sec | 8x A100 | DGX-A100 | - | Mixed | 256 | Imagenet2012 | A100-SXM4-40GB |
| EfficientNet-B4 | 4,230.76 | 82.81 | 2,535 images/sec | 8x A100 | DGX-A100 | 20.08-py3 | Mixed | 160 | ImageNet2012 | A100-SXM4-80GB |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | A hyphen in the Server column indicates a pre-production server
BERT-Large Fine Tuning: Sequence Length = 384
BERT-Large Pre-Training (9/10 epochs) Phase 1 and (1/10 epochs) Phase 2: Sequence Length for Phase 1 = 128 and Phase 2 = 512 | Batch Size for Phase 1 = 65,536 and Phase 2 = 32,768
EfficientNet-B4: Mixup = 0.2 | Auto-Augmentation | cuDNN Version = 8.0.5.39 | NCCL Version = 2.7.8
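As a back-of-the-envelope illustration of the two-phase BERT pre-training schedule in the footnotes above (9/10 of epochs at sequence length 128, 1/10 at sequence length 512), the sketch below computes how the token budget divides between the phases. The per-epoch sample count is a made-up placeholder, not the actual Wikipedia+BookCorpus size:

```python
def phase_tokens(epoch_fraction, total_epochs, samples_per_epoch, seq_len):
    """Tokens processed in one pre-training phase: samples x sequence length."""
    return epoch_fraction * total_epochs * samples_per_epoch * seq_len

# Hypothetical dataset size, chosen only to make the ratio visible.
samples = 1_000_000
total_epochs = 10

p1 = phase_tokens(0.9, total_epochs, samples, seq_len=128)  # Phase 1: seq len 128
p2 = phase_tokens(0.1, total_epochs, samples, seq_len=512)  # Phase 2: seq len 512
print(p1 / p2)  # Phase 1's token budget relative to Phase 2's
```

The ratio (0.9 × 128) / (0.1 × 512) = 2.25 is independent of the placeholder sample count: most tokens are consumed in the cheap short-sequence phase.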
V100 Training Performance
Framework | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
MXNet | ResNet-50 v1.5 | 177 | 77.22 Top 1 Accuracy | 11,180 images/sec | 8x V100 | DGX-2 | 20.10-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB |
PyTorch | Mask R-CNN | 270 | 0.34 AP Segm | 109 images/sec | 8x V100 | DGX-2 | 20.10-py3 | Mixed | 8 | COCO 2014 | V100-SXM3-32GB |
| ResNeXt101 | 540 | 79.43 Top 1 Accuracy | 4,001 images/sec | 8x V100 | DGX-1 | - | Mixed | 128 | Imagenet2012 | V100-SXM2-16GB |
| SE-ResNeXt101 | 780 | 80.04 Top 1 Accuracy | 2,695 images/sec | 8x V100 | DGX-1 | - | Mixed | 128 | Imagenet2012 | V100-SXM2-16GB |
| SSD v1.1 | 71 | 0.19 mAP | 1,832 images/sec | 8x V100 | DGX-2 | 20.10-py3 | Mixed | 64 | COCO 2017 | V100-SXM3-32GB |
| Tacotron2 | 247 | 0.52 Training Loss | 127,321 total output mels/sec | 8x V100 | DGX-2 | 20.10-py3 | Mixed | 104 | LJSpeech 1.1 | V100-SXM3-32GB |
| Transformer | 463 | 27.52 BLEU Score | 214,856 words/sec | 8x V100 | DGX-2 | 20.08-py3 | Mixed | 5120 | wmt14-en-de | V100-SXM3-32GB |
| GNMT V2 | 31 | 24.03 BLEU Score | 479,039 total tokens/sec | 8x V100 | DGX-2 | 20.10-py3 | Mixed | 128 | wmt16-en-de | V100-SXM3-32GB |
| NCF | 0.57 | 0.96 Hit Rate at 10 | 96,982,525 samples/sec | 8x V100 | DGX-2 | 20.10-py3 | Mixed | 131072 | MovieLens 20M | V100-SXM3-32GB |
| BERT-LARGE Fine Tuning | 8 | 91.18 F1 | 354 sequences/sec | 8x V100 | DGX-2 | 20.10-py3 | Mixed | 10 | SQuaD v1.1 | V100-SXM3-32GB |
TensorFlow | ResNet-50 V1.5 | 193 | 76.83 Top 1 Accuracy | 10,036 images/sec | 8x V100 | DGX-2 | 20.09-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB |
| Mask R-CNN | 286 | 0.34 AP Segm | 91 samples/sec | 8x V100 | DGX-2 | 20.10-py3 | Mixed | 4 | COCO 2014 | V100-SXM3-32GB |
| U-Net Medical | 4 | 0.89 DICE Score | 447 images/sec | 8x V100 | DGX-2 | 20.10-py3 | Mixed | 8 | EM segmentation challenge | V100-SXM3-32GB |
| BERT-LARGE Fine Tuning | 20 | 90.87 F1 | 326 sequences/sec | 8x V100 | DGX-2 | 20.10-py3 | Mixed | 10 | SQuaD v1.1 | V100-SXM3-32GB |
| ResNeXt101 | 480 | 79.30 Top 1 Accuracy | 4,160 images/sec | 8x V100 | DGX-1 | - | Mixed | 128 | Imagenet2012 | V100-SXM2-16GB |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
BERT-Large Fine Tuning: Sequence Length = 384
T4 Training Performance
Framework | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
MXNet | ResNet-50 v1.5 | 501 | 77.16 Top 1 Accuracy | 3,894 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | Mixed | 192 | ImageNet2012 | NVIDIA T4 |
PyTorch | Mask R-CNN | 576 | 0.34 AP Segm | 42 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | Mixed | 4 | COCO 2014 | NVIDIA T4 |
| ResNeXt101 | 1,738 | 78.75 Top 1 Accuracy | 1,124 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | Mixed | 128 | Imagenet2012 | NVIDIA T4 |
| Tacotron2 | 284 | 0.52 Training Loss | 109,509 total output mels/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | Mixed | 104 | LJSpeech 1.1 | NVIDIA T4 |
| WaveGlow | 1,182 | -5.69 Training Loss | 350,957 output samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA T4 |
| Transformer | 807 | 27.68 BLEU Score | 83,760 words/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 20.07-py3 | Mixed | 5120 | wmt14-en-de | NVIDIA T4 |
| FastPitch | 319 | 0.21 Training Loss | 281,406 frames/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | Mixed | 32 | LJSpeech 1.1 | NVIDIA T4 |
| GNMT V2 | 90 | 24.27 BLEU Score | 160,009 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | Mixed | 128 | wmt16-en-de | NVIDIA T4 |
| NCF | 2 | 0.96 Hit Rate at 10 | 25,792,995 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | Mixed | 131072 | MovieLens 20M | NVIDIA T4 |
| BERT-LARGE Fine Tuning | 23 | 91.25 F1 | 127 sequences/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | Mixed | 10 | SQuaD v1.1 | NVIDIA T4 |
TensorFlow | ResNet-50 V1.5 | 639 | 77.08 Top 1 Accuracy | 3,010 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 20.08-py3 | Mixed | 256 | ImageNet2012 | NVIDIA T4 |
| SSD v1.2 | 112 | 0.29 mAP | 548 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | Mixed | 32 | COCO 2017 | NVIDIA T4 |
| Mask R-CNN | 489 | 0.34 AP Segm | 52 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | Mixed | 4 | COCO 2014 | NVIDIA T4 |
| U-Net Industrial | 2 | 0.99 IoU Threshold 0.95 | 286 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | Mixed | 2 | DAGM2007 | NVIDIA T4 |
| U-Net Medical | 13 | 0.9 DICE Score | 155 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | Mixed | 8 | EM segmentation challenge | NVIDIA T4 |
| VAE-CF | 2 | 0.43 NDCG@100 | 368,484 users processed/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | Mixed | 3072 | MovieLens 20M | NVIDIA T4 |
| GNMT V2 | 309 | 24.26 BLEU Score | 64,567 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 20.08-py3 | Mixed | 128 | wmt16-en-de | NVIDIA T4 |
| BERT-LARGE Fine Tuning | 65 | 91.13 F1 | 53 sequences/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | Mixed | 3 | SQuaD v1.1 | NVIDIA T4 |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
BERT-Large Fine Tuning: Sequence Length = 384
Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. This is the best methodology for testing AI systems, and it is typically done on multi-accelerator systems (see the ‘Training-Convergence’ tab or read our blog on convergence for more details) to shorten training-to-convergence times, especially for recurring monthly container builds.
Scenarios that are not typically used in real-world training, such as single-GPU throughput, are illustrated in the table below and provided for reference as an indication of the platform’s single-chip throughput.
NVIDIA’s complete solution stack, from hardware to software, allows data scientists to deliver unprecedented acceleration at every scale. Visit NVIDIA GPU Cloud (NGC) to pull containers and quickly get up and running with deep learning.
Single GPU Training Performance
Benchmarks are reproducible by following links to NGC scripts
A100 Training Performance
Framework | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|
MXNet | ResNet-50 v1.5 | 2,751 images/sec | 1x A100 | - | - | Mixed | 408 | ImageNet2012 | A100-SXM4-80GB |
PyTorch | Mask R-CNN | 23 images/sec | 1x A100 | - | 20.10-py3 | Mixed | 8 | COCO 2014 | A100-SXM4-80GB |
| ResNeXt101 | 908 images/sec | 1x A100 | DGX-A100 | - | Mixed | 128 | Imagenet2012 | A100-SXM4-40GB |
| SE-ResNeXt101 | 642 images/sec | 1x A100 | DGX-A100 | - | Mixed | 128 | Imagenet2012 | A100-SXM4-40GB |
| SSD v1.1 | 427 images/sec | 1x A100 | - | 20.10-py3 | Mixed | 128 | COCO 2017 | A100-SXM4-80GB |
| Tacotron2 | 31,640 total output mels/sec | 1x A100 | DGX-A100 | 20.10-py3 | TF32 | 128 | LJSpeech 1.1 | A100-SXM4-40GB |
| WaveGlow | 147,276 output samples/sec | 1x A100 | DGX-A100 | 20.10-py3 | Mixed | 10 | LJSpeech 1.1 | A100-SXM4-40GB |
| Transformer | 77,911 words/sec | 1x A100 | DGX-A100 | 20.07-py3 | Mixed | 10240 | wmt14-en-de | A100-SXM4-40GB |
| FastPitch | 174,736 frames/sec | 1x A100 | - | 20.10-py3 | Mixed | 128 | LJSpeech 1.1 | A100-SXM4-80GB |
| GNMT V2 | 140,793 total tokens/sec | 1x A100 | - | 20.10-py3 | Mixed | 128 | wmt16-en-de | A100-SXM4-80GB |
| NCF | 35,372,898 samples/sec | 1x A100 | - | 20.10-py3 | Mixed | 1048576 | MovieLens 20M | A100-SXM4-80GB |
| BERT-LARGE Fine Tuning | 115 sequences/sec | 1x A100 | - | 20.10-py3 | Mixed | 32 | SQuaD v1.1 | A100-SXM4-80GB |
TensorFlow | SSD v1.2 | 329 images/sec | 1x A100 | - | 20.10-py3 | Mixed | 32 | COCO 2017 | A100-SXM4-80GB |
| Mask R-CNN | 20 samples/sec | 1x A100 | DGX-A100 | 20.10-py3 | Mixed | 4 | COCO 2014 | A100-SXM4-40GB |
| U-Net Industrial | 307 images/sec | 1x A100 | DGX-A100 | 20.10-py3 | Mixed | 16 | DAGM2007 | A100-SXM4-40GB |
| U-Net Medical | 138 images/sec | 1x A100 | - | 20.10-py3 | Mixed | 8 | EM segmentation challenge | A100-SXM4-80GB |
| VAE-CF | 369,602 users processed/sec | 1x A100 | DGX-A100 | 20.10-py3 | Mixed | 24576 | MovieLens 20M | A100-SXM4-40GB |
| GNMT V2 | 34,549 total tokens/sec | 1x A100 | - | 20.10-py3 | TF32 | 128 | wmt16-en-de | A100-SXM4-80GB |
| BERT-LARGE Fine Tuning | 107 sequences/sec | 1x A100 | DGX-A100 | 20.10-py3 | Mixed | 24 | SQuaD v1.1 | A100-SXM4-40GB |
| NCF | 40,020,400 samples/sec | 1x A100 | DGX-A100 | 20.10-py3 | Mixed | 1048576 | MovieLens 20M | A100-SXM4-40GB |
| ResNeXt101 | 1,132 images/sec | 1x A100 | DGX-A100 | - | Mixed | 256 | Imagenet2012 | A100-SXM4-40GB |
| EfficientNet-B4 | 332 images/sec | 1x A100 | DGX-A100 | - | Mixed | 160 | Imagenet2012 | A100-SXM4-80GB |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | A hyphen in the Server column indicates a pre-production server
BERT-Large Fine Tuning: Sequence Length = 384
EfficientNet-B4: Basic Augmentation | cuDNN Version = 8.0.5.32 | NCCL Version = 2.7.8 | Installation Source = NGC
V100 Training Performance
Framework | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|
PyTorch | ResNeXt101 | 543 images/sec | 1x V100 | DGX-2 | 20.11-py3 | Mixed | 128 | Imagenet2012 | V100-SXM3-32GB |
| SE-ResNeXt101 | 399 images/sec | 1x V100 | DGX-2 | 20.11-py3 | Mixed | 128 | Imagenet2012 | V100-SXM3-32GB |
| SSD v1.1 | 241 images/sec | 1x V100 | DGX-2 | 20.10-py3 | Mixed | 64 | COCO 2017 | V100-SXM3-32GB |
| Tacotron2 | 18,281 total output mels/sec | 1x V100 | DGX-2 | 20.08-py3 | Mixed | 104 | LJSpeech 1.1 | V100-SXM3-32GB |
| Transformer | 33,652 words/sec | 1x V100 | DGX-2 | 20.10-py3 | Mixed | 5120 | wmt14-en-de | V100-SXM3-32GB |
| FastPitch | 106,930 frames/sec | 1x V100 | DGX-2 | 20.10-py3 | Mixed | 64 | LJSpeech 1.1 | V100-SXM3-32GB |
| GNMT V2 | 81,360 total tokens/sec | 1x V100 | DGX-2 | 20.10-py3 | Mixed | 128 | wmt16-en-de | V100-SXM3-32GB |
| NCF | 21,994,770 samples/sec | 1x V100 | DGX-2 | 20.10-py3 | Mixed | 1048576 | MovieLens 20M | V100-SXM3-32GB |
| BERT-LARGE Fine Tuning | 50 sequences/sec | 1x V100 | DGX-2 | 20.10-py3 | Mixed | 10 | SQuaD v1.1 | V100-SXM3-32GB |
TensorFlow | U-Net Industrial | 109 images/sec | 1x V100 | DGX-2 | 20.10-py3 | Mixed | 16 | DAGM2007 | V100-SXM3-32GB |
| U-Net Medical | 64 images/sec | 1x V100 | DGX-2 | 20.10-py3 | Mixed | 8 | EM segmentation challenge | V100-SXM3-32GB |
| VAE-CF | 222,299 users processed/sec | 1x V100 | DGX-2 | 20.10-py3 | Mixed | 24576 | MovieLens 20M | V100-SXM3-32GB |
| NCF | 25,694,022 samples/sec | 1x V100 | DGX-2 | 20.08-py3 | Mixed | 1048576 | MovieLens 20M | V100-SXM3-32GB |
| BERT-LARGE Fine Tuning | 48 sequences/sec | 1x V100 | DGX-2 | 20.10-py3 | Mixed | 10 | SQuaD v1.1 | V100-SXM3-32GB |
| ResNeXt101 | 610 images/sec | 1x V100 | DGX-2 | 20.11-py3 | Mixed | 128 | Imagenet2012 | V100-SXM3-32GB |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
BERT-Large Fine Tuning: Sequence Length = 384
T4 Training Performance
Framework | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|
MXNet | ResNet-50 v1.5 | 483 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 20.10-py3 | Mixed | 64 | ImageNet2012 | NVIDIA T4 |
PyTorch | Mask R-CNN | 8 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 20.10-py3 | Mixed | 4 | COCO 2014 | NVIDIA T4 |
| ResNeXt101 | 208 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 20.11-py3 | Mixed | 128 | Imagenet2012 | NVIDIA T4 |
| SE-ResNeXt101 | 157 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 20.11-py3 | Mixed | 128 | Imagenet2012 | NVIDIA T4 |
| SSD v1.1 | 85 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 20.10-py3 | Mixed | 64 | COCO 2017 | NVIDIA T4 |
| Tacotron2 | 15,022 total output mels/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 20.10-py3 | Mixed | 104 | LJSpeech 1.1 | NVIDIA T4 |
| WaveGlow | 50,938 output samples/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 20.10-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA T4 |
| Transformer | 12,869 words/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 20.07-py3 | Mixed | 5120 | wmt14-en-de | NVIDIA T4 |
| FastPitch | 41,685 frames/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 20.10-py3 | Mixed | 64 | LJSpeech 1.1 | NVIDIA T4 |
| GNMT V2 | 31,430 total tokens/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 20.10-py3 | Mixed | 128 | wmt16-en-de | NVIDIA T4 |
| NCF | 8,000,204 samples/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 20.10-py3 | Mixed | 1048576 | MovieLens 20M | NVIDIA T4 |
| BERT-LARGE Fine Tuning | 18 sequences/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 20.10-py3 | Mixed | 10 | SQuaD v1.1 | NVIDIA T4 |
TensorFlow | ResNet-50 V1.5 | 407 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 20.10-py3 | Mixed | 256 | ImageNet2012 | NVIDIA T4 |
| SSD v1.2 | 97 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 20.10-py3 | Mixed | 32 | COCO 2017 | NVIDIA T4 |
| U-Net Industrial | 41 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 20.10-py3 | Mixed | 16 | DAGM2007 | NVIDIA T4 |
| U-Net Medical | 21 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 20.10-py3 | Mixed | 8 | EM segmentation challenge | NVIDIA T4 |
| VAE-CF | 81,201 users processed/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 20.06-py3 | Mixed | 24576 | MovieLens 20M | NVIDIA T4 |
| GNMT V2 | 9,810 total tokens/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 20.10-py3 | Mixed | 128 | wmt16-en-de | NVIDIA T4 |
| NCF | 10,371,809 samples/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 20.03-py3 | Mixed | 1048576 | MovieLens 20M | NVIDIA T4 |
| BERT-LARGE Fine Tuning | 13 sequences/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 20.10-py3 | Mixed | 3 | SQuaD v1.1 | NVIDIA T4 |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
BERT-Large Fine Tuning: Sequence Length = 384
Real-world AI inferencing demands high throughput and low latency with maximum efficiency across use cases. An industry-leading solution lets customers quickly deploy AI models into real-world production with the highest performance from data centers to the edge.
NVIDIA landed top performance spots on all MLPerf Inference 0.7 tests, the AI industry’s leading benchmark. NVIDIA® TensorRT™ running on NVIDIA Tensor Core GPUs enables the most efficient deep learning inference performance across multiple application areas and models. This versatility gives data scientists wide latitude to create the optimal low-latency solution. Visit NVIDIA GPU Cloud (NGC) to download any of these containers and immediately race into production. The inference whitepaper provides an overview of inference platforms.
Measuring inference performance involves balancing many variables. PLASTER is an acronym that describes the key elements for measuring deep learning performance. Each letter identifies a factor (Programmability, Latency, Accuracy, Size of Model, Throughput, Energy Efficiency, Rate of Learning) that must be considered to arrive at the right set of tradeoffs and to produce a successful deep learning implementation. Refer to the PLASTER whitepaper for more details.
MLPerf Inference v0.7 Performance Benchmarks
Offline Scenario
Network | Throughput | GPU | Server | Dataset | GPU Version |
---|---|---|---|---|---|
ResNet-50 v1.5 | 298,647 samples/sec | 8x A100 | DGX-A100 | ImageNet | A100-SXM4-40GB |
| 48,898 samples/sec | 8x T4 | Supermicro 4029GP-TRT-OTO-28 | ImageNet | NVIDIA T4 |
SSD ResNet-34 | 7,788 samples/sec | 8x A100 | DGX-A100 | COCO | A100-SXM4-40GB |
| 1,112 samples/sec | 8x T4 | Supermicro 4029GP-TRT-OTO-28 | COCO | NVIDIA T4 |
3D-UNet | 328 samples/sec | 8x A100 | DGX-A100 | BraTS 2019 | A100-SXM4-40GB |
| 58 samples/sec | 8x T4 | Supermicro 4029GP-TRT-OTO-28 | BraTS 2019 | NVIDIA T4 |
RNN-T | 82,401 samples/sec | 8x A100 | DGX-A100 | LibriSpeech | A100-SXM4-40GB |
| 11,963 samples/sec | 8x T4 | Supermicro 4029GP-TRT-OTO-28 | LibriSpeech | NVIDIA T4 |
BERT | 26,625 samples/sec | 8x A100 | DGX-A100 | SQuAD v1.1 | A100-SXM4-40GB |
| 3,495 samples/sec | 8x T4 | Supermicro 4029GP-TRT-OTO-28 | SQuAD v1.1 | NVIDIA T4 |
DLRM | 2,113,510 samples/sec | 8x A100 | DGX-A100 | Criteo 1TB Click Logs | A100-SXM4-40GB |
| 272,416 samples/sec | 8x T4 | Supermicro 4029GP-TRT-OTO-28 | Criteo 1TB Click Logs | NVIDIA T4 |
MLPerf v0.7 Inference Closed: ResNet-50 v1.5, SSD ResNet-34, 3D U-Net 99.9% accuracy target, RNN-T, BERT 99% accuracy target, DLRM 99.9% accuracy target: 0.7-111, 0.7-113. MLPerf name and logo are trademarks. See www.mlperf.org for more information.
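Since the Offline results above are whole-server (8-GPU) numbers, a rough per-GPU figure and the A100-to-T4 ratio can be derived by simple division. This sketch assumes near-linear scaling within a server, which is an approximation, not a measured result:

```python
def per_gpu(throughput_8gpu, gpus=8):
    """Approximate per-GPU rate from an 8-GPU Offline result
    (assumes near-linear scaling within one server)."""
    return throughput_8gpu / gpus

# ResNet-50 v1.5 Offline results from the table above.
a100 = per_gpu(298_647)  # approx. samples/sec per A100
t4 = per_gpu(48_898)     # approx. samples/sec per T4
speedup = a100 / t4
print(round(speedup, 1))  # A100-over-T4 ratio on this benchmark
```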
Inference Performance of NVIDIA A100, V100 and T4
Benchmarks are reproducible by following links to NGC scripts
Inference Natural Language Processing
BERT Inference Throughput
DGX-A100 server w/ 1x NVIDIA A100 with 7 MIG instances of 1g.5gb | Batch Size = 94 | Precision: INT8 | Sequence Length = 128
DGX-1 server w/ 1x NVIDIA V100 | TensorRT 7.1 | Batch Size = 256 | Precision: Mixed | Sequence Length = 128
NVIDIA A100 BERT Inference Benchmarks
Network | Network Type | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BERT-Large with Sparsity | Attention | 94 | 6,188 sequences/sec | - | - | 1x A100 | DGX-A100 | - | INT8 | SQuaD v1.1 | - | A100 SXM4-40GB |
A100 with 7 MIG instances of 1g.5gb | Sequence length=128 | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container
Inference Image Classification on CNNs with TensorRT
ResNet-50 v1.5 Throughput
Pre-production server: Platinum 8168 @2.7GHz w/ 1x NVIDIA A100-SXM4-80GB | TensorRT 7.2.1 | Batch Size = 128 | 20.10-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 7.2.1 | Batch Size = 128 | 20.10-py3 | Precision: Mixed | Dataset: Synthetic
Supermicro SYS-4029GP-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 7.2.1 | Batch Size = 128 | 20.10-py3 | Precision: INT8 | Dataset: Synthetic
ResNet-50 v1.5 Latency
Pre-production server: Platinum 8168 @2.7GHz w/ 1x NVIDIA A100-SXM4-80GB | TensorRT 7.2.1 | Batch Size = 1 | 20.10-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 7.2.1 | Batch Size = 1 | 20.10-py3 | Precision: Mixed | Dataset: Synthetic
Supermicro SYS-4029GP-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 7.2.1 | Batch Size = 1 | 20.10-py3 | Precision: INT8 | Dataset: Synthetic
ResNet-50 v1.5 Power Efficiency
Pre-production server: Platinum 8168 @2.7GHz w/ 1x NVIDIA A100-SXM4-80GB | TensorRT 7.2.1 | Batch Size = 128 | 20.10-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 7.2.1 | Batch Size = 128 | 20.10-py3 | Precision: Mixed | Dataset: Synthetic
Supermicro SYS-4029GP-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 7.2.1 | Batch Size = 128 | 20.10-py3 | Precision: INT8 | Dataset: Synthetic
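The power-efficiency charts plot throughput divided by measured board power. A minimal sketch of that metric, using the batch-1 V100 ResNet-50 throughput from the tables below and an assumed board power value (the measured wattage is not listed on this page, so the number here is illustrative only):

```python
def perf_per_watt(images_per_sec, board_power_watts):
    """Energy efficiency as reported in the tables: throughput / board power."""
    return images_per_sec / board_power_watts

# 1,241 images/sec is the batch-1 V100 ResNet-50 result below;
# 160 W is a hypothetical measured board power, chosen for illustration.
eff = perf_per_watt(1_241, 160.0)
print(round(eff, 1))  # images/sec/watt
```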
A100 Inference Performance
Network | Batch Size | 1/7 MIG Throughput | 7 MIG Throughput | Full Chip Throughput | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 1 | - | - | 2,181 images/sec | 0.46 | 1x A100 | - | 20.10-py3 | INT8 | Synthetic | TensorRT 7.2.1 | A100-SXM4-80GB |
| 2 | - | - | 3,971 images/sec | 0.50 | 1x A100 | - | 20.10-py3 | INT8 | Synthetic | TensorRT 7.2.1 | A100-SXM4-80GB |
| 8 | - | - | 11,186 images/sec | 0.72 | 1x A100 | - | 20.10-py3 | INT8 | Synthetic | TensorRT 7.2.1 | A100-SXM4-80GB |
| 128 | - | - | 28,463 images/sec | 4.50 | 1x A100 | - | 20.10-py3 | INT8 | Synthetic | TensorRT 7.2.1 | A100-SXM4-80GB |
| 211 | - | - | 30,726 images/sec | 6.87 | 1x A100 | - | 20.10-py3 | INT8 | Synthetic | TensorRT 7.2.1 | A100-SXM4-80GB |
ResNet-50v1.5 | 1 | - | - | 2,116 images/sec | 0.47 | 1x A100 | - | 20.10-py3 | INT8 | Synthetic | TensorRT 7.2.1 | A100-SXM4-80GB |
| 2 | - | - | 3,973 images/sec | 0.50 | 1x A100 | - | 20.10-py3 | INT8 | Synthetic | TensorRT 7.2.1 | A100-SXM4-80GB |
| 8 | - | - | 10,940 images/sec | 0.73 | 1x A100 | - | 20.10-py3 | INT8 | Synthetic | TensorRT 7.2.1 | A100-SXM4-80GB |
| 128 | - | - | 27,443 images/sec | 4.66 | 1x A100 | - | 20.10-py3 | INT8 | Synthetic | TensorRT 7.2.1 | A100-SXM4-80GB |
| 206 | - | - | 29,550 images/sec | 6.97 | 1x A100 | - | 20.10-py3 | INT8 | Synthetic | TensorRT 7.2.1 | A100-SXM4-80GB |
| 236 | - | - | 34,249 images/sec | 6.89 | 1x A100 | - | - | INT8 | Synthetic | TensorRT 7.2 | A100-SXM4-40GB |
ResNeXt101 | 32 | - | - | 7,674 images/sec | 4.17 | 1x A100 | - | - | INT8 | Synthetic | TensorRT 7.2.2 | A100-SXM4-40GB |
EfficientNet-B0 | 128 | - | - | 22,346 images/sec | 5.73 | 1x A100 | - | - | INT8 | Synthetic | TensorRT 7.2.1.6 | A100-SXM4-40GB |
BERT-BASE | 1 | 590 sequences/sec | - | 1,341 sequences/sec | 0.75 | 1x A100 | DGX-A100 | - | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM4-40GB |
| 2 | 888 sequences/sec | - | 2,416 sequences/sec | 0.83 | 1x A100 | DGX-A100 | - | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM4-40GB |
| 8 | 1,455 sequences/sec | - | 6,830 sequences/sec | 1.17 | 1x A100 | DGX-A100 | - | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM4-40GB |
| 128 | 2,101 sequences/sec | - | 13,697 sequences/sec | 9.35 | 1x A100 | DGX-A100 | - | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM4-40GB |
| 256 | 2,142 sequences/sec | - | 14,490 sequences/sec | 17.67 | 1x A100 | DGX-A100 | - | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM4-40GB |
BERT-LARGE | 1 | 241 sequences/sec | - | 585 sequences/sec | 1.71 | 1x A100 | DGX-A100 | - | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM4-40GB |
2 | 312 sequences/sec | - | 1,068 sequences/sec | 1.87 | 1x A100 | DGX-A100 | - | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM4-40GB | |
8 | 531 sequences/sec | - | 2,152 sequences/sec | 3.72 | 1x A100 | DGX-A100 | - | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM4-40GB | |
12 | - | - | 2,804 sequences/sec | 4.3 | 1x A100 | DGX-A100 | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM4-40GB | |
128 | 657 sequences/sec | - | 4,481 sequences/sec | 28.57 | 1x A100 | DGX-A100 | - | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM4-40GB | |
256 | 681 sequences/sec | 4,741 sequences/sec | 4,679 sequences/sec | 54.71 | 1x A100 | - | - | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM4-80GB |
Containers marked with a hyphen indicate a pre-release container | Servers marked with a hyphen indicate a pre-production server
BERT-Large: Sequence Length = 128
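The throughput and latency columns in these tables are consistent with each other: for a given row, throughput is approximately batch size divided by per-batch latency. A minimal sanity check, using three A100 ResNet-50 rows copied from the table above (small discrepancies are rounding in the published figures):

```python
# Check that reported throughput ≈ batch_size / latency.
# Rows are (batch_size, reported_images_per_sec, latency_ms),
# taken from the A100 ResNet-50 INT8 rows above.
rows = [
    (8, 11186, 0.72),
    (128, 28463, 4.50),
    (211, 30726, 6.87),
]

for batch, reported, latency_ms in rows:
    implied = batch / (latency_ms / 1000.0)  # images/sec
    rel_err = abs(implied - reported) / reported
    # The published numbers agree to within a few percent of rounding slack.
    print(f"batch={batch}: implied {implied:,.0f} img/s "
          f"vs reported {reported:,} ({rel_err:.1%} off)")
```

This is why the largest-batch rows show the highest throughput but also the highest latency: batching amortizes per-inference overhead at the cost of per-request response time.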
V100 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 1 | 1,241 images/sec | 7.8 images/sec/watt | 0.81 | 1x V100 | DGX-2 | 20.10-py3 | INT8 | Synthetic | TensorRT 7.2.1 | V100-SXM3-32GB |
2 | 1,995 images/sec | 11 images/sec/watt | 1 | 1x V100 | DGX-2 | 20.10-py3 | Mixed | Synthetic | TensorRT 7.2.1 | V100-SXM3-32GB | |
8 | 4,323 images/sec | 17 images/sec/watt | 1.9 | 1x V100 | DGX-2 | 20.10-py3 | Mixed | Synthetic | TensorRT 7.2.1 | V100-SXM3-32GB | |
52 | 7,833 images/sec | - | 6.6 | 1x V100 | DGX-2 | 20.10-py3 | Mixed | Synthetic | TensorRT 7.2.1 | V100-SXM3-32GB | |
128 | 8,115 images/sec | 24 images/sec/watt | 16 | 1x V100 | DGX-2 | 20.10-py3 | Mixed | Synthetic | TensorRT 7.2.1 | V100-SXM3-32GB | |
ResNet-50v1.5 | 1 | 1,134 images/sec | 7.3 images/sec/watt | 0.88 | 1x V100 | DGX-2 | 20.10-py3 | Mixed | Synthetic | TensorRT 7.2.1 | V100-SXM3-32GB |
2 | 1,985 images/sec | 11 images/sec/watt | 1 | 1x V100 | DGX-2 | 20.10-py3 | Mixed | Synthetic | TensorRT 7.2.1 | V100-SXM3-32GB | |
8 | 4,183 images/sec | 16 images/sec/watt | 1.9 | 1x V100 | DGX-2 | 20.10-py3 | Mixed | Synthetic | TensorRT 7.2.1 | V100-SXM3-32GB | |
52 | 7,540 images/sec | - | 6.9 | 1x V100 | DGX-2 | 20.10-py3 | Mixed | Synthetic | TensorRT 7.2.1 | V100-SXM3-32GB | |
128 | 7,797 images/sec | 23 images/sec/watt | 16 | 1x V100 | DGX-2 | 20.10-py3 | Mixed | Synthetic | TensorRT 7.2.1 | V100-SXM3-32GB | |
BERT-BASE | 1 | 817 sequences/sec | 4.3 sequences/sec/watt | 1.2 | 1x V100 | DGX-2 | 20.11-py3 | INT8 | Sample Text | TensorRT 7.2.1 | V100-SXM3-32GB |
2 | 1,307 sequences/sec | - | 1.5 | 1x V100 | DGX-2 | 20.03-py3 | Mixed | Sample Text | TensorRT 7.0.0 | V100-SXM3-32GB | |
8 | 2,389 sequences/sec | - | 3.4 | 1x V100 | DGX-2 | 20.03-py3 | Mixed | Sample Text | TensorRT 7.0.0 | V100-SXM3-32GB | |
26 | 3,002 sequences/sec | 13.36 sequences/sec/watt | 8.66 | 1x V100 | DGX-2 | 20.03-py3 | Mixed | Sample Text | TensorRT 7.0.0 | V100-SXM3-32GB | |
128 | 3,194 sequences/sec | 11 sequences/sec/watt | 40 | 1x V100 | DGX-2 | 20.11-py3 | INT8 | Sample Text | TensorRT 7.2.1 | V100-SXM3-32GB | |
BERT-LARGE | 1 | 310 sequences/sec | 1.6 sequences/sec/watt | 3.2 | 1x V100 | DGX-2 | 20.11-py3 | INT8 | Sample Text | TensorRT 7.2.1 | V100-SXM3-32GB |
2 | 507 sequences/sec | 3.68 sequences/sec/watt | 3.95 | 1x V100 | DGX-2 | 20.03-py3 | Mixed | Sample Text | TensorRT 7.0.0 | V100-SXM3-32GB | |
8 | 792 sequences/sec | 3 sequences/sec/watt | 10 | 1x V100 | DGX-2 | 20.11-py3 | INT8 | Sample Text | TensorRT 7.2.1 | V100-SXM3-32GB | |
128 | 1,043 sequences/sec | 3.1 sequences/sec/watt | 123 | 1x V100 | DGX-2H | 20.03-py3 | Mixed | Sample Text | TensorRT 7.0.0 | V100-SXM3-32GB-H | |
WaveGlow | 1 | 614,788 output samples/sec | - | 0.37 | 1x V100 | DGX-2 | 20.10-py3 | Mixed | LJSpeech 1.1 | PyTorch 1.7.0a0+7036e91 | V100-SXM3-32GB |
NGC: TensorRT Container
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Containers marked with a hyphen indicate a pre-release container
T4 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 1 | 1,408 images/sec | 20 images/sec/watt | 0.71 | 1x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | INT8 | Synthetic | TensorRT 7.2.1 | NVIDIA T4 |
2 | 1,987 images/sec | 29 images/sec/watt | 1 | 1x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | INT8 | Synthetic | TensorRT 7.2.1 | NVIDIA T4 | |
8 | 3,884 images/sec | 56 images/sec/watt | 2.1 | 1x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | INT8 | Synthetic | TensorRT 7.2.1 | NVIDIA T4 | |
32 | 4,914 images/sec | - | 6.5 | 1x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | INT8 | Synthetic | TensorRT 7.2.1 | NVIDIA T4 | |
128 | 5,271 images/sec | 75 images/sec/watt | 24 | 1x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | INT8 | Synthetic | TensorRT 7.2.1 | NVIDIA T4 | |
ResNet-50v1.5 | 1 | 1,382 images/sec | 20 images/sec/watt | 0.72 | 1x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | INT8 | Synthetic | TensorRT 7.2.1 | NVIDIA T4 |
2 | 2,186 images/sec | 31 images/sec/watt | 0.92 | 1x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | INT8 | Synthetic | TensorRT 7.2.1 | NVIDIA T4 | |
8 | 3,785 images/sec | 54 images/sec/watt | 2.1 | 1x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | INT8 | Synthetic | TensorRT 7.2.1 | NVIDIA T4 | |
30 | 4,585 images/sec | - | 6.5 | 1x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | INT8 | Synthetic | TensorRT 7.2.1 | NVIDIA T4 | |
128 | 4,846 images/sec | 69 images/sec/watt | 26 | 1x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | INT8 | Synthetic | TensorRT 7.2.1 | NVIDIA T4 | |
BERT-BASE | 1 | 725 sequences/sec | 11 sequences/sec/watt | 1.4 | 1x T4 | Supermicro SYS-1029GQ-TRT | 20.11-py3 | INT8 | Sample Text | TensorRT 7.2.1 | NVIDIA T4 |
2 | 1,079 sequences/sec | 17 sequences/sec/watt | 1.9 | 1x T4 | Supermicro SYS-1029GQ-TRT | 20.11-py3 | INT8 | Sample Text | TensorRT 7.2.1 | NVIDIA T4 |
8 | 1,720 sequences/sec | 28 sequences/sec/watt | 4.7 | 1x T4 | Supermicro SYS-1029GQ-TRT | 20.11-py3 | INT8 | Sample Text | TensorRT 7.2.1 | NVIDIA T4 | |
128 | 1,818 sequences/sec | 28 sequences/sec/watt | 70 | 1x T4 | Supermicro SYS-1029GQ-TRT | 20.11-py3 | INT8 | Sample Text | TensorRT 7.2.1 | NVIDIA T4 | |
BERT-LARGE | 1 | 261 sequences/sec | 4.2 sequences/sec/watt | 3.8 | 1x T4 | Supermicro SYS-1029GQ-TRT | 20.11-py3 | INT8 | Sample Text | TensorRT 7.2.1 | NVIDIA T4 |
2 | 390 sequences/sec | 6.2 sequences/sec/watt | 5.1 | 1x T4 | Supermicro SYS-1029GQ-TRT | 20.11-py3 | INT8 | Sample Text | TensorRT 7.2.1 | NVIDIA T4 | |
8 | 555 sequences/sec | 8.9 sequences/sec/watt | 14 | 1x T4 | Supermicro SYS-1029GQ-TRT | 20.11-py3 | INT8 | Sample Text | TensorRT 7.2.1 | NVIDIA T4 | |
128 | 561 sequences/sec | 8.3 sequences/sec/watt | 228 | 1x T4 | Supermicro SYS-1029GQ-TRT | 20.11-py3 | INT8 | Sample Text | TensorRT 7.2.1 | NVIDIA T4 | |
WaveGlow | 1 | 187,595 output samples/sec | - | 1.2 | 1x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | Mixed | LJSpeech 1.1 | PyTorch 1.7.0a0+7036e91 | NVIDIA T4 |
NGC: TensorRT Container
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Containers marked with a hyphen indicate a pre-release container
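As the notes state, the efficiency column is throughput per watt of measured board power, so dividing throughput by efficiency recovers the board power draw for that run. A small sketch using the T4 ResNet-50 batch-128 row above (the result lands near the T4's 70 W board power limit, as expected for a fully loaded card):

```python
# Efficiency = throughput / board power, so:
#   implied board power = throughput / efficiency.
# Values from the T4 ResNet-50 INT8 table above (batch 128).
throughput = 5271   # images/sec
efficiency = 75     # images/sec/watt
implied_power_w = throughput / efficiency
print(f"implied board power: {implied_power_w:.0f} W")  # ~70 W
```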
Last updated: February 2nd, 2021