Review the latest GPU acceleration factors of popular HPC applications.

Please refer to the Measuring Training and Inferencing Performance on NVIDIA AI Platforms Reviewer's Guide for instructions on how to reproduce these performance claims.


Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. This is the best methodology for testing whether AI systems are ready to be deployed in the field, as converged networks can deliver meaningful results (for example, correctly performing image recognition on video streams). Read our blog on convergence for more details. Training that does not converge measures the hardware's throughput on the specified AI network, but it is not representative of real-world applications.
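As a rough sketch of how throughput relates to time to convergence: time to train is the total number of samples the network must process before reaching its quality target, divided by sustained throughput. The numbers below are hypothetical, chosen only for illustration, and the estimate ignores evaluation and checkpoint overhead:

```python
def time_to_train_minutes(epochs_to_converge, samples_per_epoch,
                          throughput_samples_per_sec):
    """Estimate wall-clock training time from sustained throughput.

    Assumes throughput is constant and ignores evaluation/checkpoint
    overhead, so this is a lower bound on real time to convergence.
    """
    total_samples = epochs_to_converge * samples_per_epoch
    return total_samples / throughput_samples_per_sec / 60.0

# Hypothetical example: 50 epochs over a 1.28M-image dataset
# at a sustained 22,000 images/sec.
minutes = time_to_train_minutes(50, 1_281_167, 22_000)
print(f"{minutes:.0f} minutes to convergence")
```

This also shows why raw throughput alone is not enough: a faster chip that needs more epochs (for example, because larger batches converge more slowly) can still lose on time to train.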

NVIDIA's complete solution stack, from GPUs to libraries and containers on NVIDIA GPU Cloud (NGC), allows data scientists to quickly get up and running with deep learning. NVIDIA® A100 Tensor Core GPUs provide unprecedented acceleration at every scale, setting records in MLPerf, the AI industry's leading benchmark, and serving as a testament to our accelerated platform approach.

NVIDIA Performance on MLPerf 0.7 AI Benchmarks

BERT Time to Train on A100

PyTorch | Precision: Mixed | Dataset: Wikipedia 2020/01/01 | Convergence criteria - refer to MLPerf requirements

MLPerf Training Performance

NVIDIA A100 Performance on MLPerf 0.7 AI Benchmarks

Framework Network Time to Train (mins) MLPerf Quality Target GPU Server MLPerf-ID Precision Dataset GPU Version
MXNet ResNet-50 v1.5 39.78 75.90% classification 8x A100 DGX-A100 0.7-18 Mixed ImageNet2012 A100-SXM4-40GB
23.75 75.90% classification 16x A100 DGX-A100 0.7-21 Mixed ImageNet2012 A100-SXM4-40GB
1.06 75.90% classification 768x A100 DGX-A100 0.7-32 Mixed ImageNet2012 A100-SXM4-40GB
0.83 75.90% classification 1536x A100 DGX-A100 0.7-35 Mixed ImageNet2012 A100-SXM4-40GB
0.76 75.90% classification 1840x A100 DGX-A100 0.7-37 Mixed ImageNet2012 A100-SXM4-40GB
SSD 2.25 23.0% mAP 64x A100 DGX-A100 0.7-25 Mixed COCO2017 A100-SXM4-40GB
0.89 23.0% mAP 512x A100 DGX-A100 0.7-31 Mixed COCO2017 A100-SXM4-40GB
0.82 23.0% mAP 1024x A100 DGX-A100 0.7-33 Mixed COCO2017 A100-SXM4-40GB
PyTorch BERT 49.01 0.712 Mask-LM accuracy 8x A100 DGX-A100 0.7-19 Mixed Wikipedia 2020/01/01 A100-SXM4-40GB
30.63 0.712 Mask-LM accuracy 16x A100 DGX-A100 0.7-22 Mixed Wikipedia 2020/01/01 A100-SXM4-40GB
3.36 0.712 Mask-LM accuracy 256x A100 DGX-A100 0.7-28 Mixed Wikipedia 2020/01/01 A100-SXM4-40GB
1.48 0.712 Mask-LM accuracy 1024x A100 DGX-A100 0.7-34 Mixed Wikipedia 2020/01/01 A100-SXM4-40GB
0.81 0.712 Mask-LM accuracy 2048x A100 DGX-A100 0.7-38 Mixed Wikipedia 2020/01/01 A100-SXM4-40GB
DLRM 4.43 0.8025 AUC 8x A100 DGX-A100 0.7-19 Mixed Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) A100-SXM4-40GB
GNMT 7.81 24.0 Sacre BLEU 8x A100 DGX-A100 0.7-19 Mixed WMT16 English-German A100-SXM4-40GB
4.94 24.0 Sacre BLEU 16x A100 DGX-A100 0.7-22 Mixed WMT16 English-German A100-SXM4-40GB
0.98 24.0 Sacre BLEU 256x A100 DGX-A100 0.7-28 Mixed WMT16 English-German A100-SXM4-40GB
0.71 24.0 Sacre BLEU 1024x A100 DGX-A100 0.7-34 Mixed WMT16 English-German A100-SXM4-40GB
Mask R-CNN 82.16 0.377 Box min AP and 0.339 Mask min AP 8x A100 DGX-A100 0.7-19 Mixed COCO2017 A100-SXM4-40GB
44.21 0.377 Box min AP and 0.339 Mask min AP 16x A100 DGX-A100 0.7-22 Mixed COCO2017 A100-SXM4-40GB
28.46 0.377 Box min AP and 0.339 Mask min AP 32x A100 DGX-A100 0.7-24 Mixed COCO2017 A100-SXM4-40GB
10.46 0.377 Box min AP and 0.339 Mask min AP 256x A100 DGX-A100 0.7-28 Mixed COCO2017 A100-SXM4-40GB
SSD 10.21 23.0% mAP 8x A100 DGX-A100 0.7-19 Mixed COCO2017 A100-SXM4-40GB
5.68 23.0% mAP 16x A100 DGX-A100 0.7-22 Mixed COCO2017 A100-SXM4-40GB
Transformer 7.84 25.00 BLEU 8x A100 DGX-A100 0.7-19 Mixed WMT17 English-German A100-SXM4-40GB
4.35 25.00 BLEU 16x A100 DGX-A100 0.7-22 Mixed WMT17 English-German A100-SXM4-40GB
1.8 25.00 BLEU 80x A100 DGX-A100 0.7-26 Mixed WMT17 English-German A100-SXM4-40GB
1.02 25.00 BLEU 160x A100 DGX-A100 0.7-27 Mixed WMT17 English-German A100-SXM4-40GB
0.62 25.00 BLEU 480x A100 DGX-A100 0.7-30 Mixed WMT17 English-German A100-SXM4-40GB
TensorFlow MiniGo 299.73 50% win rate vs. checkpoint 8x A100 DGX-A100 0.7-20 Mixed N/A A100-SXM4-40GB
165.72 50% win rate vs. checkpoint 16x A100 DGX-A100 0.7-23 Mixed N/A A100-SXM4-40GB
29.7 50% win rate vs. checkpoint 256x A100 DGX-A100 0.7-29 Mixed N/A A100-SXM4-40GB
17.07 50% win rate vs. checkpoint 1792x A100 DGX-A100 0.7-36 Mixed N/A A100-SXM4-40GB
NVIDIA Merlin HugeCTR DLRM 3.33 0.8025 AUC 8x A100 DGX-A100 0.7-17 Mixed Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) A100-SXM4-40GB

Training Natural Language Processing

BERT Pre-Training Throughput

DGX-A100 server w/ 8x NVIDIA A100 on PyTorch | DGX-1 server w/ 8x NVIDIA V100 on PyTorch | (2/3) Phase 1 and (1/3) Phase 2 | Precision: FP16 for A100 and Mixed for V100 | Sequence Length for Phase 1 = 128 and Phase 2 = 512

NVIDIA A100 BERT Training Benchmarks

Framework Network Throughput GPU Server Container Precision Batch Size Dataset GPU Version
PyTorch BERT Pre-Training 2,274 sequences/sec 8x A100 DGX-A100 - FP16 - Wikipedia+BookCorpus A100 SXM4-40GB

DGX-A100 server w/ 8x NVIDIA A100 on PyTorch | (2/3) Phase 1 and (1/3) Phase 2 | Sequence Length for Phase 1 = 128 and Phase 2 = 512
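Because BERT pre-training is split across two phases with different sequence lengths, a single sequences/sec figure is effectively a time-weighted blend of the two phases. A minimal sketch, assuming the quoted fractions are fractions of wall-clock time and using hypothetical per-phase throughputs:

```python
def blended_throughput(phase1_seq_per_sec, phase2_seq_per_sec,
                       phase1_time_frac=2/3, phase2_time_frac=1/3):
    """Time-weighted average throughput across the two BERT
    pre-training phases (assumes the fractions are fractions
    of wall-clock time)."""
    return (phase1_time_frac * phase1_seq_per_sec
            + phase2_time_frac * phase2_seq_per_sec)

# Hypothetical per-phase numbers for illustration: Phase 1
# (sequence length 128) runs much faster than Phase 2 (length 512).
print(blended_throughput(3000, 800))
```

The blend explains why a reported aggregate number sits well below the Phase 1 rate even when Phase 2 occupies only a third of the run.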

Converged Training Performance

Benchmarks are reproducible by following links to NGC scripts

A100 Training Performance

Framework Network Time to Train (mins) Accuracy Throughput GPU Server Container Precision Batch Size Dataset GPU Version
MXNet ResNet-50 v1.5 40 75.9 Top 1 Accuracy 22,008 images/sec 8x A100 DGX-A100 20.06-py3 Mixed 408 ImageNet2012 A100-SXM4-40GB
PyTorch Mask R-CNN 191 0.34 AP Segm 159 images/sec 8x A100 DGX-A100 20.10-py3 TF32 8 COCO 2014 A100-SXM4-40GB
ResNeXt101 300 79.37 Top 1 Accuracy 6,888 images/sec 8x A100 DGX-A100 - Mixed 128 Imagenet2012 A100-SXM4-40GB
SE-ResNeXt101 420 79.95 Top 1 Accuracy 4,758 images/sec 8x A100 DGX-A100 - Mixed 128 Imagenet2012 A100-SXM4-40GB
SSD v1.1 43 0.25 mAP 3,048 images/sec 8x A100 DGX-A100 20.10-py3 Mixed 128 COCO 2017 A100-SXM4-40GB
Tacotron2 123 0.6 Training Loss 250,632 total output mels/sec 8x A100 DGX-A100 20.10-py3 TF32 128 LJSpeech 1.1 A100-SXM4-40GB
WaveGlow 420 -5.68 Training Loss 1,004,778 output samples/sec 8x A100 DGX-A100 20.10-py3 Mixed 10 LJSpeech 1.1 A100-SXM4-40GB
Transformer 128 27.71 BLEU Score 531,662 words/sec 8x A100 DGX-A100 20.07-py3 Mixed 10240 wmt14-en-de A100-SXM4-40GB
FastPitch 112 0.21 Training Loss 820,539 frames/sec 8x A100 DGX-A100 20.10-py3 Mixed 32 LJSpeech 1.1 A100-SXM4-40GB
GNMT V2 18 24.01 BLEU Score 829,778 total tokens/sec 8x A100 DGX-A100 20.10-py3 Mixed 128 wmt16-en-de A100-SXM4-40GB
NCF 0.42 0.96 Hit Rate at 10 138,413,232 samples/sec 8x A100 - 20.10-py3 Mixed 131072 MovieLens 20M A100-SXM4-80GB
BERT-LARGE Pre-Training 3,732 Final Loss 1.34 2,284 sequences/sec 8x A100 DGX-A100 20.06-py3 Mixed - Wikipedia+BookCorpus A100-SXM4-40GB
BERT-LARGE Fine Tuning 4 91.03 F1 851 sequences/sec 8x A100 - 20.10-py3 Mixed 32 SQuaD v1.1 A100-SXM4-80GB
TensorFlow ResNet-50 V1.5 112 76.98 Top 1 Accuracy 17,343 images/sec 8x A100 DGX-A100 20.09-py3 Mixed 256 ImageNet2012 A100-SXM4-40GB
SSD v1.2 45 0.28 mAP 1,619 images/sec 8x A100 DGX-A100 20.09-py3 Mixed 32 COCO 2017 A100-SXM4-40GB
Mask R-CNN 180 0.34 AP Segm 149 samples/sec 8x A100 DGX-A100 20.10-py3 Mixed 4 COCO 2014 A100-SXM4-40GB
U-Net Industrial 0.92 0.99 IoU Threshold 0.95 779 images/sec 8x A100 DGX-A100 20.10-py3 Mixed 2 DAGM2007 A100-SXM4-40GB
U-Net Medical 3 0.89 DICE Score 901 images/sec 8x A100 DGX-A100 20.10-py3 Mixed 8 EM segmentation challenge A100-SXM4-40GB
VAE-CF 2 0.43 NDCG@100 1,326,534 users processed/sec 8x A100 DGX-A100 20.10-py3 Mixed 3072 MovieLens 20M A100-SXM4-40GB
GNMT V2 123 24.29 BLEU Score 223,452 total tokens/sec 8x A100 DGX-A100 20.08-py3 TF32 128 wmt16-en-de A100-SXM4-40GB
BERT-LARGE Pre-Training 4,170 Final Loss 1.56 2,045 sequences/sec 8x A100 DGX-A100 20.06-py3 Mixed - Wikipedia+BookCorpus A100-SXM4-40GB
BERT-LARGE Fine Tuning 12 90.69 F1 755 sequences/sec 8x A100 DGX-A100 20.10-py3 Mixed 24 SQuaD v1.1 A100-SXM4-40GB
ResNeXt101 208 79.20 Top 1 Accuracy 8,500 images/sec 8x A100 DGX-A100 - Mixed 256 Imagenet2012 A100-SXM4-40GB
EfficientNet-B4 4,615 82.87 Top 1 Accuracy 2,538 images/sec 8x A100 DGX-A100 20.08-py3 Mixed 160 ImageNet2012 A100-SXM4-80GB

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | A hyphen in the Server column indicates a pre-production server
BERT-Large Fine Tuning: Sequence Length = 384
BERT-Large Pre-Training (9/10 epochs) Phase 1 and (1/10 epochs) Phase 2: Sequence Length for Phase 1 = 128 and Phase 2 = 512 | Batch Size for Phase 1 = 65,536 and Phase 2 = 32,768
EfficientNet-B4: Mixup = 0.2 | Auto-Augmentation | cuDNN Version = 8.0.5.39 | NCCL Version = 2.7.8

V100 Training Performance

Framework Network Time to Train (mins) Accuracy Throughput GPU Server Container Precision Batch Size Dataset GPU Version
MXNet ResNet-50 v1.5 177 77.22 Top 1 Accuracy 11,180 images/sec 8x V100 DGX-2 20.10-py3 Mixed 256 ImageNet2012 V100-SXM3-32GB
PyTorch Mask R-CNN 270 0.34 AP Segm 109 images/sec 8x V100 DGX-2 20.10-py3 Mixed 8 COCO 2014 V100-SXM3-32GB
ResNeXt101 540 79.43 Top 1 Accuracy 4,001 images/sec 8x V100 DGX-1 - Mixed 128 Imagenet2012 V100-SXM2-16GB
SE-ResNeXt101 780 80.04 Top 1 Accuracy 2,695 images/sec 8x V100 DGX-1 - Mixed 128 Imagenet2012 V100-SXM2-16GB
SSD v1.1 71 0.19 mAP 1,832 images/sec 8x V100 DGX-2 20.10-py3 Mixed 64 COCO 2017 V100-SXM3-32GB
Tacotron2 247 0.52 Training Loss 127,321 total output mels/sec 8x V100 DGX-2 20.10-py3 Mixed 104 LJSpeech 1.1 V100-SXM3-32GB
Transformer 463 27.52 BLEU Score 214,856 words/sec 8x V100 DGX-2 20.08-py3 Mixed 5120 wmt14-en-de V100-SXM3-32GB
GNMT V2 31 24.03 BLEU Score 479,039 total tokens/sec 8x V100 DGX-2 20.10-py3 Mixed 128 wmt16-en-de V100-SXM3-32GB
NCF 0.57 0.96 Hit Rate at 10 96,982,525 samples/sec 8x V100 DGX-2 20.10-py3 Mixed 131072 MovieLens 20M V100-SXM3-32GB
BERT-LARGE Fine Tuning 8 91.18 F1 354 sequences/sec 8x V100 DGX-2 20.10-py3 Mixed 10 SQuaD v1.1 V100-SXM3-32GB
TensorFlow ResNet-50 V1.5 193 76.83 Top 1 Accuracy 10,036 images/sec 8x V100 DGX-2 20.09-py3 Mixed 256 ImageNet2012 V100-SXM3-32GB
Mask R-CNN 286 0.34 AP Segm 91 samples/sec 8x V100 DGX-2 20.10-py3 Mixed 4 COCO 2014 V100-SXM3-32GB
U-Net Medical 4 0.89 DICE Score 447 images/sec 8x V100 DGX-2 20.10-py3 Mixed 8 EM segmentation challenge V100-SXM3-32GB
BERT-LARGE Fine Tuning 20 90.87 F1 326 sequences/sec 8x V100 DGX-2 20.10-py3 Mixed 10 SQuaD v1.1 V100-SXM3-32GB
ResNeXt101 480 79.30 Top 1 Accuracy 4,160 images/sec 8x V100 DGX-1 - Mixed 128 Imagenet2012 V100-SXM2-16GB

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
BERT-Large Fine Tuning: Sequence Length = 384

T4 Training Performance

Framework Network Time to Train (mins) Accuracy Throughput GPU Server Container Precision Batch Size Dataset GPU Version
MXNet ResNet-50 v1.5 501 77.16 Top 1 Accuracy 3,894 images/sec 8x T4 Supermicro SYS-4029GP-TRT 20.10-py3 Mixed 192 ImageNet2012 NVIDIA T4
PyTorch Mask R-CNN 576 0.34 AP Segm 42 images/sec 8x T4 Supermicro SYS-4029GP-TRT 20.10-py3 Mixed 4 COCO 2014 NVIDIA T4
ResNeXt101 1,738 78.75 Top 1 Accuracy 1,124 images/sec 8x T4 Supermicro SYS-4029GP-TRT 20.10-py3 Mixed 128 Imagenet2012 NVIDIA T4
Tacotron2 284 0.52 Training Loss 109,509 total output mels/sec 8x T4 Supermicro SYS-4029GP-TRT 20.10-py3 Mixed 104 LJSpeech 1.1 NVIDIA T4
WaveGlow 1,182 -5.69 Training Loss 350,957 output samples/sec 8x T4 Supermicro SYS-4029GP-TRT 20.10-py3 Mixed 10 LJSpeech 1.1 NVIDIA T4
Transformer 807 27.68 BLEU Score 83,760 words/sec 8x T4 Supermicro SYS-4029GP-TRT 20.07-py3 Mixed 5120 wmt14-en-de NVIDIA T4
FastPitch 319 0.21 Training Loss 281,406 frames/sec 8x T4 Supermicro SYS-4029GP-TRT 20.10-py3 Mixed 32 LJSpeech 1.1 NVIDIA T4
GNMT V2 90 24.27 BLEU Score 160,009 total tokens/sec 8x T4 Supermicro SYS-4029GP-TRT 20.10-py3 Mixed 128 wmt16-en-de NVIDIA T4
NCF 2 0.96 Hit Rate at 10 25,792,995 samples/sec 8x T4 Supermicro SYS-4029GP-TRT 20.10-py3 Mixed 131072 MovieLens 20M NVIDIA T4
BERT-LARGE Fine Tuning 23 91.25 F1 127 sequences/sec 8x T4 Supermicro SYS-4029GP-TRT 20.10-py3 Mixed 10 SQuaD v1.1 NVIDIA T4
TensorFlow ResNet-50 V1.5 639 77.08 Top 1 Accuracy 3,010 images/sec 8x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 256 ImageNet2012 NVIDIA T4
SSD v1.2 112 0.29 mAP 548 images/sec 8x T4 Supermicro SYS-4029GP-TRT 20.10-py3 Mixed 32 COCO 2017 NVIDIA T4
Mask R-CNN 489 0.34 AP Segm 52 samples/sec 8x T4 Supermicro SYS-4029GP-TRT 20.10-py3 Mixed 4 COCO 2014 NVIDIA T4
U-Net Industrial 2 0.99 IoU Threshold 0.95 286 images/sec 8x T4 Supermicro SYS-4029GP-TRT 20.10-py3 Mixed 2 DAGM2007 NVIDIA T4
U-Net Medical 13 0.9 DICE Score 155 images/sec 8x T4 Supermicro SYS-4029GP-TRT 20.10-py3 Mixed 8 EM segmentation challenge NVIDIA T4
VAE-CF 2 0.43 NDCG@100 368,484 users processed/sec 8x T4 Supermicro SYS-4029GP-TRT 20.10-py3 Mixed 3072 MovieLens 20M NVIDIA T4
GNMT V2 309 24.26 BLEU Score 64,567 total tokens/sec 8x T4 Supermicro SYS-4029GP-TRT 20.08-py3 Mixed 128 wmt16-en-de NVIDIA T4
BERT-LARGE Fine Tuning 65 91.13 F1 53 sequences/sec 8x T4 Supermicro SYS-4029GP-TRT 20.10-py3 Mixed 3 SQuaD v1.1 NVIDIA T4

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
BERT-Large Fine Tuning: Sequence Length = 384


Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. This is the best methodology for testing AI systems, and it is typically done on multi-accelerator systems (see the 'Training-Convergence' tab or read our blog on convergence for more details) to shorten training-to-convergence times, especially for recurring monthly container builds.

Scenarios that are not typically used in real-world training, such as single-GPU throughput, are illustrated in the table below and provided for reference as an indication of the platform's single-chip throughput.
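One way to relate single-GPU throughput to the 8-GPU converged results is linear scaling efficiency: measured multi-GPU throughput divided by the single-GPU throughput times the GPU count. A minimal sketch using the ResNet-50 v1.5 (MXNet) figures from the tables on this page; note the 1x run used an 80GB A100 and the 8x run 40GB parts, so the comparison is only indicative:

```python
def scaling_efficiency(multi_gpu_throughput, single_gpu_throughput, num_gpus):
    """Fraction of ideal linear scaling achieved by a multi-GPU run."""
    return multi_gpu_throughput / (single_gpu_throughput * num_gpus)

# ResNet-50 v1.5 (MXNet): 22,008 images/sec on 8x A100 vs
# 2,751 images/sec on 1x A100 (taken from the tables on this page).
eff = scaling_efficiency(22_008, 2_751, 8)
print(f"{eff:.1%}")
```

Efficiencies near 100% indicate the workload is compute-bound at this node size rather than limited by inter-GPU communication.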

NVIDIA’s complete solution stack, from hardware to software, allows data scientists to deliver unprecedented acceleration at every scale. Visit NVIDIA GPU Cloud (NGC) to pull containers and quickly get up and running with deep learning.

Single GPU Training Performance

Benchmarks are reproducible by following links to NGC scripts

A100 Training Performance

Framework Network Throughput GPU Server Container Precision Batch Size Dataset GPU Version
MXNet ResNet-50 v1.5 2,751 images/sec 1x A100 - - Mixed 408 ImageNet2012 A100-SXM4-80GB
PyTorch Mask R-CNN 23 images/sec 1x A100 - 20.10-py3 Mixed 8 COCO 2014 A100-SXM4-80GB
ResNeXt101 908 images/sec 1x A100 DGX-A100 - Mixed 128 Imagenet2012 A100-SXM4-40GB
SE-ResNeXt101 642 images/sec 1x A100 DGX-A100 - Mixed 128 Imagenet2012 A100-SXM4-40GB
SSD v1.1 427 images/sec 1x A100 - 20.10-py3 Mixed 128 COCO 2017 A100-SXM4-80GB
Tacotron2 31,640 total output mels/sec 1x A100 DGX-A100 20.10-py3 TF32 128 LJSpeech 1.1 A100-SXM4-40GB
WaveGlow 147,276 output samples/sec 1x A100 DGX-A100 20.10-py3 Mixed 10 LJSpeech 1.1 A100-SXM4-40GB
Transformer 77,911 words/sec 1x A100 DGX-A100 20.07-py3 Mixed 10240 wmt14-en-de A100-SXM4-40GB
FastPitch 174,736 frames/sec 1x A100 - 20.10-py3 Mixed 128 LJSpeech 1.1 A100-SXM4-80GB
GNMT V2 140,793 total tokens/sec 1x A100 - 20.10-py3 Mixed 128 wmt16-en-de A100-SXM4-80GB
NCF 35,372,898 samples/sec 1x A100 - 20.10-py3 Mixed 1048576 MovieLens 20M A100-SXM4-80GB
BERT-LARGE Fine Tuning 115 sequences/sec 1x A100 - 20.10-py3 Mixed 32 SQuaD v1.1 A100-SXM4-80GB
TensorFlow SSD v1.2 329 images/sec 1x A100 - 20.10-py3 Mixed 32 COCO 2017 A100-SXM4-80GB
Mask R-CNN 20 samples/sec 1x A100 DGX-A100 20.10-py3 Mixed 4 COCO 2014 A100-SXM4-40GB
U-Net Industrial 307 images/sec 1x A100 DGX-A100 20.10-py3 Mixed 16 DAGM2007 A100-SXM4-40GB
U-Net Medical 138 images/sec 1x A100 - 20.10-py3 Mixed 8 EM segmentation challenge A100-SXM4-80GB
VAE-CF 369,602 users processed/sec 1x A100 DGX-A100 20.10-py3 Mixed 24576 MovieLens 20M A100-SXM4-40GB
GNMT V2 34,549 total tokens/sec 1x A100 - 20.10-py3 TF32 128 wmt16-en-de A100-SXM4-80GB
BERT-LARGE Fine Tuning 107 sequences/sec 1x A100 DGX-A100 20.10-py3 Mixed 24 SQuaD v1.1 A100-SXM4-40GB
NCF 40,020,400 samples/sec 1x A100 DGX-A100 20.10-py3 Mixed 1048576 MovieLens 20M A100-SXM4-40GB
ResNeXt101 1,132 images/sec 1x A100 DGX-A100 - Mixed 256 Imagenet2012 A100-SXM4-40GB
EfficientNet-B4 332 images/sec 1x A100 DGX-A100 - Mixed 160 Imagenet2012 A100-SXM4-80GB

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | A hyphen in the Server column indicates a pre-production server
BERT-Large Fine Tuning: Sequence Length = 384
EfficientNet-B4: Basic Augmentation | cuDNN Version = 8.0.5.32 | NCCL Version = 2.7.8 | Installation Source = NGC

V100 Training Performance

Framework Network Throughput GPU Server Container Precision Batch Size Dataset GPU Version
PyTorch ResNeXt101 543 images/sec 1x V100 DGX-2 20.11-py3 Mixed 128 Imagenet2012 V100-SXM3-32GB
SE-ResNeXt101 399 images/sec 1x V100 DGX-2 20.11-py3 Mixed 128 Imagenet2012 V100-SXM3-32GB
SSD v1.1 241 images/sec 1x V100 DGX-2 20.10-py3 Mixed 64 COCO 2017 V100-SXM3-32GB
Tacotron2 18,281 total output mels/sec 1x V100 DGX-2 20.08-py3 Mixed 104 LJSpeech 1.1 V100-SXM3-32GB
Transformer 33,652 words/sec 1x V100 DGX-2 20.10-py3 Mixed 5120 wmt14-en-de V100-SXM3-32GB
FastPitch 106,930 frames/sec 1x V100 DGX-2 20.10-py3 Mixed 64 LJSpeech 1.1 V100-SXM3-32GB
GNMT V2 81,360 total tokens/sec 1x V100 DGX-2 20.10-py3 Mixed 128 wmt16-en-de V100-SXM3-32GB
NCF 21,994,770 samples/sec 1x V100 DGX-2 20.10-py3 Mixed 1048576 MovieLens 20M V100-SXM3-32GB
BERT-LARGE Fine Tuning 50 sequences/sec 1x V100 DGX-2 20.10-py3 Mixed 10 SQuaD v1.1 V100-SXM3-32GB
TensorFlow U-Net Industrial 109 images/sec 1x V100 DGX-2 20.10-py3 Mixed 16 DAGM2007 V100-SXM3-32GB
U-Net Medical 64 images/sec 1x V100 DGX-2 20.10-py3 Mixed 8 EM segmentation challenge V100-SXM3-32GB
VAE-CF 222,299 users processed/sec 1x V100 DGX-2 20.10-py3 Mixed 24576 MovieLens 20M V100-SXM3-32GB
NCF 25,694,022 samples/sec 1x V100 DGX-2 20.08-py3 Mixed 1048576 MovieLens 20M V100-SXM3-32GB
BERT-LARGE Fine Tuning 48 sequences/sec 1x V100 DGX-2 20.10-py3 Mixed 10 SQuaD v1.1 V100-SXM3-32GB
ResNeXt101 610 images/sec 1x V100 DGX-2 20.11-py3 Mixed 128 Imagenet2012 V100-SXM3-32GB

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
BERT-Large Fine Tuning: Sequence Length = 384

T4 Training Performance

Framework Network Throughput GPU Server Container Precision Batch Size Dataset GPU Version
MXNet ResNet-50 v1.5 483 images/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.10-py3 Mixed 64 ImageNet2012 NVIDIA T4
PyTorch Mask R-CNN 8 images/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.10-py3 Mixed 4 COCO 2014 NVIDIA T4
ResNeXt101 208 images/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.11-py3 Mixed 128 Imagenet2012 NVIDIA T4
SE-ResNeXt101 157 images/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.11-py3 Mixed 128 Imagenet2012 NVIDIA T4
SSD v1.1 85 images/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.10-py3 Mixed 64 COCO 2017 NVIDIA T4
Tacotron2 15,022 total output mels/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.10-py3 Mixed 104 LJSpeech 1.1 NVIDIA T4
WaveGlow 50,938 output samples/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.10-py3 Mixed 10 LJSpeech 1.1 NVIDIA T4
Transformer 12,869 words/sec 1x T4 Supermicro SYS-4029GP-TRT 20.07-py3 Mixed 5120 wmt14-en-de NVIDIA T4
FastPitch 41,685 frames/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.10-py3 Mixed 64 LJSpeech 1.1 NVIDIA T4
GNMT V2 31,430 total tokens/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.10-py3 Mixed 128 wmt16-en-de NVIDIA T4
NCF 8,000,204 samples/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.10-py3 Mixed 1048576 MovieLens 20M NVIDIA T4
BERT-LARGE Fine Tuning 18 sequences/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.10-py3 Mixed 10 SQuaD v1.1 NVIDIA T4
TensorFlow ResNet-50 V1.5 407 images/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.10-py3 Mixed 256 ImageNet2012 NVIDIA T4
SSD v1.2 97 images/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.10-py3 Mixed 32 COCO 2017 NVIDIA T4
U-Net Industrial 41 images/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.10-py3 Mixed 16 DAGM2007 NVIDIA T4
U-Net Medical 21 images/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.10-py3 Mixed 8 EM segmentation challenge NVIDIA T4
VAE-CF 81,201 users processed/sec 1x T4 Supermicro SYS-4029GP-TRT 20.06-py3 Mixed 24576 MovieLens 20M NVIDIA T4
GNMT V2 9,810 total tokens/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.10-py3 Mixed 128 wmt16-en-de NVIDIA T4
NCF 10,371,809 samples/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed 1048576 MovieLens 20M NVIDIA T4
BERT-LARGE Fine Tuning 13 sequences/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.10-py3 Mixed 3 SQuaD v1.1 NVIDIA T4

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
BERT-Large Fine Tuning: Sequence Length = 384

Real-world AI inferencing demands high throughput and low latency with maximum efficiency across use cases. An industry-leading solution enables customers to quickly deploy AI models into real-world production with the highest performance, from data centers to the edge.

NVIDIA landed top performance spots on all MLPerf Inference 0.7 tests, the AI industry's leading benchmark. NVIDIA® TensorRT™ running on NVIDIA Tensor Core GPUs enables the most efficient deep learning inference performance across multiple application areas and models. This versatility gives data scientists wide latitude to create the optimal low-latency solution. Visit NVIDIA GPU Cloud (NGC) to download any of these containers and move quickly into production. The inference whitepaper provides an overview of inference platforms.

Measuring inference performance involves balancing many variables. PLASTER is an acronym that captures the key elements of measuring deep learning performance. Each letter identifies a factor (Programmability, Latency, Accuracy, Size of Model, Throughput, Energy Efficiency, Rate of Learning) that must be considered to arrive at the right set of tradeoffs and produce a successful deep learning implementation. Refer to the PLASTER whitepaper for more details.


MLPerf Inference v0.7 Performance Benchmarks

Offline Scenario

Network Throughput GPU Server Dataset GPU Version
ResNet-50 v1.5 298,647 samples/sec 8x A100 DGX-A100 ImageNet A100-SXM4-40GB
48,898 samples/sec 8x T4 Supermicro 4029GP-TRT-OTO-28 ImageNet NVIDIA T4
SSD ResNet-34 7,788 samples/sec 8x A100 DGX-A100 COCO A100-SXM4-40GB
1,112 samples/sec 8x T4 Supermicro 4029GP-TRT-OTO-28 COCO NVIDIA T4
3D-UNet 328 samples/sec 8x A100 DGX-A100 BraTS 2019 A100-SXM4-40GB
58 samples/sec 8x T4 Supermicro 4029GP-TRT-OTO-28 BraTS 2019 NVIDIA T4
RNN-T 82,401 samples/sec 8x A100 DGX-A100 LibriSpeech A100-SXM4-40GB
11,963 samples/sec 8x T4 Supermicro 4029GP-TRT-OTO-28 LibriSpeech NVIDIA T4
BERT 26,625 samples/sec 8x A100 DGX-A100 SQuAD v1.1 A100-SXM4-40GB
3,495 samples/sec 8x T4 Supermicro 4029GP-TRT-OTO-28 SQuAD v1.1 NVIDIA T4
DLRM 2,113,510 samples/sec 8x A100 DGX-A100 Criteo 1TB Click Logs A100-SXM4-40GB
272,416 samples/sec 8x T4 Supermicro 4029GP-TRT-OTO-28 Criteo 1TB Click Logs NVIDIA T4

MLPerf v0.7 Inference Closed: ResNet-50 v1.5, SSD ResNet-34, 3D U-Net 99.9% accuracy target, RNN-T, BERT 99% accuracy target, DLRM 99.9% accuracy target: 0.7-111, 0.7-113. The MLPerf name and logo are trademarks. See www.mlperf.org for more information.

 

Inference Performance of NVIDIA A100, V100 and T4

Benchmarks are reproducible by following links to NGC scripts

Inference Natural Language Processing

BERT Inference Throughput

DGX-A100 server w/ 1x NVIDIA A100 with 7 MIG instances of 1g.5gb | Batch Size = 94 | Precision: INT8 | Sequence Length = 128
DGX-1 server w/ 1x NVIDIA V100 | TensorRT 7.1 | Batch Size = 256 | Precision: Mixed | Sequence Length = 128

 

NVIDIA A100 BERT Inference Benchmarks

Network Network Type Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
BERT-Large with Sparsity Attention 94 6,188 sequences/sec - - 1x A100 DGX-A100 - INT8 SQuaD v1.1 - A100 SXM4-40GB

A100 with 7 MIG instances of 1g.5gb | Sequence length=128 | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container
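Because the seven 1g.5gb MIG instances run independently and in parallel, aggregate MIG throughput should be close to seven times the single-slice rate. A quick sanity check using the BERT-Large batch-256 row from the A100 inference table on this page:

```python
# Figures taken from the BERT-Large batch-256 row of the
# A100 inference table on this page.
single_mig = 681      # sequences/sec on one 1g.5gb MIG instance
seven_mig = 4_741     # sequences/sec with all 7 instances active
full_chip = 4_679     # sequences/sec on the undivided GPU

# Ideal aggregate is 7x a single slice; the measured 7-MIG number
# lands within a few percent of it, and also edges out the
# full-chip result for this workload.
ideal = 7 * single_mig
print(ideal, seven_mig, full_chip)
```

This is why MIG is attractive for small-batch inference: partitioning sacrifices little aggregate throughput while isolating seven independent clients.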

Inference Image Classification on CNNs with TensorRT

ResNet-50 v1.5 Throughput

Pre-production server: Platinum 8168 @2.7GHz w/ 1x NVIDIA A100-SXM4-80GB | TensorRT 7.2.1 | Batch Size = 128 | 20.10-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 7.2.1 | Batch Size = 128 | 20.10-py3 | Precision: Mixed | Dataset: Synthetic
Supermicro SYS-4029GP-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 7.2.1 | Batch Size = 128 | 20.10-py3 | Precision: INT8 | Dataset: Synthetic

 
 

ResNet-50 v1.5 Latency

Pre-production server: Platinum 8168 @2.7GHz w/ 1x NVIDIA A100-SXM4-80GB | TensorRT 7.2.1 | Batch Size = 1 | 20.10-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 7.2.1 | Batch Size = 1 | 20.10-py3 | Precision: Mixed | Dataset: Synthetic
Supermicro SYS-4029GP-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 7.2.1 | Batch Size = 1 | 20.10-py3 | Precision: INT8 | Dataset: Synthetic

 
 

ResNet-50 v1.5 Power Efficiency

Pre-production server: Platinum 8168 @2.7GHz w/ 1x NVIDIA A100-SXM4-80GB | TensorRT 7.2.1 | Batch Size = 128 | 20.10-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 7.2.1 | Batch Size = 128 | 20.10-py3 | Precision: Mixed | Dataset: Synthetic
Supermicro SYS-4029GP-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 7.2.1 | Batch Size = 128 | 20.10-py3 | Precision: INT8 | Dataset: Synthetic

 

A100 Inference Performance

Network Batch Size 1/7 MIG Throughput 7 MIG Throughput Full Chip Throughput Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
ResNet-50 1 - - 2,181 images/sec 0.46 1x A100 - 20.10-py3 INT8 Synthetic TensorRT 7.2.1 A100-SXM4-80GB
2 - - 3,971 images/sec 0.50 1x A100 - 20.10-py3 INT8 Synthetic TensorRT 7.2.1 A100-SXM4-80GB
8 - - 11,186 images/sec 0.72 1x A100 - 20.10-py3 INT8 Synthetic TensorRT 7.2.1 A100-SXM4-80GB
128 - - 28,463 images/sec 4.50 1x A100 - 20.10-py3 INT8 Synthetic TensorRT 7.2.1 A100-SXM4-80GB
211 - - 30,726 images/sec 6.87 1x A100 - 20.10-py3 INT8 Synthetic TensorRT 7.2.1 A100-SXM4-80GB
ResNet-50v1.5 1 - - 2,116 images/sec 0.47 1x A100 - 20.10-py3 INT8 Synthetic TensorRT 7.2.1 A100-SXM4-80GB
2 - - 3,973 images/sec 0.50 1x A100 - 20.10-py3 INT8 Synthetic TensorRT 7.2.1 A100-SXM4-80GB
8 - - 10,940 images/sec 0.73 1x A100 - 20.10-py3 INT8 Synthetic TensorRT 7.2.1 A100-SXM4-80GB
128 - - 27,443 images/sec 4.66 1x A100 - 20.10-py3 INT8 Synthetic TensorRT 7.2.1 A100-SXM4-80GB
206 - - 29,550 images/sec 6.97 1x A100 - 20.10-py3 INT8 Synthetic TensorRT 7.2.1 A100-SXM4-80GB
236 - - 34,249 images/sec 6.89 1x A100 - - INT8 Synthetic TensorRT 7.2 A100-SXM4-40GB
ResNext101 32 - - 7,674 images/sec 4.17 1x A100 - - INT8 Synthetic TensorRT 7.2.2 A100-SXM4-40GB
EfficientNet-B0 128 - - 22,346 images/sec 5.73 1x A100 - - INT8 Synthetic TensorRT 7.2.1.6 A100-SXM4-40GB
BERT-BASE 1 590 sequences/sec - 1,341 sequences/sec 0.75 1x A100 DGX-A100 - INT8 Real (Q&A provided as text input) TensorRT 7.2 A100-SXM4-40GB
2 888 sequences/sec - 2,416 sequences/sec 0.83 1x A100 DGX-A100 - INT8 Real (Q&A provided as text input) TensorRT 7.2 A100-SXM4-40GB
8 1,455 sequences/sec - 6,830 sequences/sec 1.17 1x A100 DGX-A100 - INT8 Real (Q&A provided as text input) TensorRT 7.2 A100-SXM4-40GB
128 2,101 sequences/sec - 13,697 sequences/sec 9.35 1x A100 DGX-A100 - INT8 Real (Q&A provided as text input) TensorRT 7.2 A100-SXM4-40GB
256 2,142 sequences/sec - 14,490 sequences/sec 17.67 1x A100 DGX-A100 - INT8 Real (Q&A provided as text input) TensorRT 7.2 A100-SXM4-40GB
BERT-LARGE 1 241 sequences/sec - 585 sequences/sec 1.71 1x A100 DGX-A100 - INT8 Real (Q&A provided as text input) TensorRT 7.2 A100-SXM4-40GB
2 312 sequences/sec - 1,068 sequences/sec 1.87 1x A100 DGX-A100 - INT8 Real (Q&A provided as text input) TensorRT 7.2 A100-SXM4-40GB
8 531 sequences/sec - 2,152 sequences/sec 3.72 1x A100 DGX-A100 - INT8 Real (Q&A provided as text input) TensorRT 7.2 A100-SXM4-40GB
12 - - 2,804 sequences/sec 4.3 1x A100 DGX-A100 20.11-py3 INT8 Real (Q&A provided as text input) TensorRT 7.2 A100-SXM4-40GB
128 657 sequences/sec - 4,481 sequences/sec 28.57 1x A100 DGX-A100 - INT8 Real (Q&A provided as text input) TensorRT 7.2 A100-SXM4-40GB
256 681 sequences/sec 4,741 sequences/sec 4,679 sequences/sec 54.71 1x A100 - - INT8 Real (Q&A provided as text input) TensorRT 7.2 A100-SXM4-80GB

A hyphen in the Container column indicates a pre-release container | A hyphen in the Server column indicates a pre-production server
BERT-Large: Sequence Length = 128
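For an engine processing one batch at a time, the throughput and latency columns above are tied together: throughput is approximately batch size divided by per-batch latency. A quick sanity check, assuming the reported latency is per batch, using the ResNet-50 batch-128 row from the A100 table above:

```python
def implied_throughput(batch_size, latency_ms):
    """Images/sec implied by per-batch latency for a serially
    executed engine (ignores any pipelining overlap)."""
    return batch_size / (latency_ms / 1000.0)

# ResNet-50, batch 128 at 4.50 ms (A100 table above) reports
# 28,463 images/sec; batch size / latency gives a close match.
print(implied_throughput(128, 4.50))
```

The small gap between the implied and reported numbers is expected, since reported throughput can benefit from overlapping host/device transfers with compute.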

 

V100 Inference Performance

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
ResNet-50 1 1,241 images/sec 7.8 images/sec/watt 0.81 1x V100 DGX-2 20.10-py3 INT8 Synthetic TensorRT 7.2.1 V100-SXM3-32GB
2 1,995 images/sec 11 images/sec/watt 1 1x V100 DGX-2 20.10-py3 Mixed Synthetic TensorRT 7.2.1 V100-SXM3-32GB
8 4,323 images/sec 17 images/sec/watt 1.9 1x V100 DGX-2 20.10-py3 Mixed Synthetic TensorRT 7.2.1 V100-SXM3-32GB
52 7,833 images/sec - 6.6 1x V100 DGX-2 20.10-py3 Mixed Synthetic TensorRT 7.2.1 V100-SXM3-32GB
128 8,115 images/sec 24 images/sec/watt 16 1x V100 DGX-2 20.10-py3 Mixed Synthetic TensorRT 7.2.1 V100-SXM3-32GB
ResNet-50v1.5 1 1,134 images/sec 7.3 images/sec/watt 0.88 1x V100 DGX-2 20.10-py3 Mixed Synthetic TensorRT 7.2.1 V100-SXM3-32GB
2 1,985 images/sec 11 images/sec/watt 1 1x V100 DGX-2 20.10-py3 Mixed Synthetic TensorRT 7.2.1 V100-SXM3-32GB
8 4,183 images/sec 16 images/sec/watt 1.9 1x V100 DGX-2 20.10-py3 Mixed Synthetic TensorRT 7.2.1 V100-SXM3-32GB
52 7,540 images/sec - 6.9 1x V100 DGX-2 20.10-py3 Mixed Synthetic TensorRT 7.2.1 V100-SXM3-32GB
128 7,797 images/sec 23 images/sec/watt 16 1x V100 DGX-2 20.10-py3 Mixed Synthetic TensorRT 7.2.1 V100-SXM3-32GB
BERT-BASE 1 817 sequences/sec 4.3 sequences/sec/watt 1.2 1x V100 DGX-2 20.11-py3 INT8 Sample Text TensorRT 7.2.1 V100-SXM3-32GB
2 1,307 sequences/sec - 1.5 1x V100 DGX-2 20.03-py3 Mixed Sample Text TensorRT 7.0.0 V100-SXM3-32GB
8 2,389 sequences/sec - 3.4 1x V100 DGX-2 20.03-py3 Mixed Sample Text TensorRT 7.0.0 V100-SXM3-32GB
26 3,002 sequences/sec 13.36 sequences/sec/watt 8.66 1x V100 DGX-2 20.03-py3 Mixed Sample Text TensorRT 7.0.0 V100-SXM3-32GB
128 3,194 sequences/sec 11 sequences/sec/watt 40 1x V100 DGX-2 20.11-py3 INT8 Sample Text TensorRT 7.2.1 V100-SXM3-32GB
BERT-LARGE 1 310 sequences/sec 1.6 sequences/sec/watt 3.2 1x V100 DGX-2 20.11-py3 INT8 Sample Text TensorRT 7.2.1 V100-SXM3-32GB
2 507 sequences/sec 3.68 sequences/sec/watt 3.95 1x V100 DGX-2 20.03-py3 Mixed Sample Text TensorRT 7.0.0 V100-SXM3-32GB
8 792 sequences/sec 3 sequences/sec/watt 10 1x V100 DGX-2 20.11-py3 INT8 Sample Text TensorRT 7.2.1 V100-SXM3-32GB
128 1,043 sequences/sec 3.1 sequences/sec/watt 123 1x V100 DGX-2H 20.03-py3 Mixed Sample Text TensorRT 7.0.0 V100-SXM3-32GB-H
WaveGlow 1 614,788 output samples/sec - 0.37 1x V100 DGX-2 20.10-py3 Mixed LJSpeech 1.1 PyTorch 1.7.0a0+7036e91 V100-SXM3-32GB

NGC: TensorRT Container
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container

 

T4 Inference Performance

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
ResNet-50 1 1,408 images/sec 20 images/sec/watt 0.71 1x T4 Supermicro SYS-4029GP-TRT 20.10-py3 INT8 Synthetic TensorRT 7.2.1 NVIDIA T4
2 1,987 images/sec 29 images/sec/watt 1 1x T4 Supermicro SYS-4029GP-TRT 20.10-py3 INT8 Synthetic TensorRT 7.2.1 NVIDIA T4
8 3,884 images/sec 56 images/sec/watt 2.1 1x T4 Supermicro SYS-4029GP-TRT 20.10-py3 INT8 Synthetic TensorRT 7.2.1 NVIDIA T4
32 4,914 images/sec - 6.5 1x T4 Supermicro SYS-4029GP-TRT 20.10-py3 INT8 Synthetic TensorRT 7.2.1 NVIDIA T4
128 5,271 images/sec 75 images/sec/watt 24 1x T4 Supermicro SYS-4029GP-TRT 20.10-py3 INT8 Synthetic TensorRT 7.2.1 NVIDIA T4
ResNet-50v1.5 1 1,382 images/sec 20 images/sec/watt 0.72 1x T4 Supermicro SYS-4029GP-TRT 20.10-py3 INT8 Synthetic TensorRT 7.2.1 NVIDIA T4
2 2,186 images/sec 31 images/sec/watt 0.92 1x T4 Supermicro SYS-4029GP-TRT 20.10-py3 INT8 Synthetic TensorRT 7.2.1 NVIDIA T4
8 3,785 images/sec 54 images/sec/watt 2.1 1x T4 Supermicro SYS-4029GP-TRT 20.10-py3 INT8 Synthetic TensorRT 7.2.1 NVIDIA T4
30 4,585 images/sec - 6.5 1x T4 Supermicro SYS-4029GP-TRT 20.10-py3 INT8 Synthetic TensorRT 7.2.1 NVIDIA T4
128 4,846 images/sec 69 images/sec/watt 26 1x T4 Supermicro SYS-4029GP-TRT 20.10-py3 INT8 Synthetic TensorRT 7.2.1 NVIDIA T4
BERT-BASE 1 725 sequences/sec 11 sequences/sec/watt 1.4 1x T4 Supermicro SYS-1029GQ-TRT 20.11-py3 INT8 Sample Text TensorRT 7.2.1 NVIDIA T4
2 1079 sequences/sec 17 sequences/sec/watt 1.9 1x T4 Supermicro SYS-1029GQ-TRT 20.11-py3 INT8 Sample Text TensorRT 7.2.1 NVIDIA T4
8 1,720 sequences/sec 28 sequences/sec/watt 4.7 1x T4 Supermicro SYS-1029GQ-TRT 20.11-py3 INT8 Sample Text TensorRT 7.2.1 NVIDIA T4
128 1,818 sequences/sec 28 sequences/sec/watt 70 1x T4 Supermicro SYS-1029GQ-TRT 20.11-py3 INT8 Sample Text TensorRT 7.2.1 NVIDIA T4
BERT-LARGE 1 261 sequences/sec 4.2 sequences/sec/watt 3.8 1x T4 Supermicro SYS-1029GQ-TRT 20.11-py3 INT8 Sample Text TensorRT 7.2.1 NVIDIA T4
2 390 sequences/sec 6.2 sequences/sec/watt 5.1 1x T4 Supermicro SYS-1029GQ-TRT 20.11-py3 INT8 Sample Text TensorRT 7.2.1 NVIDIA T4
8 555 sequences/sec 8.9 sequences/sec/watt 14 1x T4 Supermicro SYS-1029GQ-TRT 20.11-py3 INT8 Sample Text TensorRT 7.2.1 NVIDIA T4
128 561 sequences/sec 8.3 sequences/sec/watt 228 1x T4 Supermicro SYS-1029GQ-TRT 20.11-py3 INT8 Sample Text TensorRT 7.2.1 NVIDIA T4
WaveGlow 1 187,595 output samples/sec - 1.2 1x T4 Supermicro SYS-4029GP-TRT 20.10-py3 Mixed LJSpeech 1.1 PyTorch 1.7.0a0+7036e91 NVIDIA T4

NGC: TensorRT Container
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container
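Since efficiency here is throughput divided by board power, the implied board power can be recovered from any row. A sanity check using the ResNet-50 batch-128 row above; the recovered figure matches the T4's specified 70 W board power:

```python
def implied_board_power(throughput, efficiency):
    """Watts implied by a throughput/efficiency pair
    (efficiency = throughput / board power)."""
    return throughput / efficiency

# ResNet-50, batch 128 on T4: 5,271 images/sec at 75 images/sec/watt.
print(f"{implied_board_power(5_271, 75):.0f} W")
```

Running the same check on other rows is a quick way to spot efficiency figures measured at lower-than-peak power draw (for example, small batches that leave the GPU partially idle).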

 

Last updated: January 15th, 2021