Review the latest GPU acceleration factors of popular HPC applications.

Please refer to the Measuring Training and Inferencing Performance on NVIDIA AI Platforms Reviewer’s Guide for instructions on how to reproduce these performance claims.


NVIDIA’s complete solution stack, from GPUs to libraries to containers on NVIDIA GPU Cloud (NGC), allows data scientists to quickly get up and running with deep learning. The NVIDIA® A100 Tensor Core GPU provides unprecedented acceleration at every scale and across every framework and type of neural network. NVIDIA® V100 Tensor Core GPUs leverage mixed precision to accelerate deep learning training and break performance records on MLPerf, AI’s first industry-wide benchmark, a testament to our GPU-accelerated platform approach.
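Mixed precision keeps most math in FP16 while relying on loss scaling to stop small gradients from underflowing. A minimal, framework-agnostic sketch of the dynamic loss-scaling update rule used by AMP-style libraries (the constants are common defaults, assumed here rather than taken from this page):

```python
def update_loss_scale(scale, found_inf, good_steps,
                      growth_factor=2.0, backoff_factor=0.5,
                      growth_interval=2000):
    """One optimizer step of dynamic loss scaling.

    Returns the new (scale, good_steps). On overflow the scale is backed
    off and the clean-step streak resets; after growth_interval clean
    steps the scale is grown to reclaim FP16 dynamic range.
    """
    if found_inf:
        return scale * backoff_factor, 0
    good_steps += 1
    if good_steps >= growth_interval:
        return scale * growth_factor, 0
    return scale, good_steps

# Example: an overflow halves the scale and resets the streak.
scale, streak = update_loss_scale(65536.0, found_inf=True, good_steps=120)
print(scale, streak)  # 32768.0 0
```

In practice frameworks hide this behind helpers such as PyTorch's gradient scaler; the sketch only illustrates the back-off/grow policy.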

NVIDIA Performance on MLPerf 0.6 AI Benchmarks

ResNet-50 v1.5 Time to Solution on V100

MXNet | Batch Size: refer to the V100 Training Performance table below | Precision: Mixed | Dataset: ImageNet2012 | Convergence criteria: refer to MLPerf requirements

MLPerf Training Performance

NVIDIA Performance on MLPerf 0.6 AI Benchmarks

Framework Network Network Type Time to Solution GPU Server MLPerf-ID Precision Dataset GPU Version
MXNet ResNet-50 v1.5 CNN 115.22 minutes 8x V100 DGX-1 0.6-8 Mixed ImageNet2012 V100-SXM2-16GB
CNN 57.87 minutes 16x V100 DGX-2 0.6-17 Mixed ImageNet2012 V100-SXM3-32GB
CNN 52.74 minutes 16x V100 DGX-2H 0.6-19 Mixed ImageNet2012 V100-SXM3-32GB-H
CNN 2.59 minutes 512x V100 DGX-2H 0.6-29 Mixed ImageNet2012 V100-SXM3-32GB-H
CNN 1.69 minutes 1040x V100 DGX-1 0.6-16 Mixed ImageNet2012 V100-SXM2-16GB
CNN 1.33 minutes 1536x V100 DGX-2H 0.6-30 Mixed ImageNet2012 V100-SXM3-32GB-H
PyTorch SSD-ResNet-34 CNN 22.36 minutes 8x V100 DGX-1 0.6-9 Mixed COCO2017 V100-SXM2-16GB
CNN 12.21 minutes 16x V100 DGX-2 0.6-18 Mixed COCO2017 V100-SXM3-32GB
CNN 11.41 minutes 16x V100 DGX-2H 0.6-20 Mixed COCO2017 V100-SXM3-32GB-H
CNN 4.78 minutes 64x V100 DGX-2H 0.6-21 Mixed COCO2017 V100-SXM3-32GB-H
CNN 2.67 minutes 240x V100 DGX-1 0.6-13 Mixed COCO2017 V100-SXM2-16GB
CNN 2.56 minutes 240x V100 DGX-2H 0.6-24 Mixed COCO2017 V100-SXM3-32GB-H
CNN 2.23 minutes 240x V100 DGX-2H 0.6-27 Mixed COCO2017 V100-SXM3-32GB-H
Mask R-CNN CNN 207.48 minutes 8x V100 DGX-1 0.6-9 Mixed COCO2017 V100-SXM2-16GB
CNN 101 minutes 16x V100 DGX-2 0.6-18 Mixed COCO2017 V100-SXM3-32GB
CNN 95.2 minutes 16x V100 DGX-2H 0.6-20 Mixed COCO2017 V100-SXM3-32GB-H
CNN 32.72 minutes 64x V100 DGX-2H 0.6-21 Mixed COCO2017 V100-SXM3-32GB-H
CNN 22.03 minutes 192x V100 DGX-1 0.6-12 Mixed COCO2017 V100-SXM2-16GB
CNN 18.47 minutes 192x V100 DGX-2H 0.6-23 Mixed COCO2017 V100-SXM3-32GB-H
PyTorch GNMT RNN 20.55 minutes 8x V100 DGX-1 0.6-9 Mixed WMT16 English-German V100-SXM2-16GB
RNN 10.94 minutes 16x V100 DGX-2 0.6-18 Mixed WMT16 English-German V100-SXM3-32GB
RNN 9.87 minutes 16x V100 DGX-2H 0.6-20 Mixed WMT16 English-German V100-SXM3-32GB-H
RNN 2.12 minutes 256x V100 DGX-2H 0.6-25 Mixed WMT16 English-German V100-SXM3-32GB-H
RNN 1.99 minutes 384x V100 DGX-1 0.6-14 Mixed WMT16 English-German V100-SXM2-16GB
RNN 1.8 minutes 384x V100 DGX-2H 0.6-26 Mixed WMT16 English-German V100-SXM3-32GB-H
PyTorch Transformer Attention 20.34 minutes 8x V100 DGX-1 0.6-9 Mixed WMT17 English-German V100-SXM2-16GB
Attention 11.04 minutes 16x V100 DGX-2 0.6-18 Mixed WMT17 English-German V100-SXM3-32GB
Attention 9.8 minutes 16x V100 DGX-2H 0.6-20 Mixed WMT17 English-German V100-SXM3-32GB-H
Attention 2.41 minutes 160x V100 DGX-2H 0.6-22 Mixed WMT17 English-German V100-SXM3-32GB-H
Attention 2.05 minutes 480x V100 DGX-1 0.6-15 Mixed WMT17 English-German V100-SXM2-16GB
Attention 1.59 minutes 480x V100 DGX-2H 0.6-28 Mixed WMT17 English-German V100-SXM3-32GB-H
TensorFlow MiniGo Reinforcement Learning 27.39 minutes 8x V100 DGX-1 0.6-10 Mixed N/A V100-SXM2-16GB
Reinforcement Learning 13.57 minutes 24x V100 DGX-1 0.6-11 Mixed N/A V100-SXM2-16GB
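The time-to-solution rows above can be turned into a scaling-efficiency figure. A quick sketch using the ResNet-50 v1.5 DGX-2H entries from the table (the helper function is ours, for illustration only):

```python
def scaling_efficiency(base_minutes, base_gpus, scaled_minutes, scaled_gpus):
    """Fraction of ideal linear speedup achieved when adding GPUs."""
    actual = base_minutes / scaled_minutes   # measured speedup
    ideal = scaled_gpus / base_gpus          # perfect linear speedup
    return actual / ideal

# ResNet-50 v1.5 on DGX-2H: 52.74 min at 16 GPUs vs 2.59 min at 512 GPUs
print(f"{scaling_efficiency(52.74, 16, 2.59, 512):.0%}")  # 64%
```

Sub-linear scaling at thousands of GPUs is expected for time-to-convergence runs, since larger global batches change the optimization dynamics as well as the communication cost.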

Training Natural Language Processing

BERT Pre-Training Throughput

DGX A100 server w/ 8x NVIDIA A100 on PyTorch | DGX-1 server w/ 8x NVIDIA V100 on PyTorch, (2/3) Phase 1 and (1/3) Phase 2 | Precision: FP16 for A100 and Mixed for V100 | Sequence Length for Phase 1 = 128 and Phase 2 = 512

NVIDIA A100 BERT Training Benchmarks

Framework Network Network Type Throughput GPU Server Container Precision Batch Size Dataset GPU Version
PyTorch BERT Pre-Training Attention 2,274 sequences/sec 8x A100 DGX A100 - FP16 - Wikipedia+BookCorpus A100 SXM4-40GB

DGX A100 server w/ 8x NVIDIA A100 on PyTorch, (2/3) Phase 1 and (1/3) Phase 2 | Sequence Length for Phase 1 = 128 and Phase 2 = 512

Training Image Classification on CNNs

ResNet-50 v1.5 Throughput on V100

DGX-1: 8x NVIDIA V100-SXM2-16GB for MXNet and PyTorch. 8x NVIDIA V100-SXM2-32GB for TensorFlow, E5-2698 v4@2.2 GHz | Batch Size: MXNet = 208, PyTorch = 256 and TensorFlow = 512 | 20.03-py3 | Precision: Mixed | Dataset: ImageNet2012

ResNet-50 v1.5 Throughput on T4

Supermicro SYS-4029GP-TRT T4: 8x NVIDIA T4, Gold 6240@2.6 GHz for MXNet, PyTorch and TensorFlow | Batch Size: MXNet = 208, PyTorch and TensorFlow = 256 | 20.03-py3 | Precision: Mixed | Dataset: ImageNet2012

Training Performance

V100 Training Performance

Framework Network Throughput GPU Server Container Precision Batch Size Dataset GPU Version
MXNet Inception V3 554 images/sec 1x V100 DGX-1 20.03-py3 Mixed 208 ImageNet2012 V100-SXM2-16GB
623 images/sec 1x V100 DGX-2H 20.03-py3 Mixed 384 ImageNet2012 V100-SXM3-32GB-H
4,193 images/sec 8x V100 DGX-1 20.03-py3 Mixed 208 ImageNet2012 V100-SXM2-16GB
4,764 images/sec 8x V100 DGX-2H 20.03-py3 Mixed 384 ImageNet2012 V100-SXM3-32GB-H
ResNet-50 1,382 images/sec 1x V100 DGX-1 19.11-py3 Mixed 208 ImageNet2012 V100-SXM2-16GB
1,445 images/sec 1x V100 DGX-2 19.11-py3 Mixed 512 ImageNet2012 V100-SXM3-32GB
1,551 images/sec 1x V100 DGX-2H 19.11-py3 Mixed 512 ImageNet2012 V100-SXM3-32GB-H
10,358 images/sec 8x V100 DGX-1 19.11-py3 Mixed 192 ImageNet2012 V100-SXM2-16GB
10,805 images/sec 8x V100 DGX-2 19.11-py3 Mixed 256 ImageNet2012 V100-SXM3-32GB
11,507 images/sec 8x V100 DGX-2H 19.11-py3 Mixed 256 ImageNet2012 V100-SXM3-32GB-H
ResNet-50 v1.5 1,474 images/sec 1x V100 DGX-1 20.03-py3 Mixed 208 ImageNet2012 V100-SXM2-16GB
1,531 images/sec 1x V100 DGX-2 20.03-py3 Mixed 256 ImageNet2012 V100-SXM3-32GB
1,669 images/sec 1x V100 DGX-2H 20.03-py3 Mixed 256 ImageNet2012 V100-SXM3-32GB-H
10,615 images/sec 8x V100 DGX-1 20.03-py3 Mixed 208 ImageNet2012 V100-SXM2-16GB
11,543 images/sec 8x V100 DGX-2 20.03-py3 Mixed 256 ImageNet2012 V100-SXM3-32GB
12,101 images/sec 8x V100 DGX-2H 20.03-py3 Mixed 256 ImageNet2012 V100-SXM3-32GB-H
PyTorch Inception V3 532 images/sec 1x V100 DGX-1 20.03-py3 Mixed 256 ImageNet2012 V100-SXM2-16GB
589 images/sec 1x V100 DGX-2H 20.03-py3 Mixed 512 ImageNet2012 V100-SXM3-32GB-H
4,057 images/sec 8x V100 DGX-1 20.03-py3 Mixed 256 ImageNet2012 V100-SXM2-16GB
Mask R-CNN 15 images/sec 1x V100 DGX-1 20.03-py3 Mixed 16 COCO 2014 V100-SXM2-32GB
18 images/sec 1x V100 DGX-2H 20.03-py3 Mixed 16 COCO 2014 V100-SXM3-32GB-H
92 images/sec 8x V100 DGX-1 20.03-py3 Mixed 16 COCO 2014 V100-SXM2-32GB
ResNet-50 905 images/sec 1x V100 DGX-1 19.11-py3 Mixed 256 ImageNet2012 V100-SXM2-16GB
926 images/sec 1x V100 DGX-2 19.11-py3 Mixed 512 ImageNet2012 V100-SXM3-32GB
1,025 images/sec 1x V100 DGX-2H 19.11-py3 Mixed 512 ImageNet2012 V100-SXM3-32GB-H
6,179 images/sec 8x V100 DGX-1 19.11-py3 Mixed 256 ImageNet2012 V100-SXM2-16GB
6,595 images/sec 8x V100 DGX-2 19.11-py3 Mixed 512 ImageNet2012 V100-SXM3-32GB
7,151 images/sec 8x V100 DGX-2H 19.11-py3 Mixed 512 ImageNet2012 V100-SXM3-32GB-H
ResNet-50 v1.5 851 images/sec 1x V100 DGX-1 20.03-py3 Mixed 256 ImageNet2012 V100-SXM2-16GB
927 images/sec 1x V100 DGX-2H 20.03-py3 Mixed 512 ImageNet2012 V100-SXM3-32GB-H
6,615 images/sec 8x V100 DGX-1 20.03-py3 Mixed 256 ImageNet2012 V100-SXM2-16GB
ResNeXt101 312 images/sec 1x V100 DGX-1 20.03-py3 Mixed 128 ImageNet2012 V100-SXM2-16GB
332 images/sec 1x V100 DGX-2H 20.03-py3 Mixed 128 ImageNet2012 V100-SXM3-32GB-H
2,382 images/sec 8x V100 DGX-1 20.03-py3 Mixed 128 ImageNet2012 V100-SXM2-16GB
SE-ResNeXt101 266 images/sec 1x V100 DGX-1 20.03-py3 Mixed 128 ImageNet2012 V100-SXM2-16GB
286 images/sec 1x V100 DGX-2H 20.03-py3 Mixed 128 ImageNet2012 V100-SXM3-32GB-H
2,037 images/sec 8x V100 DGX-1 20.03-py3 Mixed 128 ImageNet2012 V100-SXM2-16GB
SSD v1.1 231 images/sec 1x V100 DGX-1 20.03-py3 Mixed 64 COCO 2017 V100-SXM2-16GB
260 images/sec 1x V100 DGX-2H 20.03-py3 Mixed 64 COCO 2017 V100-SXM3-32GB-H
1,795 images/sec 8x V100 DGX-1 20.03-py3 Mixed 64 COCO 2017 V100-SXM2-16GB
Tacotron2 16,129 total output mels/sec 1x V100 DGX-1 20.03-py3 Mixed 104 LJSpeech 1.1 V100-SXM2-32GB
20,651 total output mels/sec 1x V100 DGX-2H 20.01-py3 Mixed 104 LJSpeech 1.1 V100-SXM3-32GB-H
96,458 total output mels/sec 8x V100 DGX-1 20.03-py3 Mixed 104 LJSpeech 1.1 V100-SXM2-16GB
121,697 total output mels/sec 8x V100 DGX-2H 20.03-py3 Mixed 104 LJSpeech 1.1 V100-SXM3-32GB-H
WaveGlow 80,328 output samples/sec 1x V100 DGX-1 20.03-py3 Mixed 10 LJSpeech 1.1 V100-SXM2-16GB
91,194 output samples/sec 1x V100 DGX-2H 20.03-py3 Mixed 10 LJSpeech 1.1 V100-SXM3-32GB-H
548,178 output samples/sec 8x V100 DGX-1 20.03-py3 Mixed 10 LJSpeech 1.1 V100-SXM2-16GB
565,228 output samples/sec 8x V100 DGX-2H 20.02-py3 Mixed 10 LJSpeech 1.1 V100-SXM3-32GB-H
Jasper 41 sequences/sec 1x V100 DGX-1 20.03-py3 Mixed 64 LibriSpeech V100-SXM2-32GB
49 sequences/sec 1x V100 DGX-2H 20.03-py3 Mixed 64 LibriSpeech V100-SXM3-32GB-H
301 sequences/sec 8x V100 DGX-1 20.03-py3 Mixed 64 LibriSpeech V100-SXM2-32GB
Transformer 34,029 words/sec 1x V100 DGX-1 19.12-py3 Mixed 5120 wmt14-en-de V100-SXM2-16GB
42,220 words/sec 1x V100 DGX-2H 19.12-py3 Mixed 5120 wmt14-en-de V100-SXM3-32GB-H
243,712 words/sec 8x V100 DGX-1 20.01-py3 Mixed 5120 wmt14-en-de V100-SXM2-16GB
275,755 words/sec 8x V100 DGX-2H 19.12-py3 Mixed 5120 wmt14-en-de V100-SXM3-32GB-H
Transformer XL 29,697 total tokens/sec 1x V100 DGX-1 20.03-py3 Mixed 32 WikiText-103 V100-SXM2-16GB
33,992 total tokens/sec 1x V100 DGX-2H 20.03-py3 Mixed 32 WikiText-103 V100-SXM3-32GB-H
225,669 total tokens/sec 8x V100 DGX-1 20.03-py3 Mixed 32 WikiText-103 V100-SXM2-16GB
242,612 total tokens/sec 8x V100 DGX-2H 20.03-py3 Mixed 32 WikiText-103 V100-SXM3-32GB-H
TensorFlow Inception V3 794 images/sec 1x V100 DGX-1 20.02-py3 Mixed 384 ImageNet2012 V100-SXM2-32GB
923 images/sec 1x V100 DGX-2H 20.02-py3 Mixed 384 ImageNet2012 V100-SXM3-32GB-H
5,977 images/sec 8x V100 DGX-1 20.02-py3 Mixed 384 ImageNet2012 V100-SXM2-32GB
6,787 images/sec 8x V100 DGX-2H 20.02-py3 Mixed 384 ImageNet2012 V100-SXM3-32GB-H
Mask R-CNN 9 samples/sec 1x V100 DGX-1 20.03-py3 Mixed 4 COCO 2017 V100-SXM2-32GB
10 samples/sec 1x V100 DGX-2H 20.03-py3 Mixed 4 COCO 2017 V100-SXM3-32GB-H
62 samples/sec 8x V100 DGX-1 20.03-py3 Mixed 4 COCO 2017 V100-SXM2-32GB
85 samples/sec 8x V100 DGX-2H 20.03-py3 Mixed 4 COCO 2017 V100-SXM3-32GB-H
ResNet-50 v1.5 1,061 images/sec 1x V100 DGX-1 20.03-py3 Mixed 256 ImageNet2012 V100-SXM2-16GB
1,239 images/sec 1x V100 DGX-2H 20.03-py3 Mixed 512 ImageNet2012 V100-SXM3-32GB-H
8,111 images/sec 8x V100 DGX-1 20.03-py3 Mixed 512 ImageNet2012 V100-SXM2-32GB
8,461 images/sec 8x V100 DGX-2H 20.03-py3 Mixed 512 ImageNet2012 V100-SXM3-32GB-H
SSD v1.2 127 images/sec 1x V100 DGX-1 20.03-py3 Mixed 32 COCO 2017 V100-SXM2-16GB
150 images/sec 1x V100 DGX-2H 20.03-py3 Mixed 32 COCO 2017 V100-SXM3-32GB-H
641 images/sec 8x V100 DGX-1 20.03-py3 Mixed 32 COCO 2017 V100-SXM2-16GB
776 images/sec 8x V100 DGX-2H 20.03-py3 Mixed 32 COCO 2017 V100-SXM3-32GB-H
U-Net Industrial 100 images/sec 1x V100 DGX-1 20.03-py3 Mixed 16 DAGM2007 V100-SXM2-16GB
111 images/sec 1x V100 DGX-2H 20.03-py3 Mixed 16 DAGM2007 V100-SXM3-32GB-H
520 images/sec 8x V100 DGX-1 20.03-py3 Mixed 2 DAGM2007 V100-SXM2-16GB
583 images/sec 8x V100 DGX-2H 20.03-py3 Mixed 2 DAGM2007 V100-SXM3-32GB-H
U-Net Medical 54 images/sec 1x V100 DGX-1 20.03-py3 Mixed 8 EM segmentation challenge V100-SXM2-16GB
62 images/sec 1x V100 DGX-2H 20.03-py3 Mixed 8 EM segmentation challenge V100-SXM3-32GB-H
368 images/sec 8x V100 DGX-1 20.03-py3 Mixed 8 EM segmentation challenge V100-SXM2-16GB
415 images/sec 8x V100 DGX-2H 20.03-py3 Mixed 8 EM segmentation challenge V100-SXM3-32GB-H
V-Net Medical 586 images/sec 1x V100 DGX-1 20.03-py3 Mixed 32 Hippocampus head and body from Medical Segmentation Decathlon V100-SXM2-16GB
647 images/sec 1x V100 DGX-2H 20.03-py3 Mixed 32 Hippocampus head and body from Medical Segmentation Decathlon V100-SXM3-32GB-H
3,578 images/sec 8x V100 DGX-1 20.03-py3 Mixed 32 Hippocampus head and body from Medical Segmentation Decathlon V100-SXM2-16GB
3,770 images/sec 8x V100 DGX-2H 20.03-py3 Mixed 32 Hippocampus head and body from Medical Segmentation Decathlon V100-SXM3-32GB-H
VAE-CF 219,589 users processed/sec 1x V100 DGX-1 20.03-py3 Mixed 24576 MovieLens 20M V100-SXM2-16GB
242,117 users processed/sec 1x V100 DGX-2H 20.03-py3 Mixed 24576 MovieLens 20M V100-SXM3-32GB-H
288,873 users processed/sec 8x V100 DGX-1 20.03-py3 FP32 3072 MovieLens 20M V100-SXM2-32GB
PyTorch GNMT V2 57,404 total tokens/sec 1x V100 DGX-1 20.03-py3 Mixed 128 wmt16-en-de V100-SXM2-16GB
65,830 total tokens/sec 1x V100 DGX-2H 20.03-py3 Mixed 128 wmt16-en-de V100-SXM3-32GB-H
413,382 total tokens/sec 8x V100 DGX-1 20.03-py3 Mixed 128 wmt16-en-de V100-SXM2-16GB
453,784 total tokens/sec 8x V100 DGX-2H 20.03-py3 Mixed 128 wmt16-en-de V100-SXM3-32GB-H
TensorFlow GNMT V2 20,722 total tokens/sec 1x V100 DGX-1 20.03-py3 Mixed 128 wmt16-en-de V100-SXM2-16GB
24,377 total tokens/sec 1x V100 DGX-2H 20.03-py3 Mixed 128 wmt16-en-de V100-SXM3-32GB-H
126,197 total tokens/sec 8x V100 DGX-1 20.03-py3 Mixed 128 wmt16-en-de V100-SXM2-16GB
PyTorch NCF 22,109,870 samples/sec 1x V100 DGX-1 20.03-py3 Mixed 1048576 MovieLens 20M V100-SXM2-16GB
24,586,901 samples/sec 1x V100 DGX-2H 20.03-py3 Mixed 1048576 MovieLens 20M V100-SXM3-32GB-H
100,318,081 samples/sec 8x V100 DGX-1 20.03-py3 Mixed 1048576 MovieLens 20M V100-SXM2-16GB
107,394,252 samples/sec 8x V100 DGX-2H 20.03-py3 Mixed 1048576 MovieLens 20M V100-SXM3-32GB-H
TensorFlow NCF 26,677,181 samples/sec 1x V100 DGX-1 20.03-py3 Mixed 1048576 MovieLens 20M V100-SXM2-16GB
28,357,284 samples/sec 1x V100 DGX-2H 20.03-py3 Mixed 1048576 MovieLens 20M V100-SXM3-32GB-H
79,648,119 samples/sec 8x V100 DGX-1 20.03-py3 Mixed 1048576 MovieLens 20M V100-SXM2-16GB
PyTorch BERT Pre-Training 822 sequences/sec 8x V100 DGX-1 - Mixed - Wikipedia+BookCorpus V100-SXM2-16GB

BERT Pre-Training throughput using PyTorch, including (2/3) Phase 1 and (1/3) Phase 2 | Sequence Length for Phase 1 = 128 and Phase 2 = 512
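Comparing this 8x V100 figure with the 2,274 sequences/sec reported for 8x A100 in the BERT table earlier gives the generation-over-generation speedup directly; both runs use the same phase mix and sequence lengths:

```python
# BERT pre-training throughput, sequences/sec, taken from the tables above
a100_8gpu = 2274  # DGX A100 (8x A100), FP16
v100_8gpu = 822   # DGX-1 (8x V100), mixed precision

print(f"{a100_8gpu / v100_8gpu:.2f}x")  # 2.77x
```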

T4 Training Performance

Framework Network Throughput GPU Server Container Precision Batch Size Dataset GPU Version
MXNet Inception V3 185 images/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed 208 ImageNet2012 NVIDIA T4
1,437 images/sec 8x T4 Supermicro SYS-4029GP-TRT 20.03-py3 Mixed 208 ImageNet2012 NVIDIA T4
ResNet-50 v1.5 487 images/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed 208 ImageNet2012 NVIDIA T4
3,814 images/sec 8x T4 Supermicro SYS-4029GP-TRT 20.03-py3 Mixed 208 ImageNet2012 NVIDIA T4
PyTorch Inception V3 184 images/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed 256 ImageNet2012 NVIDIA T4
1,426 images/sec 8x T4 Supermicro SYS-4029GP-TRT 20.03-py3 Mixed 256 ImageNet2012 NVIDIA T4
Mask R-CNN 6 images/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed 4 COCO 2014 NVIDIA T4
39 images/sec 8x T4 Supermicro SYS-4029GP-TRT 20.03-py3 Mixed 4 COCO 2014 NVIDIA T4
ResNet-50 v1.5 284 images/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed 256 ImageNet2012 NVIDIA T4
2,272 images/sec 8x T4 Supermicro SYS-4029GP-TRT 20.03-py3 Mixed 256 ImageNet2012 NVIDIA T4
ResNeXt101 116 images/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed 128 ImageNet2012 NVIDIA T4
905 images/sec 8x T4 Supermicro SYS-4029GP-TRT 20.03-py3 Mixed 128 ImageNet2012 NVIDIA T4
SE-ResNeXt101 103 images/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed 128 ImageNet2012 NVIDIA T4
792 images/sec 8x T4 Supermicro SYS-4029GP-TRT 20.03-py3 Mixed 128 ImageNet2012 NVIDIA T4
SSD v1.1 78 images/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed 64 COCO 2017 NVIDIA T4
631 images/sec 8x T4 Supermicro SYS-4029GP-TRT 20.03-py3 Mixed 64 COCO 2017 NVIDIA T4
Tacotron2 13,407 total output mels/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed 104 LJSpeech 1.1 NVIDIA T4
93,127 total output mels/sec 8x T4 Supermicro SYS-4029GP-TRT 20.03-py3 Mixed 104 LJSpeech 1.1 NVIDIA T4
WaveGlow 35,477 output samples/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed 10 LJSpeech 1.1 NVIDIA T4
252,348 output samples/sec 8x T4 Supermicro SYS-4029GP-TRT 20.03-py3 Mixed 10 LJSpeech 1.1 NVIDIA T4
Jasper 13 sequences/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed 32 LibriSpeech NVIDIA T4
98 sequences/sec 8x T4 Supermicro SYS-4029GP-TRT 20.03-py3 Mixed 32 LibriSpeech NVIDIA T4
Transformer 12,037 words/sec 1x T4 Supermicro SYS-4029GP-TRT 19.12-py3 Mixed 5120 wmt14-en-de NVIDIA T4
68,177 words/sec 8x T4 Supermicro SYS-4029GP-TRT 19.12-py3 Mixed 5120 wmt14-en-de NVIDIA T4
Transformer XL 12,271 total tokens/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed 32 WikiText-103 NVIDIA T4
82,561 total tokens/sec 8x T4 Supermicro SYS-4029GP-TRT 20.03-py3 Mixed 32 WikiText-103 NVIDIA T4
TensorFlow Inception V3 255 images/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.02-py3 Mixed 192 ImageNet2012 NVIDIA T4
1,914 images/sec 8x T4 Supermicro SYS-4029GP-TRT 20.02-py3 Mixed 192 ImageNet2012 NVIDIA T4
ResNet-50 v1.5 359 images/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed 256 ImageNet2012 NVIDIA T4
2,784 images/sec 8x T4 Supermicro SYS-4029GP-TRT 20.03-py3 Mixed 256 ImageNet2012 NVIDIA T4
SSD v1.2 63 images/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed 32 COCO 2017 NVIDIA T4
299 images/sec 8x T4 Supermicro SYS-4029GP-TRT 20.03-py3 Mixed 32 COCO 2017 NVIDIA T4
U-Net Industrial 32 images/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed 16 DAGM2007 NVIDIA T4
208 images/sec 8x T4 Supermicro SYS-4029GP-TRT 20.03-py3 Mixed 2 DAGM2007 NVIDIA T4
U-Net Medical 20 images/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed 8 EM segmentation challenge NVIDIA T4
149 images/sec 8x T4 Supermicro SYS-4029GP-TRT 20.03-py3 Mixed 8 EM segmentation challenge NVIDIA T4
V-Net Medical 157 images/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 FP32 32 Hippocampus head and body from Medical Segmentation Decathlon NVIDIA T4
1,134 images/sec 8x T4 Supermicro SYS-4029GP-TRT 20.03-py3 FP32 32 Hippocampus head and body from Medical Segmentation Decathlon NVIDIA T4
VAE-CF 80,450 users processed/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed 24576 MovieLens 20M NVIDIA T4
PyTorch GNMT V2 23,552 total tokens/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed 128 wmt16-en-de NVIDIA T4
146,691 total tokens/sec 8x T4 Supermicro SYS-4029GP-TRT 20.03-py3 Mixed 128 wmt16-en-de NVIDIA T4
TensorFlow GNMT V2 9,642 total tokens/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed 128 wmt16-en-de NVIDIA T4
55,559 total tokens/sec 8x T4 Supermicro SYS-4029GP-TRT 20.03-py3 Mixed 128 wmt16-en-de NVIDIA T4
PyTorch NCF 7,899,692 samples/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed 1048576 MovieLens 20M NVIDIA T4
28,076,758 samples/sec 8x T4 Supermicro SYS-4029GP-TRT 20.03-py3 Mixed 1048576 MovieLens 20M NVIDIA T4
TensorFlow NCF 10,371,809 samples/sec 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed 1048576 MovieLens 20M NVIDIA T4
19,956,985 samples/sec 8x T4 Supermicro SYS-4029GP-TRT 20.03-py3 Mixed 1048576 MovieLens 20M NVIDIA T4

 

NVIDIA® TensorRT™ running on NVIDIA GPUs enables the most efficient deep learning inference performance across multiple application areas and models. This versatility gives data scientists wide latitude to create the optimal low-latency solution. Visit NVIDIA GPU Cloud (NGC) to download any of these containers.

NVIDIA A100 Tensor Core GPUs provide unprecedented acceleration at every scale and across every framework and type of neural network. Third-generation Tensor Cores bring maximum versatility by accelerating a full range of precisions, from FP32 to FP16 to INT8 and all the way down to INT4, and extend NVIDIA’s AI inference leadership.

NVIDIA V100 Tensor Core GPUs leverage mixed precision to combine high throughput with low latency across every type of neural network. The NVIDIA T4 is an inference GPU designed for optimal power consumption and latency in ultra-efficient scale-out servers. Read the inference whitepaper to learn more about NVIDIA’s inference platform.

Measuring inference performance involves balancing many variables. PLASTER is an acronym that captures the key elements of deep learning performance: each letter identifies a factor (Programmability, Latency, Accuracy, Size of Model, Throughput, Energy Efficiency, Rate of Learning) that must be weighed to arrive at the right set of tradeoffs and produce a successful deep learning implementation. Refer to NVIDIA’s PLASTER whitepaper for more details.
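The Latency element of PLASTER, for instance, is usually reported as a tail percentile rather than a mean, since a few slow queries dominate user experience. A minimal sketch of a nearest-rank percentile computation (the sample values are illustrative, not measurements from this page):

```python
def percentile_latency(latencies_ms, pct):
    """Nearest-rank percentile: smallest value with >= pct% of samples at or below it."""
    ranked = sorted(latencies_ms)
    rank = -(-len(ranked) * pct // 100)  # ceil(n * pct / 100)
    return ranked[max(rank - 1, 0)]

samples = [1.2, 1.3, 1.1, 1.4, 5.0, 1.2, 1.3, 1.2, 1.5, 9.8]
print(percentile_latency(samples, 50), percentile_latency(samples, 99))  # 1.3 9.8
```

Note how the p99 figure is dominated by the single slow outlier even though the median barely moves; this is why latency-bounded serving benchmarks constrain the tail.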

NVIDIA landed top performance spots on all five MLPerf Inference 0.5 benchmarks with the best per-accelerator performance among commercially available products.

NVIDIA Performance on MLPerf Inference v0.5 Benchmarks (Server)

NVIDIA Turing 70W: Supermicro 4029GP-TRT-OTO-28: 1x NVIDIA T4, 2x Intel(R) Xeon(R) Platinum 8280 CPU @ 2.7GHz | TensorRT 6.0 | CUDA 10.1 | cuDNN 7.6.3
NVIDIA Turing 280W: SCAN 3XS DBP T496X2 Fluid: 1x TitanRTX, 2x Intel(R) Xeon(R) 8268 | TensorRT 6.0 | CUDA 10.1 | cuDNN 7.6.3


NVIDIA Performance on MLPerf Inference v0.5 Benchmarks (Offline)

NVIDIA Turing 70W: Supermicro 4029GP-TRT-OTO-28: 1x NVIDIA T4, 2x Intel(R) Xeon(R) Platinum 8280 CPU @ 2.7GHz | TensorRT 6.0 | CUDA 10.1 | cuDNN 7.6.3
NVIDIA Turing 280W: SCAN 3XS DBP T496X2 Fluid: 1x TitanRTX, 2x Intel(R) Xeon(R) 8268 | TensorRT 6.0 | CUDA 10.1 | cuDNN 7.6.3


MLPerf v0.5 Inference results for data center server form factors and offline and server scenarios retrieved from www.mlperf.org on Nov. 6, 2019, from entries Inf-0.5-15, Inf-0.5-16, Inf-0.5-19, Inf-0.5-21, Inf-0.5-22, Inf-0.5-23, Inf-0.5-25, Inf-0.5-26, Inf-0.5-27. Per-processor performance is calculated by dividing the primary metric of total performance by the number of accelerators reported.
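The per-processor normalization described above is a plain division of the submitted total by the accelerator count; sketched below (the entry values are illustrative, not taken from a specific submission):

```python
def per_accelerator(total_metric, num_accelerators):
    """Normalize a submission's primary metric to a single accelerator."""
    return total_metric / num_accelerators

# e.g. a hypothetical entry reporting 40,000 inputs/sec across 8 accelerators
print(per_accelerator(40000, 8))  # 5000.0 inputs/sec per accelerator
```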

MLPerf name and logo are trademarks.

NVIDIA Performance on MLPerf Inference v0.5 Benchmarks (ResNet-50 V1.5 Offline Scenario)

MLPerf v0.5 Inference results for data center server form factors and offline scenario retrieved from www.mlperf.org on Nov. 6, 2019 (Closed Inf-0.5-25 and Inf-0.5-27 for INT8, Open Inf-0.5-461 and Inf-0.5-463 for INT4). Per-processor performance is calculated by dividing the primary metric of total performance by number of accelerators reported. MLPerf name and logo are trademarks.

MLPerf Inference Performance

NVIDIA Turing 70W

Network Network Type Batch Size Throughput Efficiency Latency GPU Server Container Precision Dataset GPU Version
MobileNet v1 Server - 16,884 queries/sec - - 1x T4 Supermicro 4029GP-TRT-OTO-28 - - ImageNet [224x224] NVIDIA T4
MobileNet v1 Offline - 17,726 inputs/sec - - 1x T4 Supermicro 4029GP-TRT-OTO-28 - - ImageNet [224x224] NVIDIA T4
ResNet-50 v1.5 Server - 5,193 queries/sec - - 1x T4 Supermicro 4029GP-TRT-OTO-28 - - ImageNet [224x224] NVIDIA T4
ResNet-50 v1.5 Offline - 5,622 inputs/sec - - 1x T4 Supermicro 4029GP-TRT-OTO-28 - - ImageNet [224x224] NVIDIA T4
SSD MobileNet v1 Server - 7,078 queries/sec - - 1x T4 Supermicro 4029GP-TRT-OTO-28 - - COCO [300x300] NVIDIA T4
SSD MobileNet v1 Offline - 7,609 inputs/sec - - 1x T4 Supermicro 4029GP-TRT-OTO-28 - - COCO [300x300] NVIDIA T4
SSD ResNet-34 Server - 126 queries/sec - - 1x T4 Supermicro 4029GP-TRT-OTO-28 - - COCO [1200x1200] NVIDIA T4
SSD ResNet-34 Offline - 137 inputs/sec - - 1x T4 Supermicro 4029GP-TRT-OTO-28 - - COCO [1200x1200] NVIDIA T4
GNMT Server - 198 queries/sec - - 1x T4 Supermicro 4029GP-TRT-OTO-28 - - WMT16 NVIDIA T4
GNMT Offline - 354 inputs/sec - - 1x T4 Supermicro 4029GP-TRT-OTO-28 - - WMT16 NVIDIA T4

Supermicro 4029GP-TRT-OTO-28: 1x NVIDIA T4, 2x Intel(R) Xeon(R) Platinum 8280 CPU @ 2.7GHz | TensorRT 6.0 | CUDA 10.1 | cuDNN 7.6.3

NVIDIA Turing 280W

Network Network Type Batch Size Throughput Efficiency Latency GPU Server Container Precision Dataset GPU Version
MobileNet v1 Server - 49,775 queries/sec - - 1x TitanRTX SCAN 3XS DBP T496X2 Fluid - - ImageNet [224x224] TitanRTX
MobileNet v1 Offline - 55,597 inputs/sec - - 1x TitanRTX SCAN 3XS DBP T496X2 Fluid - - ImageNet [224x224] TitanRTX
ResNet-50 v1.5 Server - 15,008 queries/sec - - 1x TitanRTX SCAN 3XS DBP T496X2 Fluid - - ImageNet [224x224] TitanRTX
ResNet-50 v1.5 Offline - 16,563 inputs/sec - - 1x TitanRTX SCAN 3XS DBP T496X2 Fluid - - ImageNet [224x224] TitanRTX
SSD MobileNet v1 Server - 20,503 queries/sec - - 1x TitanRTX SCAN 3XS DBP T496X2 Fluid - - COCO [300x300] TitanRTX
SSD MobileNet v1 Offline - 22,945 inputs/sec - - 1x TitanRTX SCAN 3XS DBP T496X2 Fluid - - COCO [300x300] TitanRTX
SSD ResNet-34 Server - 388 queries/sec - - 1x TitanRTX SCAN 3XS DBP T496X2 Fluid - - COCO [1200x1200] TitanRTX
SSD ResNet-34 Offline - 415 inputs/sec - - 1x TitanRTX SCAN 3XS DBP T496X2 Fluid - - COCO [1200x1200] TitanRTX
GNMT Server - 645 queries/sec - - 1x TitanRTX SCAN 3XS DBP T496X2 Fluid - - WMT16 TitanRTX
GNMT Offline - 1,061 inputs/sec - - 1x TitanRTX SCAN 3XS DBP T496X2 Fluid - - WMT16 TitanRTX

SCAN 3XS DBP T496X2 Fluid: 1x TitanRTX, 2x Intel(R) Xeon(R) 8268 | TensorRT 6.0 | CUDA 10.1 | cuDNN 7.6.3

 

Inference Natural Language Processing

BERT Inference Throughput

DGX A100 server w/ 1x NVIDIA A100 with 7 MIG instances of 1g.5gb | Batch Size = 94 | Precision: INT8 | Sequence Length = 128
DGX-1 server w/ 1x NVIDIA V100 | TensorRT 7.1 | Batch Size = 256 | Precision: Mixed | Sequence Length = 128

 

NVIDIA A100 BERT Inference Benchmarks

Network Network Type Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
BERT-Large with Sparsity Attention 94 6,188 sequences/sec - - 1x A100 DGX A100 - INT8 Sample Text - A100 SXM4-40GB

A100 with 7 MIG instances of 1g.5gb | Sequence length=128 | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container

Inference Image Classification on CNNs with TensorRT

ResNet-50 v1.5 Throughput

DGX-1: Xeon E5-2698 v4 @2.2 GHz w/ 1x NVIDIA V100-SXM2-16GB | TensorRT 7.0 | Batch Size = 128 | 20.03-py3 | Precision: Mixed | Dataset: Synthetic
Supermicro SYS-1029GQ-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 7.0 | Batch Size = 128 | 20.03-py3 | Precision: INT8 | Dataset: Synthetic

 
 

ResNet-50 v1.5 Latency

GIGABYTE G291-Z20-00: AMD EPYC 7702P @2.0 GHz w/ 1x NVIDIA V100S-PCIE-32GB | TensorRT 7.0 | Batch Size = 1 | 20.03-py3 | Precision: INT8 | Dataset: Synthetic
Supermicro SYS-1029GQ-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 7.0 | Batch Size = 1 | 20.03-py3 | Precision: INT8 | Dataset: Synthetic

 
 

ResNet-50 v1.5 Power Efficiency

DGX-1: Xeon E5-2698 v4 @2.2 GHz w/ 1x NVIDIA V100-SXM2-16GB | TensorRT 7.0 | Batch Size = 128 | 20.03-py3 | Precision: Mixed | Dataset: Synthetic
Supermicro SYS-1029GQ-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 7.0 | Batch Size = 128 | 20.03-py3 | Precision: INT8 | Dataset: Synthetic
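Power efficiency here is simply throughput divided by board power. A short sketch, assuming the T4's nominal 70 W board power (the throughput value is illustrative, not a measured figure from this page):

```python
def perf_per_watt(throughput, board_power_watts):
    """Throughput (e.g. images/sec) per watt of board power."""
    return throughput / board_power_watts

# e.g. 4,900 images/sec on a 70 W T4 board:
print(perf_per_watt(4900, 70))  # 70.0 images/sec/watt
```

This is why the low-power T4 can lead on efficiency even when the absolute throughput crown belongs to larger SXM boards.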

 

Inference Performance

V100 Inference Performance

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
MobileNet V1 1 4,715 images/sec 30.25 images/sec/watt 0.21 1x V100 DGX-1 20.03-py3 INT8 Synthetic TensorRT 7.0.0 V100-SXM2-16GB
2 6,453 images/sec 43.1 images/sec/watt 0.31 1x V100 DGX-1 20.03-py3 INT8 Synthetic TensorRT 7.0.0 V100-SXM2-32GB
8 15,621 images/sec 73 images/sec/watt 0.51 1x V100 DGX-2 19.12-py3 INT8 Synthetic TensorRT 6.0.1 V100-SXM3-32GB
220 31,610 images/sec 125.36 images/sec/watt 6.96 1x V100 DGX-2 20.03-py3 INT8 Synthetic TensorRT 7.0.0 V100-SXM3-32GB
128 29,737 images/sec 139.32 images/sec/watt 4.3 1x V100 DGX-1 20.03-py3 INT8 Synthetic TensorRT 7.0.0 V100-SXM2-16GB
ResNet-50 1 1,150 images/sec 7.5 images/sec/watt 0.87 1x V100 DGX-1 20.03-py3 INT8 Synthetic TensorRT 7.0.0 V100-SXM2-16GB
2 1,586 images/sec 9.74 images/sec/watt 1.26 1x V100 DGX-1 20.03-py3 INT8 Synthetic TensorRT 7.0.0 V100-SXM2-16GB
8 3,404 images/sec 16.36 images/sec/watt 2.35 1x V100 DGX-1 20.03-py3 Mixed Synthetic TensorRT 7.0.0 V100-SXM2-16GB
52 7,760 images/sec 22.45 images/sec/watt 6.7 1x V100 DGX-2 20.03-py3 Mixed Synthetic TensorRT 7.0.0 V100-SXM3-32GB
128 7,448 images/sec 25.62 images/sec/watt 17 1x V100 DGX-1 20.03-py3 Mixed Synthetic TensorRT 7.0.0 V100-SXM2-16GB
128 7,867 images/sec 23.12 images/sec/watt 16.27 1x V100 DGX-2 20.03-py3 Mixed Synthetic TensorRT 7.0.0 V100-SXM3-32GB
ResNet-50v1.5 1 1,006 images/sec 7.16 images/sec/watt 0.99 1x V100 GIGABYTE G291-Z20-00 20.03-py3 INT8 Synthetic TensorRT 7.0.0 V100S-PCIE-32GB
2 1,416 images/sec 8.41 images/sec/watt 1.41 1x V100 DGX-1 20.03-py3 INT8 Synthetic TensorRT 7.0.0 V100-SXM2-16GB
8 3,301 images/sec 16.73 images/sec/watt 2.42 1x V100 DGX-1 20.03-py3 Mixed Synthetic TensorRT 7.0.0 V100-SXM2-16GB
50 7,216 images/sec 20.94 images/sec/watt 6.93 1x V100 DGX-2 20.03-py3 Mixed Synthetic TensorRT 7.0.0 V100-SXM3-32GB
128 7,100 images/sec 24.23 images/sec/watt 18 1x V100 DGX-1 20.03-py3 Mixed Synthetic TensorRT 7.0.0 V100-SXM2-16GB
128 7,460 images/sec 21.76 images/sec/watt 17.16 1x V100 DGX-2 20.03-py3 Mixed Synthetic TensorRT 7.0.0 V100-SXM3-32GB
NCF 16,384 34,601,952 samples/sec - samples/sec/watt 0 1x V100 DGX-2 20.02-py3 Mixed MovieLens 20 Million PyTorch 1.5.0a0+3bbb36e V100-SXM3-32GB
BERT-BASE 1 766 sequences/sec - sequences/sec/watt 1.31 1x V100 DGX-1 20.03-py3 Mixed Sample Text TensorRT 7.0.0 V100-SXM2-16GB
2 1,295 sequences/sec - sequences/sec/watt 1.54 1x V100 DGX-1 20.03-py3 Mixed Sample Text TensorRT 7.0.0 V100-SXM2-16GB
8 2,358 sequences/sec - sequences/sec/watt 3.39 1x V100 DGX-1 20.03-py3 Mixed Sample Text TensorRT 7.0.0 V100-SXM2-16GB
26 3,002 sequences/sec 13.36 sequences/sec/watt 8.66 1x V100 DGX-2 20.03-py3 Mixed Sample Text TensorRT 7.0.0 V100-SXM3-32GB
128 3,033 sequences/sec 11.82 sequences/sec/watt 42.2 1x V100 DGX-2 20.03-py3 Mixed Sample Text TensorRT 7.0.0 V100-SXM3-32GB
BERT-LARGE 256 897 sequences/sec - - 1x V100 DGX-1 - Mixed Sample Text TensorRT 7.1 V100-SXM2-16GB
Mask R-CNN 1 17 images/sec 0.1 images/sec/watt - 1x V100 SuperMicro Server 20.03-py3 FP32 COCO 2014 PyTorch 1.5.0a0+8f84ded V100S-PCIE-32GB
2 22 images/sec 0.26 images/sec/watt 0.18 1x V100 DGX-1 20.03-py3 Mixed COCO 2017 TensorFlow 1.15.2 V100-SXM2-16GB
8 26 images/sec 0.29 images/sec/watt - 1x V100 DGX-1 20.03-py3 Mixed COCO 2017 TensorFlow 1.15.2 V100-SXM2-16GB
ResNeXt101 1 63 images/sec 0.91 images/sec/watt 15.92 1x V100 SuperMicro Server 20.03-py3 Mixed ImageNet2012 PyTorch 1.5.0a0+8f84ded V100S-PCIE-32GB
2 125 images/sec 1.53 images/sec/watt 15.88 1x V100 SuperMicro Server 20.03-py3 Mixed ImageNet2012 PyTorch 1.5.0a0+8f84ded V100S-PCIE-32GB
8 433 images/sec 3.15 images/sec/watt 18.43 1x V100 SuperMicro Server 20.03-py3 Mixed ImageNet2012 PyTorch 1.5.0a0+8f84ded V100S-PCIE-32GB
128 842 images/sec 3.99 images/sec/watt 124.21 1x V100 SuperMicro Server 20.03-py3 Mixed ImageNet2012 PyTorch 1.5.0a0+8f84ded V100S-PCIE-32GB
SE-ResNeXt101 1 44 images/sec 0.5 images/sec/watt 22.63 1x V100 SuperMicro Server 20.03-py3 FP32 ImageNet2012 PyTorch 1.5.0a0+8f84ded V100S-PCIE-32GB
2 89 images/sec 0.73 images/sec/watt 22.33 1x V100 SuperMicro Server 20.03-py3 FP32 ImageNet2012 PyTorch 1.5.0a0+8f84ded V100S-PCIE-32GB
8 343 images/sec 2.69 images/sec/watt 23.34 1x V100 SuperMicro Server 20.03-py3 Mixed ImageNet2012 PyTorch 1.5.0a0+8f84ded V100S-PCIE-32GB
128 817 images/sec 3.82 images/sec/watt 132.74 1x V100 SuperMicro Server 20.03-py3 Mixed ImageNet2012 PyTorch 1.5.0a0+8f84ded V100S-PCIE-32GB
SSD ResNet-50 1 174 images/sec - images/sec/watt - 1x V100 DGX-2 20.03-py3 Mixed COCO 2017 TensorFlow 1.15.2 V100-SXM3-32GB
2 233 images/sec - images/sec/watt - 1x V100 DGX-2 20.03-py3 Mixed COCO 2017 TensorFlow 1.15.2 V100-SXM3-32GB
8 465 images/sec - images/sec/watt - 1x V100 GIGABYTE G291-Z20-00 20.03-py3 Mixed COCO 2017 PyTorch 1.5.0a0+8f84ded V100S-PCIE-32GB
32 555 images/sec - images/sec/watt - 1x V100 GIGABYTE G291-Z20-00 20.03-py3 Mixed COCO 2017 PyTorch 1.5.0a0+8f84ded V100S-PCIE-32GB
Tacotron2 1 1,150 total output mels/sec - total output mels/sec/watt 1739.87 1x V100 SuperMicro Server 20.01-py3 FP32 LJSpeech 1.1 PyTorch 1.4.0a0+a5b4d78 V100S-PCIE-32GB
4 4,198 total output mels/sec - total output mels/sec/watt 1905.76 1x V100 SuperMicro Server 20.03-py3 FP32 LJSpeech 1.1 PyTorch 1.5.0a0+8f84ded V100S-PCIE-32GB
Transformer XL 1 5,961 total tokens/sec 68.7 total tokens/sec/watt 10.74 1x V100 SuperMicro Server 20.03-py3 Mixed WikiText-103 PyTorch 1.5.0a0+8f84ded V100S-PCIE-32GB
2 11,617 total tokens/sec - total tokens/sec/watt 11.02 1x V100 SuperMicro Server 20.03-py3 Mixed WikiText-103 PyTorch 1.5.0a0+8f84ded V100S-PCIE-32GB
8 38,676 total tokens/sec - total tokens/sec/watt 13.23 1x V100 SuperMicro Server 20.03-py3 Mixed WikiText-103 PyTorch 1.5.0a0+8f84ded V100S-PCIE-32GB
32 54,532 total tokens/sec - total tokens/sec/watt 37.52 1x V100 DGX-1 20.03-py3 Mixed WikiText-103 PyTorch 1.5.0a0+8f84ded V100-SXM2-16GB
U-Net Industrial 16 322 images/sec 1.44 images/sec/watt - 1x V100 DGX-1 20.03-py3 Mixed DAGM2007 TensorFlow 1.15.2 V100-SXM2-16GB
U-Net Medical 1 132 images/sec - images/sec/watt 7.59 1x V100 DGX-1 20.03-py3 Mixed EM segmentation challenge TensorFlow 1.15.2 V100-SXM2-16GB
2 153 images/sec - images/sec/watt 13.1 1x V100 DGX-1 20.03-py3 Mixed EM segmentation challenge TensorFlow 1.15.2 V100-SXM2-16GB
4 173 images/sec - images/sec/watt 23.07 1x V100 DGX-1 20.03-py3 Mixed EM segmentation challenge TensorFlow 1.15.2 V100-SXM2-16GB
8 185 images/sec - images/sec/watt 45.08 1x V100 DGX-1 20.03-py3 Mixed EM segmentation challenge TensorFlow 1.15.2 V100-SXM2-16GB
16 197 images/sec 0.82 images/sec/watt 83.52 1x V100 DGX-2 20.03-py3 Mixed EM segmentation challenge TensorFlow 1.15.2 V100-SXM3-32GB
V-Net Medical 1 315 images/sec - images/sec/watt - 1x V100 DGX-2 20.01-py3 Mixed Hippocampus head and body from Medical Segmentation Decathlon TensorFlow 1.15.0 V100-SXM3-32GB
2 620 images/sec - images/sec/watt - 1x V100 DGX-2 20.01-py3 Mixed Hippocampus head and body from Medical Segmentation Decathlon TensorFlow 1.15.0 V100-SXM3-32GB
8 1,356 images/sec - images/sec/watt - 1x V100 DGX-1 20.01-py3 Mixed Hippocampus head and body from Medical Segmentation Decathlon TensorFlow 1.15.0 V100-SXM2-16GB
32 2,064 images/sec - images/sec/watt - 1x V100 GIGABYTE G291-Z20-00 20.01-py3 Mixed Hippocampus head and body from Medical Segmentation Decathlon TensorFlow 1.15.0 V100S-PCIE-32GB
VAE-CF 128 74,763 users processed/sec - users processed/sec/watt - 1x V100 DGX-2 20.03-py3 FP32 MovieLens 20M TensorFlow 1.15.2 V100-SXM3-32GB
24576 211,925 users processed/sec - users processed/sec/watt - 1x V100 GIGABYTE G291-Z20-00 20.03-py3 Mixed MovieLens 20M TensorFlow 1.15.2 V100S-PCIE-32GB
WaveGlow 1 361,665 output samples/sec - output samples/sec/watt 633.51 1x V100 DGX-1 20.03-py3 Mixed LJSpeech 1.1 PyTorch 1.5.0a0+8f84ded V100-SXM2-16GB
4 455,940 output samples/sec - output samples/sec/watt 2010.09 1x V100 DGX-1 20.03-py3 Mixed LJSpeech 1.1 PyTorch 1.5.0a0+8f84ded V100-SXM2-16GB

NGC: TensorRT Container
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container
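The Efficiency column is throughput divided by average board power draw during the run. As a rough illustration (the helper below is hypothetical, not part of any NVIDIA tooling), the reported throughput and efficiency pair lets you back out the implied board power:

```python
# Hypothetical helper showing how the Efficiency column is derived:
# efficiency = throughput / average board power (watts).
def efficiency(throughput, board_power_watts):
    """Performance per watt, e.g. images/sec/watt or tokens/sec/watt."""
    return throughput / board_power_watts

# Transformer XL, batch 1 on V100S-PCIE-32GB (row above):
# 5,961 total tokens/sec at 68.7 total tokens/sec/watt
# implies roughly 87 W of board power during the run.
implied_power = 5961 / 68.7
print(round(implied_power, 1))  # ~86.8 W
```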


T4 Inference Performance

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
MobileNet V1 1 4,549 images/sec 73.17 images/sec/watt 0.22 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 INT8 Synthetic TensorRT 7.0.0 NVIDIA T4
2 8,210 images/sec 119.22 images/sec/watt 0.24 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 INT8 Synthetic TensorRT 7.0.0 NVIDIA T4
8 14,014 images/sec 201.52 images/sec/watt 0.57 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 INT8 Synthetic TensorRT 7.0.0 NVIDIA T4
120 17,770 images/sec 253.98 images/sec/watt 6.75 1x T4 Supermicro SYS-4029GP-TRT 19.12-py3 INT8 Synthetic TensorRT 6.0.1 NVIDIA T4
128 16,144 images/sec 231.01 images/sec/watt 7.93 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 INT8 Synthetic TensorRT 7.0.0 NVIDIA T4
ResNet-50 1 1,172 images/sec 16.79 images/sec/watt 0.85 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 INT8 Synthetic TensorRT 7.0.0 NVIDIA T4
2 1,732 images/sec 24.83 images/sec/watt 1.15 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 INT8 Synthetic TensorRT 7.0.0 NVIDIA T4
8 3,939 images/sec 56.74 images/sec/watt 2.03 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 INT8 Synthetic TensorRT 7.0.0 NVIDIA T4
33 4,927 images/sec 70.54 images/sec/watt 6.7 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 INT8 Synthetic TensorRT 7.0.0 NVIDIA T4
128 5,415 images/sec 77.45 images/sec/watt 23.64 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 INT8 Synthetic TensorRT 7.0.0 NVIDIA T4
ResNet-50v1.5 1 1,126 images/sec 16.1 images/sec/watt 0.89 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 INT8 Synthetic TensorRT 7.0.0 NVIDIA T4
2 1,741 images/sec 25.4 images/sec/watt 1.15 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 INT8 Synthetic TensorRT 7.0.0 NVIDIA T4
8 3,808 images/sec 54.33 images/sec/watt 2.1 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 INT8 Synthetic TensorRT 7.0.0 NVIDIA T4
32 4,698 images/sec 67.26 images/sec/watt 6.81 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 INT8 Synthetic TensorRT 7.0.0 NVIDIA T4
128 5,083 images/sec 72.67 images/sec/watt 25.18 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 INT8 Synthetic TensorRT 7.0.0 NVIDIA T4
NCF 16384 24,325,977 samples/sec 430872.65 samples/sec/watt - 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed MovieLens 20M TensorFlow 1.15.2 NVIDIA T4
BERT-BASE 1 521 sequences/sec - sequences/sec/watt 1.92 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed Sample Text TensorRT 7.0.0 NVIDIA T4
2 691 sequences/sec - sequences/sec/watt 2.9 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed Sample Text TensorRT 7.0.0 NVIDIA T4
8 820 sequences/sec 11.88 sequences/sec/watt 9.75 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed Sample Text TensorRT 7.0.0 NVIDIA T4
128 852 sequences/sec 12.27 sequences/sec/watt 150.31 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed Sample Text TensorRT 7.0.0 NVIDIA T4
BERT-LARGE 256 524 sequences/sec - - 1x T4 Supermicro SYS-1029GQ-TRT - INT8 Sample Text TensorRT 7.1 NVIDIA T4
Mask R-CNN 1 11 images/sec 0.17 images/sec/watt - 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed COCO 2014 PyTorch 1.5.0a0+8f84ded NVIDIA T4
2 12 images/sec 0.27 images/sec/watt 0.22 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed COCO 2017 TensorFlow 1.15.2 NVIDIA T4
8 13 images/sec 0.29 images/sec/watt 0.77 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed COCO 2017 TensorFlow 1.15.2 NVIDIA T4
ResNeXt101 1 62 images/sec 1.34 images/sec/watt 16.07 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed Imagenet2012 PyTorch 1.5.0a0+8f84ded NVIDIA T4
2 122 images/sec 2.28 images/sec/watt 16.31 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed Imagenet2012 PyTorch 1.5.0a0+8f84ded NVIDIA T4
8 299 images/sec 4.3 images/sec/watt 26.74 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed Imagenet2012 PyTorch 1.5.0a0+8f84ded NVIDIA T4
128 358 images/sec 4.91 images/sec/watt 323.9 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed Imagenet2012 PyTorch 1.5.0a0+8f84ded NVIDIA T4
SE-ResNeXt101 1 46 images/sec 1.07 images/sec/watt 21.74 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed Imagenet2012 PyTorch 1.5.0a0+8f84ded NVIDIA T4
2 89 images/sec 1.79 images/sec/watt 22.51 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed Imagenet2012 PyTorch 1.5.0a0+8f84ded NVIDIA T4
8 274 images/sec 3.93 images/sec/watt 29.23 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed Imagenet2012 PyTorch 1.5.0a0+8f84ded NVIDIA T4
128 338 images/sec 5.14 images/sec/watt 344.65 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed Imagenet2012 PyTorch 1.5.0a0+8f84ded NVIDIA T4
SSD ResNet-50 1 99 images/sec - images/sec/watt - 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed COCO 2017 PyTorch 1.5.0a0+8f84ded NVIDIA T4
2 149 images/sec - images/sec/watt - 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed COCO 2017 PyTorch 1.5.0a0+8f84ded NVIDIA T4
8 195 images/sec - images/sec/watt - 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed COCO 2017 PyTorch 1.5.0a0+8f84ded NVIDIA T4
32 212 images/sec - images/sec/watt - 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed COCO 2017 PyTorch 1.5.0a0+8f84ded NVIDIA T4
Tacotron2 1 1,093 total output mels/sec - total output mels/sec/watt 1829.68 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 FP32 LJSpeech 1.1 PyTorch 1.5.0a0+8f84ded NVIDIA T4
4 4,256 total output mels/sec - total output mels/sec/watt 1879.81 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 FP32 LJSpeech 1.1 PyTorch 1.5.0a0+8f84ded NVIDIA T4
Transformer XL 1 6,002 total tokens/sec 106 total tokens/sec/watt 10.68 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed WikiText-103 PyTorch 1.5.0a0+8f84ded NVIDIA T4
2 11,149 total tokens/sec 161.61 total tokens/sec/watt 11.49 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed WikiText-103 PyTorch 1.5.0a0+8f84ded NVIDIA T4
8 18,436 total tokens/sec - total tokens/sec/watt 27.76 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed WikiText-103 PyTorch 1.5.0a0+8f84ded NVIDIA T4
32 20,516 total tokens/sec - total tokens/sec/watt 99.75 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed WikiText-103 PyTorch 1.5.0a0+8f84ded NVIDIA T4
U-Net Industrial 16 118 images/sec 1.76 images/sec/watt - 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed DAGM2007 TensorFlow 1.15.2 NVIDIA T4
U-Net Medical 1 56 images/sec - images/sec/watt 17.83 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed EM segmentation challenge TensorFlow 1.15.2 NVIDIA T4
2 63 images/sec 1.21 images/sec/watt 31.57 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed EM segmentation challenge TensorFlow 1.15.2 NVIDIA T4
4 68 images/sec 1.28 images/sec/watt 58.64 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed EM segmentation challenge TensorFlow 1.15.2 NVIDIA T4
V-Net Medical 1 185 images/sec - images/sec/watt - 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed Hippocampus head and body from Medical Segmentation Decathlon TensorFlow 1.15.2 NVIDIA T4
2 270 images/sec - images/sec/watt - 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed Hippocampus head and body from Medical Segmentation Decathlon TensorFlow 1.15.2 NVIDIA T4
8 429 images/sec - images/sec/watt - 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 FP32 Hippocampus head and body from Medical Segmentation Decathlon TensorFlow 1.15.2 NVIDIA T4
32 500 images/sec 8.07 images/sec/watt - 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed Hippocampus head and body from Medical Segmentation Decathlon TensorFlow 1.15.2 NVIDIA T4
VAE-CF 128 26,149 users processed/sec - users processed/sec/watt - 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 FP32 MovieLens 20M TensorFlow 1.15.2 NVIDIA T4
24576 62,813 users processed/sec - users processed/sec/watt - 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed MovieLens 20M TensorFlow 1.15.2 NVIDIA T4
WaveGlow 1 67,914 output samples/sec - output samples/sec/watt 3373.68 1x T4 Supermicro SYS-1029GQ-TRT 20.03-py3 Mixed LJSpeech 1.1 PyTorch 1.5.0a0+8f84ded NVIDIA T4

NGC: TensorRT Container
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container
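The Latency (ms) column in these tables appears to be per-batch latency, i.e. batch size divided by throughput. A quick sanity check under that assumption (the function name is ours, for illustration only) reproduces the reported T4 ResNet-50 figure:

```python
# Per-batch latency in milliseconds, assuming
# latency = batch_size / throughput.
def batch_latency_ms(batch_size, throughput_per_sec):
    return batch_size / throughput_per_sec * 1000

# Cross-check against the T4 ResNet-50 row: batch 128 at 5,415 images/sec.
lat = batch_latency_ms(128, 5415)
print(round(lat, 2))  # 23.64 ms, matching the reported latency
```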


Last updated: May 18th, 2020