Review the latest GPU acceleration factors of popular HPC applications.

Please refer to the Measuring Training and Inferencing Performance on NVIDIA AI Platforms Reviewer’s Guide for instructions on how to reproduce these performance claims.


NVIDIA’s complete solution stack, from GPUs to libraries to containers on NVIDIA GPU Cloud (NGC), allows data scientists to quickly get up and running with deep learning. NVIDIA® V100 Tensor Core GPUs leverage mixed precision to accelerate deep learning training throughput across every framework and every type of neural network. NVIDIA breaks performance records on MLPerf, AI’s first industry-wide benchmark, a testament to our GPU-accelerated platform approach.
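
Mixed precision is fast because FP16 math runs on Tensor Cores, but FP16 cannot represent very small gradients, so training frameworks apply loss scaling. A minimal NumPy sketch of the underflow problem (the `to_fp16_gradient` helper and the scale factor of 1024 are illustrative, not NVIDIA's implementation):

```python
import numpy as np

def to_fp16_gradient(grad_fp32: float, loss_scale: float = 1.0) -> float:
    """Simulate a gradient stored in FP16 with optional loss scaling.

    The scaled value is rounded to FP16 (as a backward pass would store it),
    then unscaled in FP32 (as the optimizer's master weights would see it).
    """
    scaled = np.float16(grad_fp32 * loss_scale)
    return float(scaled) / loss_scale

small_grad = 1e-8  # below FP16's smallest subnormal (~6e-8)

print(to_fp16_gradient(small_grad))                         # 0.0 -- underflows, gradient lost
print(to_fp16_gradient(small_grad, loss_scale=1024.0) > 0)  # True -- survives with scaling
```

Frameworks in the NGC containers handle this automatically (automatic mixed precision chooses and adjusts the loss scale dynamically), so users typically only need to opt in.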

NVIDIA Performance on MLPerf 0.6 AI Benchmarks

ResNet-50 v1.5 Time to Solution on V100

MXNet | Batch Size: refer to the V100 Training Performance table below | Precision: Mixed | Dataset: ImageNet2012 | Convergence criteria: refer to MLPerf requirements

Training Performance

NVIDIA Performance on MLPerf 0.6 AI Benchmarks

| Framework | Network | Network Type | Time to Solution | GPU | Server | MLPerf ID | Precision | Dataset | GPU Version |
| MXNet | ResNet-50 v1.5 | CNN | 115.22 minutes | 8x V100 | DGX-1 | 0.6-8 | Mixed | ImageNet2012 | V100-SXM2-16GB |
| | | CNN | 57.87 minutes | 16x V100 | DGX-2 | 0.6-17 | Mixed | ImageNet2012 | V100-SXM3-32GB |
| | | CNN | 52.74 minutes | 16x V100 | DGX-2H | 0.6-19 | Mixed | ImageNet2012 | V100-SXM3-32GB-H |
| | | CNN | 2.59 minutes | 512x V100 | DGX-2H | 0.6-29 | Mixed | ImageNet2012 | V100-SXM3-32GB-H |
| | | CNN | 1.69 minutes | 1040x V100 | DGX-1 | 0.6-16 | Mixed | ImageNet2012 | V100-SXM2-16GB |
| | | CNN | 1.33 minutes | 1536x V100 | DGX-2H | 0.6-30 | Mixed | ImageNet2012 | V100-SXM3-32GB-H |
| PyTorch | SSD-ResNet-34 | CNN | 22.36 minutes | 8x V100 | DGX-1 | 0.6-9 | Mixed | COCO2017 | V100-SXM2-16GB |
| | | CNN | 12.21 minutes | 16x V100 | DGX-2 | 0.6-18 | Mixed | COCO2017 | V100-SXM3-32GB |
| | | CNN | 11.41 minutes | 16x V100 | DGX-2H | 0.6-20 | Mixed | COCO2017 | V100-SXM3-32GB-H |
| | | CNN | 4.78 minutes | 64x V100 | DGX-2H | 0.6-21 | Mixed | COCO2017 | V100-SXM3-32GB-H |
| | | CNN | 2.67 minutes | 240x V100 | DGX-1 | 0.6-13 | Mixed | COCO2017 | V100-SXM2-16GB |
| | | CNN | 2.56 minutes | 240x V100 | DGX-2H | 0.6-24 | Mixed | COCO2017 | V100-SXM3-32GB-H |
| | | CNN | 2.23 minutes | 240x V100 | DGX-2H | 0.6-27 | Mixed | COCO2017 | V100-SXM3-32GB-H |
| | Mask R-CNN | CNN | 207.48 minutes | 8x V100 | DGX-1 | 0.6-9 | Mixed | COCO2017 | V100-SXM2-16GB |
| | | CNN | 101 minutes | 16x V100 | DGX-2 | 0.6-18 | Mixed | COCO2017 | V100-SXM3-32GB |
| | | CNN | 95.2 minutes | 16x V100 | DGX-2H | 0.6-20 | Mixed | COCO2017 | V100-SXM3-32GB-H |
| | | CNN | 32.72 minutes | 64x V100 | DGX-2H | 0.6-21 | Mixed | COCO2017 | V100-SXM3-32GB-H |
| | | CNN | 22.03 minutes | 192x V100 | DGX-1 | 0.6-12 | Mixed | COCO2017 | V100-SXM2-16GB |
| | | CNN | 18.47 minutes | 192x V100 | DGX-2H | 0.6-23 | Mixed | COCO2017 | V100-SXM3-32GB-H |
| PyTorch | GNMT | RNN | 20.55 minutes | 8x V100 | DGX-1 | 0.6-9 | Mixed | WMT16 English-German | V100-SXM2-16GB |
| | | RNN | 10.94 minutes | 16x V100 | DGX-2 | 0.6-18 | Mixed | WMT16 English-German | V100-SXM3-32GB |
| | | RNN | 9.87 minutes | 16x V100 | DGX-2H | 0.6-20 | Mixed | WMT16 English-German | V100-SXM3-32GB-H |
| | | RNN | 2.12 minutes | 256x V100 | DGX-2H | 0.6-25 | Mixed | WMT16 English-German | V100-SXM3-32GB-H |
| | | RNN | 1.99 minutes | 384x V100 | DGX-1 | 0.6-14 | Mixed | WMT16 English-German | V100-SXM2-16GB |
| | | RNN | 1.8 minutes | 384x V100 | DGX-2H | 0.6-26 | Mixed | WMT16 English-German | V100-SXM3-32GB-H |
| PyTorch | Transformer | Attention | 20.34 minutes | 8x V100 | DGX-1 | 0.6-9 | Mixed | WMT17 English-German | V100-SXM2-16GB |
| | | Attention | 11.04 minutes | 16x V100 | DGX-2 | 0.6-18 | Mixed | WMT17 English-German | V100-SXM3-32GB |
| | | Attention | 9.8 minutes | 16x V100 | DGX-2H | 0.6-20 | Mixed | WMT17 English-German | V100-SXM3-32GB-H |
| | | Attention | 2.41 minutes | 160x V100 | DGX-2H | 0.6-22 | Mixed | WMT17 English-German | V100-SXM3-32GB-H |
| | | Attention | 2.05 minutes | 480x V100 | DGX-1 | 0.6-15 | Mixed | WMT17 English-German | V100-SXM2-16GB |
| | | Attention | 1.59 minutes | 480x V100 | DGX-2H | 0.6-28 | Mixed | WMT17 English-German | V100-SXM3-32GB-H |
| TensorFlow | MiniGo | Reinforcement Learning | 27.39 minutes | 8x V100 | DGX-1 | 0.6-10 | Mixed | N/A | V100-SXM2-16GB |
| | | Reinforcement Learning | 13.57 minutes | 24x V100 | DGX-1 | 0.6-11 | Mixed | N/A | V100-SXM2-16GB |

Training Image Classification on CNNs

ResNet-50 V1.5 Throughput on V100

DGX-1: 8x NVIDIA V100-SXM2-16GB for MXNet and PyTorch; NVIDIA V100-SXM2-32GB for TensorFlow; E5-2698 v4 2.2 GHz | Batch Size: MXNet = 208, PyTorch = 256, TensorFlow = 512 | MXNet, PyTorch, TensorFlow = 19.12-py3 | Precision: Mixed | Dataset: ImageNet2012

ResNet-50 V1.5 Throughput on T4

Supermicro SYS-4029GP-TRT T4: 8x NVIDIA T4, Gold 6240 2.6 GHz for MXNet, PyTorch and TensorFlow | Batch Size: MXNet = 208, PyTorch and TensorFlow = 256 | MXNet, PyTorch, TensorFlow = 19.12-py3 | Precision: Mixed | Dataset: ImageNet2012
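
The single-GPU and 8-GPU throughputs in the tables below also show how close training comes to linear scaling. A small sketch of that calculation (the `scaling_efficiency` helper is illustrative), using the MXNet ResNet-50 v1.5 DGX-1 rows (1,468.55 and 10,787.44 images/sec):

```python
def scaling_efficiency(single_gpu: float, multi_gpu: float, n_gpus: int) -> float:
    """Fraction of ideal linear speedup achieved when scaling to n_gpus."""
    return multi_gpu / (n_gpus * single_gpu)

# MXNet ResNet-50 v1.5 throughput on DGX-1, from the V100 training table below.
eff = scaling_efficiency(1468.55, 10787.44, 8)
print(f"{eff:.1%}")  # 91.8% -- near-linear scaling across 8 GPUs
```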

V100 Training Performance

| Framework | Network | Network Type | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
| MXNet | Inception V3 | CNN | 553 images/sec | 1x V100 | DGX-1 | 19.12-py3 | Mixed | 208 | ImageNet2012 | V100-SXM2-16GB |
| | | CNN | 623.28 images/sec | 1x V100 | DGX-2H | 19.12-py3 | Mixed | 384 | ImageNet2012 | V100-SXM3-32GB-H |
| | | CNN | 4,261.95 images/sec | 8x V100 | DGX-1 | 19.12-py3 | Mixed | 208 | ImageNet2012 | V100-SXM2-16GB |
| | | CNN | 4,756.47 images/sec | 8x V100 | DGX-2H | 19.12-py3 | Mixed | 384 | ImageNet2012 | V100-SXM3-32GB-H |
| | ResNet-50 | CNN | 1,382 images/sec | 1x V100 | DGX-1 | 19.11-py3 | Mixed | 208 | ImageNet2012 | V100-SXM2-16GB |
| | | CNN | 1,445 images/sec | 1x V100 | DGX-2 | 19.11-py3 | Mixed | 512 | ImageNet2012 | V100-SXM3-32GB |
| | | CNN | 1,551 images/sec | 1x V100 | DGX-2H | 19.11-py3 | Mixed | 512 | ImageNet2012 | V100-SXM3-32GB-H |
| | | CNN | 10,358 images/sec | 8x V100 | DGX-1 | 19.11-py3 | Mixed | 192 | ImageNet2012 | V100-SXM2-16GB |
| | | CNN | 10,805 images/sec | 8x V100 | DGX-2 | 19.11-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB |
| | | CNN | 11,507 images/sec | 8x V100 | DGX-2H | 19.11-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB-H |
| | ResNet-50 v1.5 | CNN | 1,468.55 images/sec | 1x V100 | DGX-1 | 19.12-py3 | Mixed | 208 | ImageNet2012 | V100-SXM2-16GB |
| | | CNN | 1,658.31 images/sec | 1x V100 | DGX-2H | 19.12-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB-H |
| | | CNN | 10,787.44 images/sec | 8x V100 | DGX-1 | 19.12-py3 | Mixed | 208 | ImageNet2012 | V100-SXM2-16GB |
| | | CNN | 11,486.48 images/sec | 8x V100 | DGX-2 | 19.12-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB |
| | | CNN | 12,228.38 images/sec | 8x V100 | DGX-2H | 19.12-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB-H |
| PyTorch | Inception V3 | CNN | 534.39 images/sec | 1x V100 | DGX-1 | 19.12-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-16GB |
| | | CNN | 596.54 images/sec | 1x V100 | DGX-2H | 19.12-py3 | Mixed | 512 | ImageNet2012 | V100-SXM3-32GB-H |
| | | CNN | 4,161.36 images/sec | 8x V100 | DGX-1 | 19.12-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-16GB |
| | Mask R-CNN | CNN | 16 images/sec | 1x V100 | DGX-1 | 19.11-py3 | Mixed | 16 | COCO 2014 | V100-SXM2-32GB |
| | | CNN | 16.52 images/sec | 1x V100 | DGX-2 | 19.12-py3 | Mixed | 16 | COCO 2014 | V100-SXM3-32GB |
| | | CNN | 86.89 images/sec | 8x V100 | DGX-1 | 19.12-py3 | Mixed | 4 | COCO 2014 | V100-SXM2-16GB |
| | ResNet-50 | CNN | 905 images/sec | 1x V100 | DGX-1 | 19.11-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-16GB |
| | | CNN | 926 images/sec | 1x V100 | DGX-2 | 19.11-py3 | Mixed | 512 | ImageNet2012 | V100-SXM3-32GB |
| | | CNN | 1,025 images/sec | 1x V100 | DGX-2H | 19.11-py3 | Mixed | 512 | ImageNet2012 | V100-SXM3-32GB-H |
| | | CNN | 6,179 images/sec | 8x V100 | DGX-1 | 19.11-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-16GB |
| | | CNN | 6,595 images/sec | 8x V100 | DGX-2 | 19.11-py3 | Mixed | 512 | ImageNet2012 | V100-SXM3-32GB |
| | | CNN | 7,151 images/sec | 8x V100 | DGX-2H | 19.11-py3 | Mixed | 512 | ImageNet2012 | V100-SXM3-32GB-H |
| | ResNet-50 v1.5 | CNN | 912.74 images/sec | 1x V100 | DGX-1 | 19.12-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-16GB |
| | | CNN | 1,007.19 images/sec | 1x V100 | DGX-2H | 19.12-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB-H |
| | | CNN | 6,881.12 images/sec | 8x V100 | DGX-1 | 19.12-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-16GB |
| | SSD v1.1 | CNN | 235.29 images/sec | 1x V100 | DGX-1 | 19.12-py3 | Mixed | 64 | COCO 2017 | V100-SXM2-16GB |
| | | CNN | 262.1 images/sec | 1x V100 | DGX-2H | 19.12-py3 | Mixed | 64 | COCO 2017 | V100-SXM3-32GB-H |
| | | CNN | 1,830 images/sec | 8x V100 | DGX-1 | 19.12-py3 | Mixed | 64 | COCO 2017 | V100-SXM2-16GB |
| | Tacotron2 | CNN | 19,621 total output mels/sec | 1x V100 | DGX-1 | 19.11-py3 | Mixed | 128 | LJSpeech 1.1 | V100-SXM2-32GB |
| | | CNN | 24,269 total output mels/sec | 1x V100 | DGX-2H | 19.11-py3 | Mixed | 128 | LJSpeech 1.1 | V100-SXM3-32GB-H |
| | | CNN | 116,696 total output mels/sec | 8x V100 | DGX-1 | 19.11-py3 | Mixed | 128 | LJSpeech 1.1 | V100-SXM2-32GB |
| | | CNN | 150,263 total output mels/sec | 8x V100 | DGX-2H | 19.11-py3 | Mixed | 128 | LJSpeech 1.1 | V100-SXM3-32GB-H |
| | WaveGlow | CNN | 77,732 output samples/sec | 1x V100 | DGX-1 | 19.09-py3 | Mixed | 10 | LJSpeech 1.1 | V100-SXM2-16GB |
| | | CNN | 92,280 output samples/sec | 1x V100 | DGX-2H | 19.10-py3 | Mixed | 10 | LJSpeech 1.1 | V100-SXM3-32GB-H |
| | | CNN | 557,814 output samples/sec | 8x V100 | DGX-1 | 19.10-py3 | Mixed | 10 | LJSpeech 1.1 | V100-SXM2-16GB |
| | Jasper | CNN | 41 sequences/sec | 1x V100 | DGX-1 | 19.12-py3 | Mixed | 64 | LibriSpeech | V100-SXM2-32GB |
| | | CNN | 49 sequences/sec | 1x V100 | DGX-2H | 19.12-py3 | Mixed | 64 | LibriSpeech | V100-SXM3-32GB-H |
| | | CNN | 305 sequences/sec | 8x V100 | DGX-1 | 19.12-py3 | Mixed | 64 | LibriSpeech | V100-SXM2-32GB |
| TensorFlow | Inception V3 | CNN | 727.15 images/sec | 1x V100 | DGX-1 | 19.12-py3 | Mixed | 384 | ImageNet2012 | V100-SXM2-32GB |
| | | CNN | 830.94 images/sec | 1x V100 | DGX-2H | 19.12-py3 | Mixed | 384 | ImageNet2012 | V100-SXM3-32GB-H |
| | | CNN | 5,516.26 images/sec | 8x V100 | DGX-1 | 19.12-py3 | Mixed | 384 | ImageNet2012 | V100-SXM2-32GB |
| | | CNN | 5,682.88 images/sec | 8x V100 | DGX-2H | 19.12-py3 | Mixed | 384 | ImageNet2012 | V100-SXM3-32GB-H |
| | ResNet-50 v1.5 | CNN | 1,064.75 images/sec | 1x V100 | DGX-1 | 19.12-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-16GB |
| | | CNN | 1,240.33 images/sec | 1x V100 | DGX-2H | 19.12-py3 | Mixed | 512 | ImageNet2012 | V100-SXM3-32GB-H |
| | | CNN | 8,116.45 images/sec | 8x V100 | DGX-1 | 19.12-py3 | Mixed | 512 | ImageNet2012 | V100-SXM2-32GB |
| | | CNN | 8,445 images/sec | 8x V100 | DGX-2H | 19.12-py3 | Mixed | 512 | ImageNet2012 | V100-SXM3-32GB-H |
| | SSD v1.2 | CNN | 124.99 images/sec | 1x V100 | DGX-1 | 19.12-py3 | Mixed | 32 | COCO 2017 | V100-SXM2-16GB |
| | | CNN | 140.34 images/sec | 1x V100 | DGX-2H | 19.12-py3 | Mixed | 32 | COCO 2017 | V100-SXM3-32GB-H |
| | | CNN | 599.5 images/sec | 8x V100 | DGX-1 | 19.12-py3 | Mixed | 32 | COCO 2017 | V100-SXM2-16GB |
| | | CNN | 726.54 images/sec | 8x V100 | DGX-2H | 19.12-py3 | Mixed | 32 | COCO 2017 | V100-SXM3-32GB-H |
| | U-Net Industrial | CNN | 100 images/sec | 1x V100 | DGX-1 | 19.12-py3 | Mixed | 16 | DAGM2007 | V100-SXM2-16GB |
| | | CNN | 112 images/sec | 1x V100 | DGX-2H | 19.12-py3 | Mixed | 16 | DAGM2007 | V100-SXM3-32GB-H |
| | | CNN | 528 images/sec | 8x V100 | DGX-1 | 19.12-py3 | Mixed | 2 | DAGM2007 | V100-SXM2-16GB |
| | | CNN | 570 images/sec | 8x V100 | DGX-2H | 19.12-py3 | Mixed | 2 | DAGM2007 | V100-SXM3-32GB-H |
| PyTorch | GNMT V2 | RNN | 56,250 total tokens/sec | 1x V100 | DGX-1 | 19.12-py3 | Mixed | 128 | WMT16 English-German | V100-SXM2-16GB |
| | | RNN | 64,515 total tokens/sec | 1x V100 | DGX-2H | 19.12-py3 | Mixed | 128 | WMT16 English-German | V100-SXM3-32GB-H |
| | | RNN | 407,068 total tokens/sec | 8x V100 | DGX-1 | 19.12-py3 | Mixed | 128 | WMT16 English-German | V100-SXM2-16GB |
| | | RNN | 445,451 total tokens/sec | 8x V100 | DGX-2H | 19.12-py3 | Mixed | 128 | WMT16 English-German | V100-SXM3-32GB-H |
| TensorFlow | GNMT V2 | RNN | 25,159 total tokens/sec | 1x V100 | DGX-1 | 19.10-py3 | Mixed | 192 | WMT16 English-German | V100-SXM2-32GB |
| | | RNN | 29,972 total tokens/sec | 1x V100 | DGX-2H | 19.10-py3 | Mixed | 192 | WMT16 English-German | V100-SXM3-32GB-H |
| | | RNN | 163,445 total tokens/sec | 8x V100 | DGX-1 | 19.10-py3 | Mixed | 192 | WMT16 English-German | V100-SXM2-32GB |
| PyTorch | NCF | Recommender | 21,395,676 samples/sec | 1x V100 | DGX-1 | 19.12-py3 | Mixed | 1,048,576 | MovieLens 20 Million | V100-SXM2-16GB |
| | | Recommender | 23,683,277 samples/sec | 1x V100 | DGX-2H | 19.12-py3 | Mixed | 1,048,576 | MovieLens 20 Million | V100-SXM3-32GB-H |
| | | Recommender | 99,257,568 samples/sec | 8x V100 | DGX-1 | 19.12-py3 | Mixed | 1,048,576 | MovieLens 20 Million | V100-SXM2-16GB |
| | | Recommender | 105,607,286 samples/sec | 8x V100 | DGX-2H | 19.12-py3 | Mixed | 1,048,576 | MovieLens 20 Million | V100-SXM3-32GB-H |
| TensorFlow | NCF | Recommender | 26,470,121 samples/sec | 1x V100 | DGX-1 | 19.12-py3 | Mixed | 1,048,576 | MovieLens 20 Million | V100-SXM2-16GB |
| | | Recommender | 28,113,739 samples/sec | 1x V100 | DGX-2H | 19.12-py3 | Mixed | 1,048,576 | MovieLens 20 Million | V100-SXM3-32GB-H |
| | | Recommender | 76,749,507 samples/sec | 8x V100 | DGX-1 | 19.12-py3 | Mixed | 1,048,576 | MovieLens 20 Million | V100-SXM2-16GB |
| PyTorch | BERT-LARGE | Attention | 44.8 items/sec | 1x V100 | DGX-1 | 19.12-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM2-32GB |
| | | Attention | 52.81 items/sec | 1x V100 | DGX-2H | 19.12-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM3-32GB-H |
| | | Attention | 326.06 items/sec | 8x V100 | DGX-1 | 19.12-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM2-32GB |
| | | Attention | 369.83 items/sec | 8x V100 | DGX-2H | 19.12-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM3-32GB-H |
| TensorFlow | BERT-LARGE | Attention | 38.04 sentences/sec | 1x V100 | DGX-1 | 19.12-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM2-32GB |
| | | Attention | 44.32 sentences/sec | 1x V100 | DGX-2H | 19.12-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM3-32GB-H |
| | | Attention | 195.44 sentences/sec | 8x V100 | DGX-1 | 19.12-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM2-32GB |
| | | Attention | 221.14 sentences/sec | 8x V100 | DGX-2H | 19.12-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM3-32GB-H |

T4 Training Performance

| Framework | Network | Network Type | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
| MXNet | Inception V3 | CNN | 180.13 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | Mixed | 208 | ImageNet2012 | NVIDIA T4 |
| | | CNN | 1,420.91 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | Mixed | 208 | ImageNet2012 | NVIDIA T4 |
| | ResNet-50 v1.5 | CNN | 486.59 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | Mixed | 208 | ImageNet2012 | NVIDIA T4 |
| | | CNN | 3,758.94 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | Mixed | 208 | ImageNet2012 | NVIDIA T4 |
| PyTorch | Inception V3 | CNN | 172.41 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | Mixed | 256 | ImageNet2012 | NVIDIA T4 |
| | | CNN | 1,348.85 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | Mixed | 256 | ImageNet2012 | NVIDIA T4 |
| | Mask R-CNN | CNN | 6.28 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | Mixed | 4 | COCO 2014 | NVIDIA T4 |
| | | CNN | 39.04 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | Mixed | 4 | COCO 2014 | NVIDIA T4 |
| | ResNet-50 v1.5 | CNN | 280.21 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | Mixed | 256 | ImageNet2012 | NVIDIA T4 |
| | | CNN | 2,245.7 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | Mixed | 256 | ImageNet2012 | NVIDIA T4 |
| | SSD v1.1 | CNN | 76.89 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | Mixed | 64 | COCO 2017 | NVIDIA T4 |
| | | CNN | 637.25 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | Mixed | 64 | COCO 2017 | NVIDIA T4 |
| | Tacotron2 | CNN | 13,451 total output mels/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | Mixed | 104 | LJSpeech 1.1 | NVIDIA T4 |
| | | CNN | 103,679 total output mels/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | Mixed | 128 | LJSpeech 1.1 | NVIDIA T4 |
| | WaveGlow | CNN | 33,719 output samples/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA T4 |
| | | CNN | 247,439 output samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA T4 |
| | Jasper | CNN | 13 sequences/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | Mixed | 32 | LibriSpeech | NVIDIA T4 |
| | | CNN | 93 sequences/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | Mixed | 32 | LibriSpeech | NVIDIA T4 |
| TensorFlow | Inception V3 | CNN | 235.05 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | Mixed | 192 | ImageNet2012 | NVIDIA T4 |
| | | CNN | 1,767.35 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | Mixed | 192 | ImageNet2012 | NVIDIA T4 |
| | ResNet-50 v1.5 | CNN | 355.13 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | Mixed | 256 | ImageNet2012 | NVIDIA T4 |
| | | CNN | 2,684.89 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | Mixed | 256 | ImageNet2012 | NVIDIA T4 |
| | SSD v1.2 | CNN | 62.52 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | Mixed | 32 | COCO 2017 | NVIDIA T4 |
| | | CNN | 286 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-py3 | Mixed | 32 | COCO 2017 | NVIDIA T4 |
| | U-Net Industrial | CNN | 31 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | Mixed | 16 | DAGM2007 | NVIDIA T4 |
| | | CNN | 210 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | Mixed | 2 | DAGM2007 | NVIDIA T4 |
| PyTorch | GNMT V2 | RNN | 22,204 total tokens/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | Mixed | 128 | WMT16 English-German | NVIDIA T4 |
| | | RNN | 135,243 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | Mixed | 128 | WMT16 English-German | NVIDIA T4 |
| TensorFlow | GNMT V2 | RNN | 11,771 total tokens/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | Mixed | 192 | WMT16 English-German | NVIDIA T4 |
| | | RNN | 54,389 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | Mixed | 128 | WMT16 English-German | NVIDIA T4 |
| PyTorch | NCF | Recommender | 7,228,830 samples/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | Mixed | 1,048,576 | MovieLens 20 Million | NVIDIA T4 |
| | | Recommender | 24,394,706 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | Mixed | 1,048,576 | MovieLens 20 Million | NVIDIA T4 |
| TensorFlow | NCF | Recommender | 10,375,820 samples/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | Mixed | 1,048,576 | MovieLens 20 Million | NVIDIA T4 |
| | | Recommender | 18,957,309 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | Mixed | 1,048,576 | MovieLens 20 Million | NVIDIA T4 |
| TensorFlow | BERT | Attention | 9.98 sentences/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | Mixed | 3 | SQuAD v1.1 | NVIDIA T4 |
| | | Attention | 45.16 sentences/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | Mixed | 3 | SQuAD v1.1 | NVIDIA T4 |

 

NVIDIA® TensorRT™ running on NVIDIA GPUs enables the most efficient deep learning inference performance across multiple application areas and models. This versatility gives data scientists wide latitude to create the optimal low-latency solution. Visit NVIDIA GPU Cloud (NGC) to download any of these containers.

NVIDIA V100 Tensor Core GPUs leverage mixed precision to combine high throughput with low latencies across every type of neural network. The NVIDIA P4 is an inference GPU designed for optimal power consumption and latency in ultra-efficient scale-out servers. Read the inference whitepaper to learn more about NVIDIA’s inference platform.

Measuring inference performance involves balancing many variables. PLASTER is an acronym describing the key elements of deep learning performance: each letter identifies a factor (Programmability, Latency, Accuracy, Size of Model, Throughput, Energy Efficiency, Rate of Learning) that must be weighed to arrive at the right set of tradeoffs and produce a successful deep learning implementation. Refer to NVIDIA’s PLASTER whitepaper for more details.

NVIDIA landed top performance spots on all five MLPerf Inference 0.5 benchmarks with the best per-accelerator performance among commercially available products.

NVIDIA Performance on MLPerf Inference v0.5 Benchmarks (Server)

NVIDIA Turing 70W: Supermicro 4029GP-TRT-OTO-28: 1x NVIDIA T4, 2x Intel(R) Xeon(R) Platinum 8280 CPU @ 2.7GHz | TensorRT 6.0 | CUDA 10.1 | cuDNN 7.6.3
NVIDIA Turing 280W: SCAN 3XS DBP T496X2 Fluid: 1x TitanRTX, 2x Intel(R) Xeon(R) 8268 | TensorRT 6.0 | CUDA 10.1 | cuDNN 7.6.3


NVIDIA Performance on MLPerf Inference v0.5 Benchmarks (Offline)

NVIDIA Turing 70W: Supermicro 4029GP-TRT-OTO-28: 1x NVIDIA T4, 2x Intel(R) Xeon(R) Platinum 8280 CPU @ 2.7GHz | TensorRT 6.0 | CUDA 10.1 | cuDNN 7.6.3
NVIDIA Turing 280W: SCAN 3XS DBP T496X2 Fluid: 1x TitanRTX, 2x Intel(R) Xeon(R) 8268 | TensorRT 6.0 | CUDA 10.1 | cuDNN 7.6.3


MLPerf v0.5 Inference results for data center server form factors and offline and server scenarios retrieved from www.mlperf.org on Nov. 6, 2019, from entries Inf-0.5-15, Inf-0.5-16, Inf-0.5-19, Inf-0.5-21, Inf-0.5-22, Inf-0.5-23, Inf-0.5-25, Inf-0.5-26, and Inf-0.5-27. Per-processor performance is calculated by dividing the primary metric of total performance by the number of accelerators reported.
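
The per-processor normalization described above is a straightforward division of the primary metric by the accelerator count. As a sketch (the numbers below are illustrative, not actual submission values):

```python
def per_accelerator(total_throughput: float, n_accelerators: int) -> float:
    """MLPerf per-processor performance: total primary metric / accelerator count."""
    return total_throughput / n_accelerators

# Illustrative example: an 8-accelerator submission reporting 44,000 total inputs/sec.
print(per_accelerator(44_000.0, 8))  # 5500.0 inputs/sec per accelerator
```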

The MLPerf name and logo are trademarks.

NVIDIA Performance on MLPerf Inference v0.5 Benchmarks (ResNet-50 V1.5 Offline Scenario)

MLPerf v0.5 Inference results for data center server form factors and offline scenario retrieved from www.mlperf.org on Nov. 6, 2019 (Closed Inf-0.5-25 and Inf-0.5-27 for INT8, Open Inf-0.5-460 and Inf-0.5-462 for INT4). Per-processor performance is calculated by dividing the primary metric of total performance by number of accelerators reported. MLPerf name and logo are trademarks.

MLPerf Inference Performance

NVIDIA Turing 70W

| Network | Scenario | Batch Size | Throughput | Efficiency | Latency | GPU | Server | Container | Precision | Dataset | GPU Version |
| MobileNet v1 | Server | - | 16,884 queries/sec | - | - | 1x T4 | Supermicro 4029GP-TRT-OTO-28 | - | - | ImageNet [224x224] | NVIDIA T4 |
| MobileNet v1 | Offline | - | 17,726 inputs/sec | - | - | 1x T4 | Supermicro 4029GP-TRT-OTO-28 | - | - | ImageNet [224x224] | NVIDIA T4 |
| ResNet-50 v1.5 | Server | - | 5,193 queries/sec | - | - | 1x T4 | Supermicro 4029GP-TRT-OTO-28 | - | - | ImageNet [224x224] | NVIDIA T4 |
| ResNet-50 v1.5 | Offline | - | 5,622 inputs/sec | - | - | 1x T4 | Supermicro 4029GP-TRT-OTO-28 | - | - | ImageNet [224x224] | NVIDIA T4 |
| SSD MobileNet v1 | Server | - | 7,078 queries/sec | - | - | 1x T4 | Supermicro 4029GP-TRT-OTO-28 | - | - | COCO [300x300] | NVIDIA T4 |
| SSD MobileNet v1 | Offline | - | 7,609 inputs/sec | - | - | 1x T4 | Supermicro 4029GP-TRT-OTO-28 | - | - | COCO [300x300] | NVIDIA T4 |
| SSD ResNet-34 | Server | - | 126 queries/sec | - | - | 1x T4 | Supermicro 4029GP-TRT-OTO-28 | - | - | COCO [1200x1200] | NVIDIA T4 |
| SSD ResNet-34 | Offline | - | 137 inputs/sec | - | - | 1x T4 | Supermicro 4029GP-TRT-OTO-28 | - | - | COCO [1200x1200] | NVIDIA T4 |
| GNMT | Server | - | 198 queries/sec | - | - | 1x T4 | Supermicro 4029GP-TRT-OTO-28 | - | - | WMT16 | NVIDIA T4 |
| GNMT | Offline | - | 354 inputs/sec | - | - | 1x T4 | Supermicro 4029GP-TRT-OTO-28 | - | - | WMT16 | NVIDIA T4 |

Supermicro 4029GP-TRT-OTO-28: 1x NVIDIA T4, 2x Intel(R) Xeon(R) Platinum 8280 CPU @ 2.7GHz | TensorRT 6.0 | CUDA 10.1 | cuDNN 7.6.3

NVIDIA Turing 280W

| Network | Scenario | Batch Size | Throughput | Efficiency | Latency | GPU | Server | Container | Precision | Dataset | GPU Version |
| MobileNet v1 | Server | - | 49,775 queries/sec | - | - | 1x TitanRTX | SCAN 3XS DBP T496X2 Fluid | - | - | ImageNet [224x224] | TitanRTX |
| MobileNet v1 | Offline | - | 55,597 inputs/sec | - | - | 1x TitanRTX | SCAN 3XS DBP T496X2 Fluid | - | - | ImageNet [224x224] | TitanRTX |
| ResNet-50 v1.5 | Server | - | 15,008 queries/sec | - | - | 1x TitanRTX | SCAN 3XS DBP T496X2 Fluid | - | - | ImageNet [224x224] | TitanRTX |
| ResNet-50 v1.5 | Offline | - | 16,563 inputs/sec | - | - | 1x TitanRTX | SCAN 3XS DBP T496X2 Fluid | - | - | ImageNet [224x224] | TitanRTX |
| SSD MobileNet v1 | Server | - | 20,503 queries/sec | - | - | 1x TitanRTX | SCAN 3XS DBP T496X2 Fluid | - | - | COCO [300x300] | TitanRTX |
| SSD MobileNet v1 | Offline | - | 22,945 inputs/sec | - | - | 1x TitanRTX | SCAN 3XS DBP T496X2 Fluid | - | - | COCO [300x300] | TitanRTX |
| SSD ResNet-34 | Server | - | 388 queries/sec | - | - | 1x TitanRTX | SCAN 3XS DBP T496X2 Fluid | - | - | COCO [1200x1200] | TitanRTX |
| SSD ResNet-34 | Offline | - | 415 inputs/sec | - | - | 1x TitanRTX | SCAN 3XS DBP T496X2 Fluid | - | - | COCO [1200x1200] | TitanRTX |
| GNMT | Server | - | 654 queries/sec | - | - | 1x TitanRTX | SCAN 3XS DBP T496X2 Fluid | - | - | WMT16 | TitanRTX |
| GNMT | Offline | - | 1,061 inputs/sec | - | - | 1x TitanRTX | SCAN 3XS DBP T496X2 Fluid | - | - | WMT16 | TitanRTX |

SCAN 3XS DBP T496X2 Fluid: 1x TitanRTX, 2x Intel(R) Xeon(R) 8268 | TensorRT 6.0 | CUDA 10.1 | cuDNN 7.6.3

 

Inference Image Classification on CNNs with TensorRT

ResNet-50 v1.5 Throughput

DGX-1: 1x NVIDIA V100-SXM2-16GB, E5-2698 v4 2.2 GHz | TensorRT 6.0 | Batch Size = 128 | 19.12-py3 | Precision: Mixed | Dataset: Synthetic
Supermicro SYS-4029GP-TRT T4: 1x NVIDIA T4, Gold 6240 2.6 GHz | TensorRT 6.0 | Batch Size = 128 | 19.12-py3 | Precision: INT8 | Dataset: Synthetic
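
At a fixed batch size, the throughput and latency columns in the inference tables below are linked: per-batch latency is roughly batch size divided by throughput, so one can be sanity-checked against the other. A sketch of that check (the helper name is illustrative), using the V100 ResNet-50 v1.5 row at batch 128 (7,191 images/sec, ~18 ms):

```python
def batch_latency_ms(batch_size: int, images_per_sec: float) -> float:
    """Implied per-batch latency in milliseconds at a given throughput."""
    return 1000.0 * batch_size / images_per_sec

# ResNet-50 v1.5 on 1x V100 (DGX-1) at batch 128, from the table below.
print(round(batch_latency_ms(128, 7191.0)))  # 18 -- matches the table's latency column
```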

 
 

ResNet-50 v1.5 Latency

DGX-2: 1x NVIDIA V100-SXM3-32GB, Xeon Platinum 8168 2.7 GHz | TensorRT 6.0 | Batch Size = 1 | 19.12-py3 | Precision: INT8 | Dataset: Synthetic
Supermicro SYS-4029GP-TRT T4: 1x NVIDIA T4, Gold 6240 2.6 GHz | TensorRT 6.0 | Batch Size = 1 | 19.12-py3 | Precision: INT8 | Dataset: Synthetic

 
 

ResNet-50 v1.5 Power Efficiency

DGX-1: 1x NVIDIA V100-SXM2-16GB, E5-2698 v4 2.2 GHz | TensorRT 6.0 | Batch Size = 128 | 19.12-py3 | Precision: Mixed | Dataset: Synthetic
Supermicro SYS-4029GP-TRT T4: 1x NVIDIA T4, Gold 6240 2.6 GHz | TensorRT 6.0 | Batch Size = 128 | 19.12-py3 | Precision: INT8 | Dataset: Synthetic

 

Inference Performance

V100 Inference Performance

| Network | Network Type | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
| GoogleNet | CNN | 1 | 1,539 images/sec | 14 images/sec/watt | 0.65 | 1x V100 | DGX-1 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | V100-SXM2-16GB |
| | CNN | 2 | 2,080 images/sec | 18 images/sec/watt | 0.96 | 1x V100 | DGX-1 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | V100-SXM2-16GB |
| | CNN | 8 | 5,155 images/sec | 36 images/sec/watt | 1.6 | 1x V100 | DGX-1 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | V100-SXM2-16GB |
| | CNN | 83 | 12,174 images/sec | 45 images/sec/watt | 6.8 | 1x V100 | DGX-1 | 19.12-py3 | Mixed | Synthetic | TensorRT 6.0 | V100-SXM2-16GB |
| | CNN | 128 | 12,446 images/sec | 45 images/sec/watt | 10 | 1x V100 | DGX-1 | 19.12-py3 | Mixed | Synthetic | TensorRT 6.0 | V100-SXM2-16GB |
| MobileNet V1 | CNN | 1 | 4,695 images/sec | 35 images/sec/watt | 0.21 | 1x V100 | DGX-1 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | V100-SXM2-16GB |
| | CNN | 2 | 6,493 images/sec | 45 images/sec/watt | 0.31 | 1x V100 | DGX-1 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | V100-SXM2-16GB |
| | CNN | 8 | 15,621 images/sec | 73 images/sec/watt | 0.51 | 1x V100 | DGX-2 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | V100-SXM3-32GB |
| | CNN | 220 | 30,224 images/sec | 104 images/sec/watt | 6.9 | 1x V100 | DGX-1 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | V100-SXM2-16GB |
| | CNN | 128 | 29,481 images/sec | 101 images/sec/watt | 4.3 | 1x V100 | DGX-1 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | V100-SXM2-16GB |
| ResNet-50 | CNN | 1 | 1,147 images/sec | 8.4 images/sec/watt | 0.87 | 1x V100 | DGX-1 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | V100-SXM2-16GB |
| | CNN | 2 | 1,529 images/sec | 9.5 images/sec/watt | 1.3 | 1x V100 | DGX-1 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | V100-SXM2-16GB |
| | CNN | 8 | 3,394 images/sec | 19 images/sec/watt | 2.4 | 1x V100 | DGX-1 | 19.12-py3 | Mixed | Synthetic | TensorRT 6.0 | V100-SXM2-16GB |
| | CNN | 52 | 7,617 images/sec | 27 images/sec/watt | 6.8 | 1x V100 | DGX-1 | 19.12-py3 | Mixed | Synthetic | TensorRT 6.0 | V100-SXM2-16GB |
| | CNN | 128 | 7,744 images/sec | 27 images/sec/watt | 17 | 1x V100 | DGX-1 | 19.12-py3 | Mixed | Synthetic | TensorRT 6.0 | V100-SXM2-16GB |
| | CNN | 128 | 7,924 images/sec | 24 images/sec/watt | 16 | 1x V100 | DGX-2 | 19.12-py3 | Mixed | Synthetic | TensorRT 6.0 | V100-SXM3-32GB |
| ResNet-50 v1.5 | CNN | 1 | 1,011 images/sec | 6.1 images/sec/watt | 0.99 | 1x V100 | DGX-2 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | V100-SXM3-32GB |
| | CNN | 2 | 1,404 images/sec | 9.4 images/sec/watt | 1.4 | 1x V100 | DGX-1 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | V100-SXM2-16GB |
| | CNN | 8 | 3,269 images/sec | 18 images/sec/watt | 2.5 | 1x V100 | DGX-1 | 19.12-py3 | Mixed | Synthetic | TensorRT 6.0 | V100-SXM2-16GB |
| | CNN | 51 | 7,379 images/sec | 23 images/sec/watt | 6.9 | 1x V100 | DGX-2 | 19.12-py3 | Mixed | Synthetic | TensorRT 6.0 | V100-SXM3-32GB |
| | CNN | 128 | 7,191 images/sec | 25 images/sec/watt | 18 | 1x V100 | DGX-1 | 19.12-py3 | Mixed | Synthetic | TensorRT 6.0 | V100-SXM2-16GB |
| | CNN | 128 | 7,542 images/sec | 23 images/sec/watt | 17 | 1x V100 | DGX-2 | 19.12-py3 | Mixed | Synthetic | TensorRT 6.0 | V100-SXM3-32GB |
| VGG16 | CNN | 1 | 868 images/sec | 3.6 images/sec/watt | 1.2 | 1x V100 | DGX-2 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | V100-SXM3-32GB |
| | CNN | 2 | 960 images/sec | 4.7 images/sec/watt | 2.1 | 1x V100 | DGX-1 | 19.12-py3 | Mixed | Synthetic | TensorRT 6.0 | V100-SXM2-16GB |
| | CNN | 8 | 1,923 images/sec | 6.8 images/sec/watt | 4.2 | 1x V100 | DGX-2 | 19.12-py3 | Mixed | Synthetic | TensorRT 6.0 | V100-SXM3-32GB |
| | CNN | 16 | 2,379 images/sec | 7.6 images/sec/watt | 6.7 | 1x V100 | DGX-2 | 19.12-py3 | Mixed | Synthetic | TensorRT 6.0 | V100-SXM3-32GB |
| | CNN | 128 | 2,797 images/sec | 9.6 images/sec/watt | 46 | 1x V100 | DGX-1 | 19.12-py3 | Mixed | Synthetic | TensorRT 6.0 | V100-SXM2-32GB |
| NMT | RNN | 1 | 4,013 total tokens/sec | - | 13 | 1x V100 | DGX-1 | - | Mixed | WMT16 English-German | TensorRT 5.1 | V100-SXM2-32GB |
| | RNN | 2 | 6,290 total tokens/sec | - | 16 | 1x V100 | DGX-1 | - | Mixed | WMT16 English-German | TensorRT 5.1 | V100-SXM2-32GB |
| | RNN | 64 | 56,531 total tokens/sec | - | 58 | 1x V100 | DGX-1 | - | Mixed | WMT16 English-German | TensorRT 5.1 | V100-SXM2-32GB |
| | RNN | 128 | 73,375 total tokens/sec | - | 89 | 1x V100 | DGX-1 | - | Mixed | WMT16 English-German | TensorRT 5.1 | V100-SXM2-32GB |
| NCF | Recommender | 1 | 26,418 samples/sec | 335 samples/sec/watt | 0.04 | 1x V100 | DGX-2 | 19.12-py3 | FP32 | Synthetic | TensorRT 6.0 | V100-SXM3-32GB |
| | Recommender | 64 | 1,630,561 samples/sec | 17,549 samples/sec/watt | 0.04 | 1x V100 | DGX-2 | 19.12-py3 | FP32 | Synthetic | TensorRT 6.0 | V100-SXM3-32GB |
| | Recommender | 25,000 | 91,055,121 samples/sec | 609,642 samples/sec/watt | 0.27 | 1x V100 | DGX-1 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | V100-SXM2-16GB |
| | Recommender | 100,000 | 121,238,130 samples/sec | 582,641 samples/sec/watt | 0.82 | 1x V100 | DGX-2 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | V100-SXM3-32GB |
| BERT-BASE | Attention | 1 | 557 sentences/sec | 10.3 sentences/sec/watt | 1.8 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | V100-PCIE-16GB |
| | Attention | 2 | 978 sentences/sec | 18.8 sentences/sec/watt | 2 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | V100-PCIE-16GB |
| | Attention | 8 | 1,847 sentences/sec | 34.1 sentences/sec/watt | 4.3 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | V100-PCIE-16GB |
| | Attention | 24 | 2,419 sentences/sec | 43.7 sentences/sec/watt | 9.9 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | V100-PCIE-16GB |
| | Attention | 128 | 2,645 sentences/sec | 46 sentences/sec/watt | 48.4 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | V100-PCIE-16GB |
| BERT-LARGE | Attention | 1 | 239 sentences/sec | 4.3 sentences/sec/watt | 4.2 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | V100-PCIE-16GB |
| | Attention | 2 | 407 sentences/sec | 7.5 sentences/sec/watt | 4.9 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | V100-PCIE-16GB |
| | Attention | 4 | 562 sentences/sec | 10.6 sentences/sec/watt | 7.1 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | V100-PCIE-16GB |
| | Attention | 8 | 636 sentences/sec | 11.8 sentences/sec/watt | 12.6 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | V100-PCIE-16GB |
| | Attention | 128 | 823 sentences/sec | 13.6 sentences/sec/watt | 155.5 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | V100-PCIE-16GB |

NGC: TensorRT Container | Download TensorRT 6
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container

 

T4 Inference Performance

| Network | Network Type | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
| GoogleNet | CNN | 1 | 1,694 images/sec | 28 images/sec/watt | 0.59 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4 |
| | CNN | 2 | 2,473 images/sec | 40 images/sec/watt | 0.81 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4 |
| | CNN | 8 | 6,282 images/sec | 90 images/sec/watt | 1.3 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4 |
| | CNN | 59 | 8,659 images/sec | 124 images/sec/watt | 6.8 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4 |
| | CNN | 128 | 8,839 images/sec | 127 images/sec/watt | 14 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4 |
| MobileNet V1 | CNN | 1 | 4,542 images/sec | 84 images/sec/watt | 0.22 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4 |
| | CNN | 2 | 7,539 images/sec | 123 images/sec/watt | 0.27 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4 |
| | CNN | 8 | 14,360 images/sec | 206 images/sec/watt | 0.56 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4 |
| | CNN | 120 | 17,770 images/sec | 254 images/sec/watt | 6.8 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4 |
| | CNN | 128 | 17,697 images/sec | 254 images/sec/watt | 7.2 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4 |
| ResNet-50 | CNN | 1 | 1,152 images/sec | 16 images/sec/watt | 0.87 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4 |
| | CNN | 2 | 1,685 images/sec | 24 images/sec/watt | 1.2 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4 |
| | CNN | 8 | 4,049 images/sec | 58 images/sec/watt | 2 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4 |
| | CNN | 35 | 5,240 images/sec | 75 images/sec/watt | 6.7 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4 |
| | CNN | 128 | 5,758 images/sec | 82 images/sec/watt | 22 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4 |
| ResNet-50 v1.5 | CNN | 1 | 1,094 images/sec | 16 images/sec/watt | 0.91 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4 |
| | CNN | 2 | 1,714 images/sec | 25 images/sec/watt | 1.2 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4 |
| | CNN | 8 | 3,869 images/sec | 56 images/sec/watt | 2 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4 |
| | CNN | 34 | 5,040 images/sec | 72 images/sec/watt | 6.8 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4 |
| | CNN | 128 | 5,259 images/sec | 75 images/sec/watt | 24 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4 |
| VGG16 | CNN | 1 | 799 images/sec | 11 images/sec/watt | 1.3 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4 |
| | CNN | 2 | 1,093 images/sec | 16 images/sec/watt | 1.8 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4 |
| | CNN | 8 | 1,628 images/sec | 23 images/sec/watt | 4.9 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4 |
| | CNN | 11 | 1,723 images/sec | 25 images/sec/watt | 6.4 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4 |
| | CNN | 128 | 1,885 images/sec | 27 images/sec/watt | 68 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4 |
| NCF | Recommender | 1 | 13,169 samples/sec | 345 samples/sec/watt | 0.08 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4 |
| | Recommender | 64 | 1,026,669 samples/sec | 26,713 samples/sec/watt | 0.06 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4 |
| | Recommender | 25,000 | 51,189,454 samples/sec | 733,884 samples/sec/watt | 0.49 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4 |
| | Recommender | 100,000 | 54,236,289 samples/sec | 776,160 samples/sec/watt | 1.8 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.12-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4 |
| BERT-BASE | Attention | 1 | 484 sentences/sec | 11 sentences/sec/watt | 2.07 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | NVIDIA T4 |
| | Attention | 2 | 754 sentences/sec | 17 sentences/sec/watt | 2.65 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | NVIDIA T4 |
| | Attention | 8 | 827 sentences/sec | 20 sentences/sec/watt | 9.67 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | NVIDIA T4 |
| | Attention | 128 | 800 sentences/sec | 16 sentences/sec/watt | 160.02 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | NVIDIA T4 |
| BERT-LARGE | Attention | 1 | 171 sentences/sec | 4 sentences/sec/watt | 5.84 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | NVIDIA T4 |
| | Attention | 2 | 168 sentences/sec | 4 sentences/sec/watt | 11.88 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | NVIDIA T4 |
| | Attention | 8 | 244 sentences/sec | 6 sentences/sec/watt | 32.74 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | NVIDIA T4 |
| | Attention | 128 | 254 sentences/sec | 5 sentences/sec/watt | 504.33 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | NVIDIA T4 |

NGC: TensorRT Container | Download TensorRT 6
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container
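
Since efficiency here is throughput per watt of board power, dividing throughput by efficiency recovers the implied board power. A sketch (the helper name is illustrative), using the GoogleNet T4 row at batch 128 (8,839 images/sec at 127 images/sec/watt):

```python
def board_power_watts(throughput: float, efficiency_per_watt: float) -> float:
    """Implied board power: throughput divided by (throughput per watt)."""
    return throughput / efficiency_per_watt

# GoogleNet on 1x T4 at batch 128, from the T4 inference table above.
print(round(board_power_watts(8839.0, 127.0)))  # 70 -- consistent with the T4's 70 W board
```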

 

Last updated: Feb 9th, 2020