

NVIDIA's complete solution stack, from GPUs to libraries to containers on NVIDIA GPU Cloud (NGC), allows data scientists to quickly get up and running with deep learning. NVIDIA® V100 Tensor Core GPUs leverage mixed precision to accelerate deep learning training throughput across every framework and every type of neural network. NVIDIA breaks performance records on MLPerf, AI's first industry-wide benchmark, a testament to our GPU-accelerated platform approach.
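Mixed precision stores activations and gradients in FP16 while keeping a master copy of the weights in FP32; the practical wrinkle is that small gradients underflow in FP16, which loss scaling works around. A minimal NumPy sketch of the underflow problem (illustrative values only; the frameworks above apply scaling automatically):

```python
# Why mixed-precision training uses loss scaling: FP16 underflows for very
# small gradient values, and scaling the loss before the backward pass
# (then unscaling the gradients) keeps them representable.
import numpy as np

grad = 1e-8                          # below FP16's smallest subnormal (~6e-8)
print(np.float16(grad))              # 0.0 — the gradient is lost
scale = 1024.0
scaled = np.float16(grad * scale)    # representable once scaled
print(float(scaled) / scale)         # ~1e-8 recovered after unscaling in FP32
```

The scale factor is a free parameter; frameworks typically adjust it dynamically so scaled gradients stay inside FP16 range without overflowing.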

NVIDIA Performance on MLPerf 0.6 AI Benchmarks

ResNet-50 v1.5 Time to Solution on V100

MXNet | Batch size: refer to the V100 Training Performance table below | Precision: Mixed | Dataset: ImageNet2012 | Convergence criteria: refer to MLPerf requirements

Training Performance

NVIDIA Performance on MLPerf 0.6 AI Benchmarks

Framework | Network | Network Type | Time to Solution | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version
MXNet | ResNet-50 v1.5 | CNN | 115.22 minutes | 8x V100 | DGX-1 | 0.6-8 | Mixed | ImageNet2012 | V100-SXM2-16GB
MXNet | ResNet-50 v1.5 | CNN | 57.87 minutes | 16x V100 | DGX-2 | 0.6-17 | Mixed | ImageNet2012 | V100-SXM3-32GB
MXNet | ResNet-50 v1.5 | CNN | 52.74 minutes | 16x V100 | DGX-2H | 0.6-19 | Mixed | ImageNet2012 | V100-SXM3-32GB-H
MXNet | ResNet-50 v1.5 | CNN | 2.59 minutes | 512x V100 | DGX-2H | 0.6-29 | Mixed | ImageNet2012 | V100-SXM3-32GB-H
MXNet | ResNet-50 v1.5 | CNN | 1.69 minutes | 1040x V100 | DGX-1 | 0.6-16 | Mixed | ImageNet2012 | V100-SXM2-16GB
MXNet | ResNet-50 v1.5 | CNN | 1.33 minutes | 1536x V100 | DGX-2H | 0.6-30 | Mixed | ImageNet2012 | V100-SXM3-32GB-H
PyTorch | SSD-ResNet-34 | CNN | 22.36 minutes | 8x V100 | DGX-1 | 0.6-9 | Mixed | COCO2017 | V100-SXM2-16GB
PyTorch | SSD-ResNet-34 | CNN | 12.21 minutes | 16x V100 | DGX-2 | 0.6-18 | Mixed | COCO2017 | V100-SXM3-32GB
PyTorch | SSD-ResNet-34 | CNN | 11.41 minutes | 16x V100 | DGX-2H | 0.6-20 | Mixed | COCO2017 | V100-SXM3-32GB-H
PyTorch | SSD-ResNet-34 | CNN | 4.78 minutes | 64x V100 | DGX-2H | 0.6-21 | Mixed | COCO2017 | V100-SXM3-32GB-H
PyTorch | SSD-ResNet-34 | CNN | 2.67 minutes | 240x V100 | DGX-1 | 0.6-13 | Mixed | COCO2017 | V100-SXM2-16GB
PyTorch | SSD-ResNet-34 | CNN | 2.56 minutes | 240x V100 | DGX-2H | 0.6-24 | Mixed | COCO2017 | V100-SXM3-32GB-H
PyTorch | SSD-ResNet-34 | CNN | 2.23 minutes | 240x V100 | DGX-2H | 0.6-27 | Mixed | COCO2017 | V100-SXM3-32GB-H
PyTorch | Mask R-CNN | CNN | 207.48 minutes | 8x V100 | DGX-1 | 0.6-9 | Mixed | COCO2017 | V100-SXM2-16GB
PyTorch | Mask R-CNN | CNN | 101 minutes | 16x V100 | DGX-2 | 0.6-18 | Mixed | COCO2017 | V100-SXM3-32GB
PyTorch | Mask R-CNN | CNN | 95.2 minutes | 16x V100 | DGX-2H | 0.6-20 | Mixed | COCO2017 | V100-SXM3-32GB-H
PyTorch | Mask R-CNN | CNN | 32.72 minutes | 64x V100 | DGX-2H | 0.6-21 | Mixed | COCO2017 | V100-SXM3-32GB-H
PyTorch | Mask R-CNN | CNN | 22.03 minutes | 192x V100 | DGX-1 | 0.6-12 | Mixed | COCO2017 | V100-SXM2-16GB
PyTorch | Mask R-CNN | CNN | 18.47 minutes | 192x V100 | DGX-2H | 0.6-23 | Mixed | COCO2017 | V100-SXM3-32GB-H
PyTorch | GNMT | RNN | 20.55 minutes | 8x V100 | DGX-1 | 0.6-9 | Mixed | WMT16 English-German | V100-SXM2-16GB
PyTorch | GNMT | RNN | 10.94 minutes | 16x V100 | DGX-2 | 0.6-18 | Mixed | WMT16 English-German | V100-SXM3-32GB
PyTorch | GNMT | RNN | 9.87 minutes | 16x V100 | DGX-2H | 0.6-20 | Mixed | WMT16 English-German | V100-SXM3-32GB-H
PyTorch | GNMT | RNN | 2.12 minutes | 256x V100 | DGX-2H | 0.6-25 | Mixed | WMT16 English-German | V100-SXM3-32GB-H
PyTorch | GNMT | RNN | 1.99 minutes | 384x V100 | DGX-1 | 0.6-14 | Mixed | WMT16 English-German | V100-SXM2-16GB
PyTorch | GNMT | RNN | 1.8 minutes | 384x V100 | DGX-2H | 0.6-26 | Mixed | WMT16 English-German | V100-SXM3-32GB-H
PyTorch | Transformer | Attention | 20.34 minutes | 8x V100 | DGX-1 | 0.6-9 | Mixed | WMT17 English-German | V100-SXM2-16GB
PyTorch | Transformer | Attention | 11.04 minutes | 16x V100 | DGX-2 | 0.6-18 | Mixed | WMT17 English-German | V100-SXM3-32GB
PyTorch | Transformer | Attention | 9.8 minutes | 16x V100 | DGX-2H | 0.6-20 | Mixed | WMT17 English-German | V100-SXM3-32GB-H
PyTorch | Transformer | Attention | 2.41 minutes | 160x V100 | DGX-2H | 0.6-22 | Mixed | WMT17 English-German | V100-SXM3-32GB-H
PyTorch | Transformer | Attention | 2.05 minutes | 480x V100 | DGX-1 | 0.6-15 | Mixed | WMT17 English-German | V100-SXM2-16GB
PyTorch | Transformer | Attention | 1.59 minutes | 480x V100 | DGX-2H | 0.6-28 | Mixed | WMT17 English-German | V100-SXM3-32GB-H
TensorFlow | MiniGo | Reinforcement Learning | 27.39 minutes | 8x V100 | DGX-1 | 0.6-10 | Mixed | N/A | V100-SXM2-16GB
TensorFlow | MiniGo | Reinforcement Learning | 13.57 minutes | 24x V100 | DGX-1 | 0.6-11 | Mixed | N/A | V100-SXM2-16GB
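The time-to-solution entries above can be turned into scaling figures. A short sketch using the ResNet-50 v1.5 rows ("parallel efficiency" here is speedup divided by the increase in GPU count — a derived metric for illustration, not an MLPerf result):

```python
# Speedup and parallel efficiency of the large-scale ResNet-50 v1.5 runs
# relative to the 8x V100 DGX-1 baseline (115.22 minutes).

def scaling(base_gpus, base_minutes, gpus, minutes):
    """Return (speedup, parallel efficiency) versus the baseline run."""
    speedup = base_minutes / minutes
    efficiency = speedup / (gpus / base_gpus)
    return speedup, efficiency

base = (8, 115.22)  # 8x V100 DGX-1 baseline from the table
for gpus, minutes in [(512, 2.59), (1040, 1.69), (1536, 1.33)]:
    s, e = scaling(*base, gpus, minutes)
    print(f"{gpus:>5} GPUs: {s:6.1f}x speedup, {e:5.1%} parallel efficiency")
```

Efficiency drops as the cluster grows because, at fixed total work, per-GPU batch size shrinks and communication dominates — the usual strong-scaling tradeoff.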

Training Image Classification on CNNs

ResNet-50 V1.5 Throughput on V100

DGX-1: 8x NVIDIA V100-SXM2-32GB for MXNet and TensorFlow, NVIDIA V100-SXM2-16GB for PyTorch, E5-2698 v4 2.2 GHz | Batch Size = 256 for MXNet and PyTorch, 512 for TensorFlow | Container: 19.11-py3 for MXNet and PyTorch, 19.11-tf1-py3 for TensorFlow | Precision: Mixed | Dataset: ImageNet2012

ResNet-50 V1.5 Throughput on T4

Supermicro SYS-4029GP-TRT T4: 8x NVIDIA T4, Gold 6240 2.6 GHz | Batch Size = 208 for MXNet, 256 for PyTorch and TensorFlow | Container: 19.11-py3 for MXNet and PyTorch, 19.11-tf1-py3 for TensorFlow | Precision: Mixed | Dataset: ImageNet2012

V100 Training Performance

Framework | Network | Network Type | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | Inception V3 | CNN | 553 images/sec | 1x V100 | DGX-1 | 19.11-py3 | Mixed | 208 | ImageNet2012 | V100-SXM2-16GB
MXNet | Inception V3 | CNN | 623 images/sec | 1x V100 | DGX-2H | 19.11-py3 | Mixed | 384 | ImageNet2012 | V100-SXM3-32GB-H
MXNet | Inception V3 | CNN | 4,256 images/sec | 8x V100 | DGX-1 | 19.11-py3 | Mixed | 208 | ImageNet2012 | V100-SXM2-16GB
MXNet | Inception V3 | CNN | 4,386 images/sec | 8x V100 | DGX-2 | 19.11-py3 | Mixed | 384 | ImageNet2012 | V100-SXM3-32GB
MXNet | ResNet-50 | CNN | 1,382 images/sec | 1x V100 | DGX-1 | 19.11-py3 | Mixed | 208 | ImageNet2012 | V100-SXM2-16GB
MXNet | ResNet-50 | CNN | 1,445 images/sec | 1x V100 | DGX-2 | 19.11-py3 | Mixed | 512 | ImageNet2012 | V100-SXM3-32GB
MXNet | ResNet-50 | CNN | 1,551 images/sec | 1x V100 | DGX-2H | 19.11-py3 | Mixed | 512 | ImageNet2012 | V100-SXM3-32GB-H
MXNet | ResNet-50 | CNN | 10,358 images/sec | 8x V100 | DGX-1 | 19.11-py3 | Mixed | 192 | ImageNet2012 | V100-SXM2-16GB
MXNet | ResNet-50 | CNN | 10,805 images/sec | 8x V100 | DGX-2 | 19.11-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB
MXNet | ResNet-50 | CNN | 11,507 images/sec | 8x V100 | DGX-2H | 19.11-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB-H
MXNet | ResNet-50 V1.5 | CNN | 1,456 images/sec | 1x V100 | DGX-1 | 19.11-py3 | Mixed | 208 | ImageNet2012 | V100-SXM2-16GB
MXNet | ResNet-50 V1.5 | CNN | 1,658 images/sec | 1x V100 | DGX-2H | 19.11-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB-H
MXNet | ResNet-50 V1.5 | CNN | 10,544 images/sec | 8x V100 | DGX-1 | 19.11-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-32GB
MXNet | ResNet-50 V1.5 | CNN | 11,124 images/sec | 8x V100 | DGX-2 | 19.11-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB
MXNet | ResNet-50 V1.5 | CNN | 12,012 images/sec | 8x V100 | DGX-2H | 19.11-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB-H
PyTorch | Inception V3 | CNN | 546 images/sec | 1x V100 | DGX-1 | 19.11-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-16GB
PyTorch | Inception V3 | CNN | 607 images/sec | 1x V100 | DGX-2H | 19.11-py3 | Mixed | 512 | ImageNet2012 | V100-SXM3-32GB-H
PyTorch | Inception V3 | CNN | 4,242 images/sec | 8x V100 | DGX-1 | 19.11-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-16GB
PyTorch | Mask R-CNN | CNN | 16 images/sec | 1x V100 | DGX-1 | 19.11-py3 | Mixed | 16 | COCO2014 | V100-SXM2-32GB
PyTorch | Mask R-CNN | CNN | 17 images/sec | 1x V100 | DGX-2 | 19.09-py3 | Mixed | 16 | COCO2014 | V100-SXM3-32GB
PyTorch | Mask R-CNN | CNN | 92 images/sec | 8x V100 | DGX-1 | 19.11-py3 | Mixed | 16 | COCO2014 | V100-SXM2-32GB
PyTorch | ResNet-50 | CNN | 905 images/sec | 1x V100 | DGX-1 | 19.11-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-16GB
PyTorch | ResNet-50 | CNN | 926 images/sec | 1x V100 | DGX-2 | 19.11-py3 | Mixed | 512 | ImageNet2012 | V100-SXM3-32GB
PyTorch | ResNet-50 | CNN | 1,025 images/sec | 1x V100 | DGX-2H | 19.11-py3 | Mixed | 512 | ImageNet2012 | V100-SXM3-32GB-H
PyTorch | ResNet-50 | CNN | 6,179 images/sec | 8x V100 | DGX-1 | 19.11-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-16GB
PyTorch | ResNet-50 | CNN | 6,595 images/sec | 8x V100 | DGX-2 | 19.11-py3 | Mixed | 512 | ImageNet2012 | V100-SXM3-32GB
PyTorch | ResNet-50 | CNN | 7,151 images/sec | 8x V100 | DGX-2H | 19.11-py3 | Mixed | 512 | ImageNet2012 | V100-SXM3-32GB-H
PyTorch | ResNet-50 V1.5 | CNN | 916 images/sec | 1x V100 | DGX-1 | 19.11-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-16GB
PyTorch | ResNet-50 V1.5 | CNN | 1,028 images/sec | 1x V100 | DGX-2H | 19.11-py3 | Mixed | 512 | ImageNet2012 | V100-SXM3-32GB-H
PyTorch | ResNet-50 V1.5 | CNN | 7,246 images/sec | 8x V100 | DGX-1 | 19.11-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-16GB
PyTorch | SSD v1.1 | CNN | 234 images/sec | 1x V100 | DGX-1 | 19.11-py3 | Mixed | 64 | COCO 2017 | V100-SXM2-16GB
PyTorch | SSD v1.1 | CNN | 265 images/sec | 1x V100 | DGX-2H | 19.11-py3 | Mixed | 64 | COCO 2017 | V100-SXM3-32GB-H
PyTorch | SSD v1.1 | CNN | 2,132 images/sec | 8x V100 | DGX-1 | 19.06-py3 | Mixed | 64 | COCO 2017 | V100-SXM2-16GB
PyTorch | Tacotron2 | CNN | 19,621 total output mels/sec | 1x V100 | DGX-1 | 19.11-py3 | Mixed | 128 | LJ Speech 1.1 | V100-SXM2-32GB
PyTorch | Tacotron2 | CNN | 24,269 total output mels/sec | 1x V100 | DGX-2H | 19.11-py3 | Mixed | 128 | LJ Speech 1.1 | V100-SXM3-32GB-H
PyTorch | Tacotron2 | CNN | 116,696 total output mels/sec | 8x V100 | DGX-1 | 19.11-py3 | Mixed | 128 | LJ Speech 1.1 | V100-SXM2-32GB
PyTorch | Tacotron2 | CNN | 150,263 total output mels/sec | 8x V100 | DGX-2H | 19.11-py3 | Mixed | 128 | LJ Speech 1.1 | V100-SXM3-32GB-H
PyTorch | WaveGlow | CNN | 82,692 output samples/sec | 1x V100 | DGX-1 | 19.08-py3 | Mixed | 10 | LJ Speech 1.1 | V100-SXM2-16GB
PyTorch | WaveGlow | CNN | 92,280 output samples/sec | 1x V100 | DGX-2H | 19.10-py3 | Mixed | 10 | LJ Speech 1.1 | V100-SXM3-32GB-H
PyTorch | WaveGlow | CNN | 557,814 output samples/sec | 8x V100 | DGX-1 | 19.10-py3 | Mixed | 10 | LJ Speech 1.1 | V100-SXM2-16GB
TensorFlow | Inception V3 | CNN | 719 images/sec | 1x V100 | DGX-1 | 19.11-tf1-py3 | Mixed | 384 | ImageNet2012 | V100-SXM2-32GB
TensorFlow | Inception V3 | CNN | 832 images/sec | 1x V100 | DGX-2H | 19.11-tf1-py3 | Mixed | 384 | ImageNet2012 | V100-SXM3-32GB-H
TensorFlow | Inception V3 | CNN | 5,540 images/sec | 8x V100 | DGX-1 | 19.11-tf1-py3 | Mixed | 384 | ImageNet2012 | V100-SXM2-32GB
TensorFlow | Inception V3 | CNN | 5,719 images/sec | 8x V100 | DGX-2H | 19.11-tf1-py3 | Mixed | 384 | ImageNet2012 | V100-SXM3-32GB-H
TensorFlow | ResNet-50 V1.5 | CNN | 1,057 images/sec | 1x V100 | DGX-1 | 19.11-tf1-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-16GB
TensorFlow | ResNet-50 V1.5 | CNN | 1,237 images/sec | 1x V100 | DGX-2H | 19.11-tf1-py3 | Mixed | 512 | ImageNet2012 | V100-SXM3-32GB-H
TensorFlow | ResNet-50 V1.5 | CNN | 8,148 images/sec | 8x V100 | DGX-1 | 19.11-tf1-py3 | Mixed | 512 | ImageNet2012 | V100-SXM2-32GB
TensorFlow | SSD v1.1 | CNN | 128 images/sec | 1x V100 | DGX-1 | 19.11-tf1-py3 | Mixed | 32 | COCO 2017 | V100-SXM2-16GB
TensorFlow | SSD v1.1 | CNN | 148 images/sec | 1x V100 | DGX-2H | 19.11-tf1-py3 | Mixed | 32 | COCO 2017 | V100-SXM3-32GB-H
TensorFlow | SSD v1.1 | CNN | 612 images/sec | 8x V100 | DGX-1 | 19.11-tf1-py3 | Mixed | 32 | COCO 2017 | V100-SXM2-16GB
TensorFlow | SSD v1.1 | CNN | 719 images/sec | 8x V100 | DGX-2H | 19.11-tf1-py3 | Mixed | 32 | COCO 2017 | V100-SXM3-32GB-H
TensorFlow | U-Net Industrial | CNN | 100 images/sec | 1x V100 | DGX-1 | 19.11-tf1-py3 | Mixed | 16 | DAGM2007 | V100-SXM2-16GB
TensorFlow | U-Net Industrial | CNN | 111 images/sec | 1x V100 | DGX-2H | 19.11-tf1-py3 | Mixed | 16 | DAGM2007 | V100-SXM3-32GB-H
TensorFlow | U-Net Industrial | CNN | 526 images/sec | 8x V100 | DGX-1 | 19.11-tf1-py3 | Mixed | 2 | DAGM2007 | V100-SXM2-16GB
TensorFlow | U-Net Industrial | CNN | 569 images/sec | 8x V100 | DGX-2H | 19.11-tf1-py3 | Mixed | 2 | DAGM2007 | V100-SXM3-32GB-H
PyTorch | GNMT V2 | RNN | 76,722 total tokens/sec | 1x V100 | DGX-1 | 19.08-py3 | Mixed | 512 | WMT16 English-German | V100-SXM2-32GB
PyTorch | GNMT V2 | RNN | 83,948 total tokens/sec | 1x V100 | DGX-2 | 19.08-py3 | Mixed | 512 | WMT16 English-German | V100-SXM3-32GB
PyTorch | GNMT V2 | RNN | 585,249 total tokens/sec | 8x V100 | DGX-1 | 19.08-py3 | Mixed | 512 | WMT16 English-German | V100-SXM2-32GB
PyTorch | GNMT V2 | RNN | 606,769 total tokens/sec | 8x V100 | DGX-2 | 19.08-py3 | Mixed | 512 | WMT16 English-German | V100-SXM3-32GB
TensorFlow | GNMT V2 | RNN | 25,159 total tokens/sec | 1x V100 | DGX-1 | 19.10-py3 | Mixed | 192 | WMT16 English-German | V100-SXM2-32GB
TensorFlow | GNMT V2 | RNN | 29,972 total tokens/sec | 1x V100 | DGX-2H | 19.10-py3 | Mixed | 192 | WMT16 English-German | V100-SXM3-32GB-H
TensorFlow | GNMT V2 | RNN | 163,445 total tokens/sec | 8x V100 | DGX-1 | 19.10-py3 | Mixed | 192 | WMT16 English-German | V100-SXM2-32GB
PyTorch | NCF | Recommender | 21,380,777 samples/sec | 1x V100 | DGX-1 | 19.11-py3 | Mixed | 1048576 | MovieLens 20 Million | V100-SXM2-16GB
PyTorch | NCF | Recommender | 23,695,482 samples/sec | 1x V100 | DGX-2H | 19.11-py3 | Mixed | 1048576 | MovieLens 20 Million | V100-SXM3-32GB-H
PyTorch | NCF | Recommender | 98,988,565 samples/sec | 8x V100 | DGX-1 | 19.11-py3 | Mixed | 1048576 | MovieLens 20 Million | V100-SXM2-16GB
PyTorch | NCF | Recommender | 105,757,631 samples/sec | 8x V100 | DGX-2H | 19.11-py3 | Mixed | 1048576 | MovieLens 20 Million | V100-SXM3-32GB-H
TensorFlow | NCF | Recommender | 26,518,590 samples/sec | 1x V100 | DGX-1 | 19.11-tf1-py3 | Mixed | 1048576 | MovieLens 20 Million | V100-SXM2-16GB
TensorFlow | NCF | Recommender | 28,134,326 samples/sec | 1x V100 | DGX-2H | 19.11-tf1-py3 | Mixed | 1048576 | MovieLens 20 Million | V100-SXM3-32GB-H
TensorFlow | NCF | Recommender | 75,865,254 samples/sec | 8x V100 | DGX-1 | 19.11-tf1-py3 | Mixed | 1048576 | MovieLens 20 Million | V100-SXM2-16GB
PyTorch | BERT-LARGE | Attention | 48 sentences/sec | 1x V100 | DGX-1 | 19.11-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM2-32GB
PyTorch | BERT-LARGE | Attention | 57 sentences/sec | 1x V100 | DGX-2H | 19.11-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM3-32GB-H
PyTorch | BERT-LARGE | Attention | 345 sentences/sec | 8x V100 | DGX-1 | 19.11-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM2-32GB
PyTorch | BERT-LARGE | Attention | 396 sentences/sec | 8x V100 | DGX-2H | 19.11-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM3-32GB-H
TensorFlow | BERT-LARGE | Attention | 38 sentences/sec | 1x V100 | DGX-1 | 19.11-tf1-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM2-32GB
TensorFlow | BERT-LARGE | Attention | 44 sentences/sec | 1x V100 | DGX-2H | 19.11-tf1-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM3-32GB-H
TensorFlow | BERT-LARGE | Attention | 185 sentences/sec | 8x V100 | DGX-1 | 19.11-tf1-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM2-32GB
TensorFlow | BERT-LARGE | Attention | 222 sentences/sec | 8x V100 | DGX-2H | 19.11-tf1-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM3-32GB-H

T4 Training Performance

Framework | Network | Network Type | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | Inception V3 | CNN | 184 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-py3 | Mixed | 208 | ImageNet2012 | NVIDIA T4
MXNet | Inception V3 | CNN | 1,442 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-py3 | Mixed | 208 | ImageNet2012 | NVIDIA T4
MXNet | ResNet-50 v1.5 | CNN | 483 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-py3 | Mixed | 208 | ImageNet2012 | NVIDIA T4
MXNet | ResNet-50 v1.5 | CNN | 3,735 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-py3 | Mixed | 208 | ImageNet2012 | NVIDIA T4
PyTorch | Inception V3 | CNN | 176 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-py3 | Mixed | 256 | ImageNet2012 | NVIDIA T4
PyTorch | Inception V3 | CNN | 1,373 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-py3 | Mixed | 256 | ImageNet2012 | NVIDIA T4
PyTorch | Mask R-CNN | CNN | 6 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-py3 | Mixed | 4 | COCO2014 | NVIDIA T4
PyTorch | Mask R-CNN | CNN | 37 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-py3 | Mixed | 4 | COCO2014 | NVIDIA T4
PyTorch | ResNet-50 V1.5 | CNN | 280 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-py3 | Mixed | 256 | ImageNet2012 | NVIDIA T4
PyTorch | ResNet-50 V1.5 | CNN | 2,232 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-py3 | Mixed | 256 | ImageNet2012 | NVIDIA T4
PyTorch | SSD v1.1 | CNN | 77 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-py3 | Mixed | 64 | COCO 2017 | NVIDIA T4
PyTorch | SSD v1.1 | CNN | 629 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-py3 | Mixed | 64 | COCO 2017 | NVIDIA T4
PyTorch | Tacotron2 | CNN | 14,538 total output mels/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | Mixed | 128 | LJ Speech 1.1 | NVIDIA T4
PyTorch | Tacotron2 | CNN | 103,679 total output mels/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | Mixed | 128 | LJ Speech 1.1 | NVIDIA T4
PyTorch | WaveGlow | CNN | 34,414 output samples/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-py3 | Mixed | 10 | LJ Speech 1.1 | NVIDIA T4
PyTorch | WaveGlow | CNN | 247,439 output samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | Mixed | 10 | LJ Speech 1.1 | NVIDIA T4
TensorFlow | Inception V3 | CNN | 236 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-tf1-py3 | Mixed | 192 | ImageNet2012 | NVIDIA T4
TensorFlow | Inception V3 | CNN | 1,725 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-tf1-py3 | Mixed | 192 | ImageNet2012 | NVIDIA T4
TensorFlow | ResNet-50 V1.5 | CNN | 356 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-tf1-py3 | Mixed | 256 | ImageNet2012 | NVIDIA T4
TensorFlow | ResNet-50 V1.5 | CNN | 2,700 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-tf1-py3 | Mixed | 256 | ImageNet2012 | NVIDIA T4
TensorFlow | SSD v1.1 | CNN | 61 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-tf1-py3 | Mixed | 32 | COCO 2017 | NVIDIA T4
TensorFlow | SSD v1.1 | CNN | 286 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-tf1-py3 | Mixed | 32 | COCO 2017 | NVIDIA T4
TensorFlow | U-Net Industrial | CNN | 31 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-tf1-py3 | Mixed | 16 | DAGM2007 | NVIDIA T4
TensorFlow | U-Net Industrial | CNN | 210 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-tf1-py3 | Mixed | 2 | DAGM2007 | NVIDIA T4
PyTorch | GNMT V2 | RNN | 26,083 total tokens/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | Mixed | 256 | WMT16 English-German | NVIDIA T4
PyTorch | GNMT V2 | RNN | 181,049 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | Mixed | 256 | WMT16 English-German | NVIDIA T4
TensorFlow | GNMT V2 | RNN | 11,771 total tokens/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | Mixed | 192 | WMT16 English-German | NVIDIA T4
TensorFlow | GNMT V2 | RNN | 54,548 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-tf1-py3 | Mixed | 128 | WMT16 English-German | NVIDIA T4
PyTorch | NCF | Recommender | 7,225,075 samples/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-py3 | Mixed | 1048576 | MovieLens 20 Million | NVIDIA T4
PyTorch | NCF | Recommender | 24,494,923 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-py3 | Mixed | 1048576 | MovieLens 20 Million | NVIDIA T4
TensorFlow | NCF | Recommender | 10,301,645 samples/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-tf1-py3 | Mixed | 1048576 | MovieLens 20 Million | NVIDIA T4
TensorFlow | NCF | Recommender | 18,911,157 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-tf1-py3 | Mixed | 1048576 | MovieLens 20 Million | NVIDIA T4
TensorFlow | BERT | Attention | 10 sentences/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-tf1-py3 | Mixed | 3 | SQuAD v1.1 | NVIDIA T4
TensorFlow | BERT | Attention | 45 sentences/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-tf1-py3 | Mixed | 3 | SQuAD v1.1 | NVIDIA T4
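A quick way to read the single-node rows above is to compare 8-GPU throughput against eight times the 1-GPU throughput. A sketch using three rows from the table (this efficiency metric is a derivation for illustration, not part of the published results):

```python
# Single-node scaling efficiency: 8-GPU throughput divided by eight times
# the 1-GPU throughput, using (network, 1x T4, 8x T4) values from the table.

def node_scaling_efficiency(t_1gpu, t_8gpu, n_gpus=8):
    """Fraction of ideal linear scaling achieved within one node."""
    return t_8gpu / (n_gpus * t_1gpu)

rows = [
    ("MXNet ResNet-50 v1.5", 483, 3735),
    ("PyTorch SSD v1.1", 77, 629),
    ("TensorFlow Inception V3", 236, 1725),
]
for name, t1, t8 in rows:
    print(f"{name}: {node_scaling_efficiency(t1, t8):.1%}")
```

Values slightly above 100% can occur when the 8-GPU run amortizes fixed per-step costs (data loading, kernel launch) better than the single-GPU baseline.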

 

NVIDIA® TensorRT™ running on NVIDIA GPUs enables the most efficient deep learning inference performance across multiple application areas and models. This versatility gives data scientists wide latitude to create the optimal low-latency solution. Visit NVIDIA GPU Cloud (NGC) to download any of these containers.

NVIDIA V100 Tensor Core GPUs leverage mixed precision to combine high throughput with low latency across every type of neural network. The NVIDIA P4 is an inference GPU designed for optimal power consumption and latency in ultra-efficient scale-out servers. Read the inference whitepaper to learn more about NVIDIA's inference platform.

Measuring inference performance involves balancing many variables. PLASTER is an acronym describing the key elements of deep learning performance: each letter identifies a factor (Programmability, Latency, Accuracy, Size of model, Throughput, Energy efficiency, Rate of learning) that must be weighed to arrive at the right set of tradeoffs and produce a successful deep learning implementation. Refer to NVIDIA's PLASTER whitepaper for more details.

NVIDIA landed top performance spots on all five MLPerf Inference 0.5 benchmarks with the best per-accelerator performance among commercially available products.

NVIDIA Performance on MLPerf Inference v0.5 Benchmarks (Server)

NVIDIA Turing 70W: Supermicro 4029GP-TRT-OTO-28: 1x NVIDIA T4, 2x Intel(R) Xeon(R) Platinum 8280 CPU @ 2.7GHz | TensorRT 6.0 | CUDA 10.1 | cuDNN 7.6.3
NVIDIA Turing 280W: SCAN 3XS DBP T496X2 Fluid: 1x TitanRTX, 2x Intel(R) Xeon(R) 8268 | TensorRT 6.0 | CUDA 10.1 | cuDNN 7.6.3


NVIDIA Performance on MLPerf Inference v0.5 Benchmarks (Offline)

NVIDIA Turing 70W: Supermicro 4029GP-TRT-OTO-28: 1x NVIDIA T4, 2x Intel(R) Xeon(R) Platinum 8280 CPU @ 2.7GHz | TensorRT 6.0 | CUDA 10.1 | cuDNN 7.6.3
NVIDIA Turing 280W: SCAN 3XS DBP T496X2 Fluid: 1x TitanRTX, 2x Intel(R) Xeon(R) 8268 | TensorRT 6.0 | CUDA 10.1 | cuDNN 7.6.3


MLPerf v0.5 Inference results for data center server form factors and offline and server scenarios retrieved from www.mlperf.org on Nov. 6, 2019, from entries Inf-0.5-15, Inf-0.5-16, Inf-0.5-19, Inf-0.5-21, Inf-0.5-22, Inf-0.5-23, Inf-0.5-25, Inf-0.5-26, Inf-0.5-27. Per-processor performance is calculated by dividing the primary metric of total performance by the number of accelerators reported.
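The per-processor arithmetic described above is a single division over the submission's primary metric. A sketch with hypothetical values (the function name and the example entry are illustrative, not MLPerf data):

```python
# Per-accelerator performance: the benchmark's primary total-performance
# metric divided by the number of accelerators reported in the submission.

def per_accelerator(total_metric, n_accelerators):
    """Normalize a submission's total metric to one accelerator."""
    if n_accelerators < 1:
        raise ValueError("submission must report at least one accelerator")
    return total_metric / n_accelerators

# e.g. a hypothetical 20-accelerator entry reporting 100,000 inputs/sec
print(per_accelerator(100_000, 20))  # 5000.0 inputs/sec per accelerator
```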

MLPerf name and logo are trademarks.

NVIDIA Performance on MLPerf Inference v0.5 Benchmarks (ResNet-50 V1.5 Offline Scenario)

MLPerf v0.5 Inference results for data center server form factors and offline scenario retrieved from www.mlperf.org on Nov. 6, 2019 (Closed Inf-0.5-25 and Inf-0.5-27 for INT8, Open Inf-0.5-460 and Inf-0.5-462 for INT4). Per-processor performance is calculated by dividing the primary metric of total performance by the number of accelerators reported. MLPerf name and logo are trademarks.

MLPerf Inference Performance

NVIDIA Turing 70W

Network | Scenario | Batch Size | Throughput | Efficiency | Latency | GPU | Server | Container | Precision | Dataset | GPU Version
MobileNet v1 | Server | - | 16,884 queries/sec | - | - | 1x T4 | Supermicro 4029GP-TRT-OTO-28 | - | - | ImageNet [224x224] | NVIDIA T4
MobileNet v1 | Offline | - | 17,726 inputs/sec | - | - | 1x T4 | Supermicro 4029GP-TRT-OTO-28 | - | - | ImageNet [224x224] | NVIDIA T4
ResNet-50 v1.5 | Server | - | 5,193 queries/sec | - | - | 1x T4 | Supermicro 4029GP-TRT-OTO-28 | - | - | ImageNet [224x224] | NVIDIA T4
ResNet-50 v1.5 | Offline | - | 5,622 inputs/sec | - | - | 1x T4 | Supermicro 4029GP-TRT-OTO-28 | - | - | ImageNet [224x224] | NVIDIA T4
SSD MobileNet v1 | Server | - | 7,078 queries/sec | - | - | 1x T4 | Supermicro 4029GP-TRT-OTO-28 | - | - | COCO [300x300] | NVIDIA T4
SSD MobileNet v1 | Offline | - | 7,609 inputs/sec | - | - | 1x T4 | Supermicro 4029GP-TRT-OTO-28 | - | - | COCO [300x300] | NVIDIA T4
SSD ResNet-34 | Server | - | 126 queries/sec | - | - | 1x T4 | Supermicro 4029GP-TRT-OTO-28 | - | - | COCO [1200x1200] | NVIDIA T4
SSD ResNet-34 | Offline | - | 137 inputs/sec | - | - | 1x T4 | Supermicro 4029GP-TRT-OTO-28 | - | - | COCO [1200x1200] | NVIDIA T4
GNMT | Server | - | 198 queries/sec | - | - | 1x T4 | Supermicro 4029GP-TRT-OTO-28 | - | - | WMT16 | NVIDIA T4
GNMT | Offline | - | 354 inputs/sec | - | - | 1x T4 | Supermicro 4029GP-TRT-OTO-28 | - | - | WMT16 | NVIDIA T4

Supermicro 4029GP-TRT-OTO-28: 1x NVIDIA T4, 2x Intel(R) Xeon(R) Platinum 8280 CPU @ 2.7GHz | TensorRT 6.0 | CUDA 10.1 | cuDNN 7.6.3

NVIDIA Turing 280W

Network | Scenario | Batch Size | Throughput | Efficiency | Latency | GPU | Server | Container | Precision | Dataset | GPU Version
MobileNet v1 | Server | - | 49,775 queries/sec | - | - | 1x TitanRTX | SCAN 3XS DBP T496X2 Fluid | - | - | ImageNet [224x224] | TitanRTX
MobileNet v1 | Offline | - | 55,597 inputs/sec | - | - | 1x TitanRTX | SCAN 3XS DBP T496X2 Fluid | - | - | ImageNet [224x224] | TitanRTX
ResNet-50 v1.5 | Server | - | 15,008 queries/sec | - | - | 1x TitanRTX | SCAN 3XS DBP T496X2 Fluid | - | - | ImageNet [224x224] | TitanRTX
ResNet-50 v1.5 | Offline | - | 16,563 inputs/sec | - | - | 1x TitanRTX | SCAN 3XS DBP T496X2 Fluid | - | - | ImageNet [224x224] | TitanRTX
SSD MobileNet v1 | Server | - | 20,503 queries/sec | - | - | 1x TitanRTX | SCAN 3XS DBP T496X2 Fluid | - | - | COCO [300x300] | TitanRTX
SSD MobileNet v1 | Offline | - | 22,945 inputs/sec | - | - | 1x TitanRTX | SCAN 3XS DBP T496X2 Fluid | - | - | COCO [300x300] | TitanRTX
SSD ResNet-34 | Server | - | 388 queries/sec | - | - | 1x TitanRTX | SCAN 3XS DBP T496X2 Fluid | - | - | COCO [1200x1200] | TitanRTX
SSD ResNet-34 | Offline | - | 415 inputs/sec | - | - | 1x TitanRTX | SCAN 3XS DBP T496X2 Fluid | - | - | COCO [1200x1200] | TitanRTX
GNMT | Server | - | 654 queries/sec | - | - | 1x TitanRTX | SCAN 3XS DBP T496X2 Fluid | - | - | WMT16 | TitanRTX
GNMT | Offline | - | 1,061 inputs/sec | - | - | 1x TitanRTX | SCAN 3XS DBP T496X2 Fluid | - | - | WMT16 | TitanRTX

SCAN 3XS DBP T496X2 Fluid: 1x TitanRTX, 2x Intel(R) Xeon(R) 8268 | TensorRT 6.0 | CUDA 10.1 | cuDNN 7.6.3

 

Inference Image Classification on CNNs with TensorRT

ResNet-50 v1.5 Throughput

DGX-1: 1x NVIDIA V100-SXM2-16GB, E5-2698 v4 2.2 GHz | TensorRT 6.0 | Batch Size = 128 | 19.11-py3 | Precision: Mixed | Dataset: Synthetic
Supermicro SYS-4029GP-TRT T4: 1x NVIDIA T4, Gold 6240 2.6 GHz | TensorRT 6.0 | Batch Size = 128 | 19.10-py3 | Precision: INT8 | Dataset: Synthetic

 
 

ResNet-50 v1.5 Latency

DGX-1: 1x NVIDIA V100-SXM2-16GB, E5-2698 v4 2.2 GHz | TensorRT 6.0 | Batch Size = 1 | 19.09-py3 | Precision: INT8 | Dataset: Synthetic
Supermicro SYS-4029GP-TRT T4: 1x NVIDIA T4, Gold 6240 2.6 GHz | TensorRT 6.0 | Batch Size = 1 | 19.09-py3 | Precision: INT8 | Dataset: Synthetic

 
 

ResNet-50 v1.5 Power Efficiency

DGX-1: 1x NVIDIA V100-SXM2-16GB, E5-2698 v4 2.2 GHz | TensorRT 6.0 | Batch Size = 128 | 19.11-py3 | Precision: Mixed | Dataset: Synthetic
Supermicro SYS-4029GP-TRT T4: 1x NVIDIA T4, Gold 6240 2.6 GHz | TensorRT 6.0 | Batch Size = 128 | 19.10-py3 | Precision: INT8 | Dataset: Synthetic

 

Inference Performance

V100 Inference Performance

Network | Network Type | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
GoogleNet | CNN | 1 | 1,608 images/sec | 14 images/sec/watt | 0.62 | 1x V100 | DGX-1 | 19.10-py3 | INT8 | Synthetic | TensorRT 6.0 | V100-SXM2-16GB
GoogleNet | CNN | 2 | 2,307 images/sec | 16 images/sec/watt | 0.87 | 1x V100 | DGX-2 | 19.09-py3 | INT8 | Synthetic | TensorRT 6.0 | V100-SXM3-32GB
GoogleNet | CNN | 8 | 5,368 images/sec | 35 images/sec/watt | 1.5 | 1x V100 | DGX-2 | 19.09-py3 | INT8 | Synthetic | TensorRT 6.0 | V100-SXM3-32GB
GoogleNet | CNN | 83 | 12,135 images/sec | 43 images/sec/watt | 6.8 | 1x V100 | DGX-1 | 19.10-py3 | Mixed | Synthetic | TensorRT 6.0 | V100-SXM2-16GB
GoogleNet | CNN | 128 | 12,753 images/sec | 46 images/sec/watt | 10 | 1x V100 | DGX-2 | 19.11-py3 | INT8 | Synthetic | TensorRT 6.0 | V100-SXM3-32GB
MobileNet V1 | CNN | 1 | 4,605 images/sec | 34 images/sec/watt | 0.22 | 1x V100 | DGX-1 | 19.11-py3 | INT8 | Synthetic | TensorRT 6.0 | V100-SXM2-16GB
MobileNet V1 | CNN | 2 | 6,627 images/sec | 48 images/sec/watt | 0.3 | 1x V100 | DGX-1 | 19.11-py3 | INT8 | Synthetic | TensorRT 6.0 | V100-SXM2-16GB
MobileNet V1 | CNN | 8 | 15,204 images/sec | 94 images/sec/watt | 0.53 | 1x V100 | DGX-1 | 19.10-py3 | INT8 | Synthetic | TensorRT 6.0 | V100-SXM2-16GB
MobileNet V1 | CNN | 220 | 31,871 images/sec | 93 images/sec/watt | 6.9 | 1x V100 | DGX-2 | 19.11-py3 | INT8 | Synthetic | TensorRT 6.0 | V100-SXM3-32GB
MobileNet V1 | CNN | 128 | 29,927 images/sec | 103 images/sec/watt | 4.3 | 1x V100 | DGX-1 | 19.11-py3 | INT8 | Synthetic | TensorRT 6.0 | V100-SXM2-16GB
ResNet-50 | CNN | 1 | 1,171 images/sec | 7.5 images/sec/watt | 0.85 | 1x V100 | DGX-2 | 19.10-py3 | INT8 | Synthetic | TensorRT 6.0 | V100-SXM3-32GB
ResNet-50 | CNN | 2 | 1,624 images/sec | 8.6 images/sec/watt | 1.2 | 1x V100 | DGX-2 | 19.11-py3 | INT8 | Synthetic | TensorRT 6.0 | V100-SXM3-32GB
ResNet-50 | CNN | 8 | 3,386 images/sec | 19 images/sec/watt | 2.4 | 1x V100 | DGX-1 | 19.09-py3 | Mixed | Synthetic | TensorRT 6.0 | V100-SXM2-16GB
ResNet-50 | CNN | 52 | 7,837 images/sec | 25 images/sec/watt | 6.6 | 1x V100 | DGX-2 | 19.09-py3 | Mixed | Synthetic | TensorRT 6.0 | V100-SXM3-32GB
ResNet-50 | CNN | 128 | 7,704 images/sec | 27 images/sec/watt | 17 | 1x V100 | DGX-1 | 19.09-py3 | Mixed | Synthetic | TensorRT 6.0 | V100-SXM2-16GB
ResNet-50 | CNN | 128 | 7,918 images/sec | 25 images/sec/watt | 16 | 1x V100 | DGX-2 | 19.10-py3 | Mixed | Synthetic | TensorRT 6.0 | V100-SXM3-32GB
ResNet-50 v1.5 | CNN | 1 | 955 images/sec | 7.6 images/sec/watt | 1.1 | 1x V100 | DGX-1 | 19.09-py3 | INT8 | Synthetic | TensorRT 6.0 | V100-SXM2-16GB
ResNet-50 v1.5 | CNN | 2 | 1,396 images/sec | 9.4 images/sec/watt | 1.4 | 1x V100 | DGX-1 | 19.09-py3 | INT8 | Synthetic | TensorRT 6.0 | V100-SXM2-16GB
ResNet-50 v1.5 | CNN | 8 | 3,296 images/sec | 18 images/sec/watt | 2.4 | 1x V100 | DGX-1 | 19.11-py3 | Mixed | Synthetic | TensorRT 6.0 | V100-SXM2-16GB
ResNet-50 v1.5 | CNN | 51 | 7,375 images/sec | 23 images/sec/watt | 6.9 | 1x V100 | DGX-2 | 19.11-py3 | Mixed | Synthetic | TensorRT 6.0 | V100-SXM3-32GB
ResNet-50 v1.5 | CNN | 128 | 7,315 images/sec | 25 images/sec/watt | 18 | 1x V100 | DGX-1 | 19.11-py3 | Mixed | Synthetic | TensorRT 6.0 | V100-SXM2-16GB
ResNet-50 v1.5 | CNN | 128 | 7,535 images/sec | 22 images/sec/watt | 17 | 1x V100 | DGX-2 | 19.10-py3 | Mixed | Synthetic | TensorRT 6.0 | V100-SXM3-32GB
VGG16 | CNN | 1 | 907 images/sec | 4.3 images/sec/watt | 1.1 | 1x V100 | DGX-1 | 19.10-py3 | INT8 | Synthetic | TensorRT 6.0 | V100-SXM2-16GB
VGG16 | CNN | 2 | 977 images/sec | 4.3 images/sec/watt | 2.1 | 1x V100 | DGX-1 | 19.10-py3 | INT8 | Synthetic | TensorRT 6.0 | V100-SXM2-16GB
VGG16 | CNN | 8 | 1,847 images/sec | 7.2 images/sec/watt | 4.3 | 1x V100 | DGX-1 | 19.10-py3 | Mixed | Synthetic | TensorRT 6.0 | V100-SXM2-16GB
VGG16 | CNN | 16 | 2,385 images/sec | 7.7 images/sec/watt | 6.7 | 1x V100 | DGX-2 | 19.10-py3 | Mixed | Synthetic | TensorRT 6.0 | V100-SXM3-32GB
VGG16 | CNN | 128 | 2,954 images/sec | 8.5 images/sec/watt | 43 | 1x V100 | DGX-2 | 19.10-py3 | Mixed | Synthetic | TensorRT 6.0 | V100-SXM3-32GB
NMT | RNN | 1 | 4,013 total tokens/sec | - | 13 | 1x V100 | DGX-1 | - | Mixed | wmt16-English-German | TensorRT 5.1 | V100-SXM2-32GB
NMT | RNN | 2 | 6,290 total tokens/sec | - | 16 | 1x V100 | DGX-1 | - | Mixed | wmt16-English-German | TensorRT 5.1 | V100-SXM2-32GB
NMT | RNN | 64 | 56,531 total tokens/sec | - | 58 | 1x V100 | DGX-1 | - | Mixed | wmt16-English-German | TensorRT 5.1 | V100-SXM2-32GB
NMT | RNN | 128 | 73,375 total tokens/sec | - | 89 | 1x V100 | DGX-1 | - | Mixed | wmt16-English-German | TensorRT 5.1 | V100-SXM2-32GB
NCF | Recommender | 1,048,576 | 61,941,700 samples/sec | 258,677 samples/sec/watt | 0.02 | 1x V100 | DGX-1 | 19.10-py3 | Mixed | MovieLens 20 Million | PyTorch 1.3.0 | V100-SXM2-16GB
BERT-BASE | Attention | 1 | 557 sentences/sec | 10.3 sentences/sec/watt | 1.8 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | V100-PCIE-16GB
BERT-BASE | Attention | 2 | 978 sentences/sec | 18.8 sentences/sec/watt | 2.0 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | V100-PCIE-16GB
BERT-BASE | Attention | 8 | 1,847 sentences/sec | 34.1 sentences/sec/watt | 4.3 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | V100-PCIE-16GB
BERT-BASE | Attention | 24 | 2,419 sentences/sec | 43.7 sentences/sec/watt | 9.9 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | V100-PCIE-16GB
BERT-BASE | Attention | 128 | 2,645 sentences/sec | 46 sentences/sec/watt | 48.4 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | V100-PCIE-16GB
BERT-LARGE | Attention | 1 | 239 sentences/sec | 4.3 sentences/sec/watt | 4.2 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | V100-PCIE-16GB
BERT-LARGE | Attention | 2 | 407 sentences/sec | 7.5 sentences/sec/watt | 4.9 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | V100-PCIE-16GB
BERT-LARGE | Attention | 4 | 562 sentences/sec | 10.6 sentences/sec/watt | 7.1 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | V100-PCIE-16GB
BERT-LARGE | Attention | 8 | 636 sentences/sec | 11.8 sentences/sec/watt | 12.6 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | V100-PCIE-16GB
BERT-LARGE | Attention | 128 | 823 sentences/sec | 13.6 sentences/sec/watt | 155.5 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | V100-PCIE-16GB

NGC: TensorRT Container | Download TensorRT 6
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power

 

T4 Inference Performance

Network | Network Type | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
GoogleNet | CNN | 1 | 1,745 images/sec | 28 images/sec/watt | 0.58 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4
GoogleNet | CNN | 2 | 2,479 images/sec | 40 images/sec/watt | 0.81 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4
GoogleNet | CNN | 8 | 6,282 images/sec | 91 images/sec/watt | 1.3 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4
GoogleNet | CNN | 60 | 8,812 images/sec | 126 images/sec/watt | 6.8 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4
GoogleNet | CNN | 128 | 9,013 images/sec | 129 images/sec/watt | 14 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4
MobileNet V1 | CNN | 1 | 4,542 images/sec | 84 images/sec/watt | 0.22 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4
MobileNet V1 | CNN | 2 | 8,124 images/sec | 131 images/sec/watt | 0.25 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4
MobileNet V1 | CNN | 8 | 14,691 images/sec | 212 images/sec/watt | 0.54 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4
MobileNet V1 | CNN | 118 | 17,632 images/sec | 253 images/sec/watt | 6.7 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4
MobileNet V1 | CNN | 128 | 17,768 images/sec | 254 images/sec/watt | 7.2 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4
ResNet-50 | CNN | 1 | 1,167 images/sec | 17 images/sec/watt | 0.86 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4
ResNet-50 | CNN | 2 | 1,804 images/sec | 26 images/sec/watt | 1.1 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4
ResNet-50 | CNN | 8 | 4,060 images/sec | 58 images/sec/watt | 2 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4
ResNet-50 | CNN | 37 | 5,336 images/sec | 77 images/sec/watt | 6.9 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4
ResNet-50 | CNN | 128 | 5,719 images/sec | 82 images/sec/watt | 22 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4
ResNet-50 v1.5 | CNN | 1 | 1,109 images/sec | 16 images/sec/watt | 0.9 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4
ResNet-50 v1.5 | CNN | 2 | 1,775 images/sec | 25 images/sec/watt | 1.1 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4
ResNet-50 v1.5 | CNN | 8 | 3,922 images/sec | 56 images/sec/watt | 2 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4
ResNet-50 v1.5 | CNN | 34 | 5,037 images/sec | 72 images/sec/watt | 6.8 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4
ResNet-50 v1.5 | CNN | 128 | 5,403 images/sec | 77 images/sec/watt | 24 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4
VGG16 | CNN | 1 | 812 images/sec | 12 images/sec/watt | 1.2 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4
VGG16 | CNN | 2 | 1,091 images/sec | 16 images/sec/watt | 1.8 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4
VGG16 | CNN | 8 | 1,628 images/sec | 23 images/sec/watt | 4.9 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4
VGG16 | CNN | 12 | 1,710 images/sec | 24 images/sec/watt | 7 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4
VGG16 | CNN | 128 | 1,804 images/sec | 26 images/sec/watt | 71 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4
NCF | Recommender | 1 | 13,328 samples/sec | 350 samples/sec/watt | 0.08 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4
NCF | Recommender | 64 | 1,014,353 samples/sec | 25,957 samples/sec/watt | 0.06 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.11-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4
NCF | Recommender | 25,000 | 51,366,158 samples/sec | 734,811 samples/sec/watt | 0.49 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4
NCF | Recommender | 100,000 | 55,348,959 samples/sec | 793,759 samples/sec/watt | 1.8 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | INT8 | Synthetic | TensorRT 6.0 | NVIDIA T4
BERT-BASE | Attention | 1 | 484 sentences/sec | 11 sentences/sec/watt | 2.07 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | NVIDIA T4
BERT-BASE | Attention | 2 | 754 sentences/sec | 17 sentences/sec/watt | 2.65 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | NVIDIA T4
BERT-BASE | Attention | 8 | 827 sentences/sec | 20 sentences/sec/watt | 9.67 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | NVIDIA T4
BERT-BASE | Attention | 128 | 800 sentences/sec | 16 sentences/sec/watt | 160.02 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | NVIDIA T4
BERT-LARGE | Attention | 1 | 171 sentences/sec | 4 sentences/sec/watt | 5.84 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | NVIDIA T4
BERT-LARGE | Attention | 2 | 168 sentences/sec | 4 sentences/sec/watt | 11.88 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | NVIDIA T4
BERT-LARGE | Attention | 8 | 244 sentences/sec | 6 sentences/sec/watt | 32.74 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | NVIDIA T4
BERT-LARGE | Attention | 128 | 254 sentences/sec | 5 sentences/sec/watt | 504.33 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | Mixed | SQuAD v1.1 | TensorRT 6.0 | NVIDIA T4

NGC: TensorRT Container | Download TensorRT 6
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
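The efficiency column divides throughput by measured board power. Assuming the T4 draws close to its 70 W board power limit at full batch (an assumption for illustration; the tables report measured power) approximately reproduces the MobileNet V1 batch-128 row:

```python
# Throughput-per-watt as reported in the efficiency column: throughput
# divided by board power. 17,768 images/sec at ~70 W gives ~254
# images/sec/watt, matching the MobileNet V1 batch-128 T4 entry.

def efficiency(throughput, board_power_watts):
    """Perf-per-watt metric used in the tables above."""
    return throughput / board_power_watts

print(round(efficiency(17_768, 70)))  # ≈ 254 images/sec/watt
```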

 

Last updated: Nov 26th, 2019