

Deep Learning Training

NVIDIA’s complete solution stack, from GPUs to libraries and containers on NVIDIA GPU Cloud (NGC), allows data scientists to get up and running with deep learning quickly. NVIDIA® Tesla® V100 Tensor Core GPUs use mixed precision to accelerate deep learning training throughput across every framework and every type of neural network. NVIDIA captured all the top spots on six benchmarks submitted to MLPerf, AI’s first industry-wide benchmark, a testament to our GPU-accelerated platform approach.

NVIDIA Performance on MLPerf AI Benchmarks

ResNet-50 Time to Solution on V100

MXNet | Batch Size: refer to the MLPerf training table below | 18.11-py3 | Precision: Mixed | Dataset: ImageNet2012 | Convergence criteria: refer to MLPerf requirements

Training Image Classification on CNNs

ResNet-50 V1.5 Throughput on NVIDIA Tesla V100

DGX-2: 8x Tesla V100 32GB, Platinum 8168 2.7 GHz | Batch Size = 256 for MXNet, 512 for all others | 19.03-py3 | Precision: Mixed | Dataset: ImageNet2012

 
 

ResNet-50 V1.5 Throughput on NVIDIA Tesla T4

Supermicro SYS-4029GP-TRT T4: 8x Tesla T4, Gold 6140 2.3 GHz | Batch Size = 208 for MXNet, 256 for all others | 19.03-py3 | Precision: Mixed | Dataset: ImageNet2012

 

Training Performance

NVIDIA Performance on MLPerf AI Benchmarks

Framework | Network | Network Type | Time to Solution | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | ResNet-50 v1.5 | CNN | 135 minutes | 8x V100 | DGX-1V | 18.11-py3 | Mixed | 208 | ImageNet2012 | V100-SXM2-16GB
MXNet | ResNet-50 v1.5 | CNN | 73.9 minutes | 16x V100 | DGX-2 | 18.11-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB
MXNet | ResNet-50 v1.5 | CNN | 70 minutes | 16x V100 | DGX-2H | 18.11-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB
MXNet | ResNet-50 v1.5 | CNN | 6.3 minutes | 640x V100 | DGX-1V Saturn | 18.11-py3 | Mixed | 26 | ImageNet2012 | V100-SXM2-16GB
MXNet | ResNet-50 v1.5 | CNN | 7.4 minutes | 512x V100 | DGX-2H Pod | 18.11-py3 | Mixed | 32 | ImageNet2012 | V100-SXM3-32GB
PyTorch | SSD | CNN | 27 minutes | 8x V100 | DGX-1V | 18.11-py3 | Mixed | 152 | COCO2017 | V100-SXM2-16GB
PyTorch | SSD | CNN | 15.9 minutes | 16x V100 | DGX-2 | 18.11-py3 | Mixed | 128 | COCO2017 | V100-SXM3-32GB
PyTorch | SSD | CNN | 14.1 minutes | 16x V100 | DGX-2H | 18.11-py3 | Mixed | 128 | COCO2017 | V100-SXM3-32GB
PyTorch | SSD | CNN | 6.5 minutes | 64x V100 | DGX-1V Saturn | 18.11-py3 | Mixed | 32 | COCO2017 | V100-SXM2-16GB
PyTorch | SSD | CNN | 5.6 minutes | 64x V100 | DGX-2H Pod | 18.11-py3 | Mixed | 32 | COCO2017 | V100-SXM3-32GB
PyTorch | Mask R-CNN | CNN | 323 minutes | 8x V100 | DGX-1V | 18.11-py3 | Mixed | 4 | COCO2014 | V100-SXM2-16GB
PyTorch | Mask R-CNN | CNN | 176.3 minutes | 16x V100 | DGX-2 | 18.11-py3 | Mixed | 4 | COCO2014 | V100-SXM3-32GB
PyTorch | Mask R-CNN | CNN | 166.9 minutes | 16x V100 | DGX-2H | 18.11-py3 | Mixed | 4 | COCO2014 | V100-SXM3-32GB
PyTorch | Mask R-CNN | CNN | 81 minutes | 64x V100 | DGX-1V Saturn | 18.11-py3 | Mixed | 2 | COCO2014 | V100-SXM2-16GB
PyTorch | Mask R-CNN | CNN | 72.1 minutes | 64x V100 | DGX-2H Pod | 18.11-py3 | Mixed | 2 | COCO2014 | V100-SXM3-32GB
PyTorch | NCF | CNN | 0.47 minutes | 8x V100 | DGX-1V | 18.11-py3 | Mixed | 131072 | MovieLens 20 Million | V100-SXM2-16GB
PyTorch | NCF | CNN | 0.4 minutes | 16x V100 | DGX-2 | 18.11-py3 | Mixed | 65536 | MovieLens 20 Million | V100-SXM3-32GB
PyTorch | NCF | CNN | 0.4 minutes | 16x V100 | DGX-2H | 18.11-py3 | Mixed | 65536 | MovieLens 20 Million | V100-SXM3-32GB
PyTorch | GNMT | RNN | 18 minutes | 8x V100 | DGX-1V | 18.11-py3 | Mixed | 128 | WMT16 English-German | V100-SXM2-16GB
PyTorch | GNMT | RNN | 10.5 minutes | 16x V100 | DGX-2 | 18.11-py3 | Mixed | 64 | WMT16 English-German | V100-SXM3-32GB
PyTorch | GNMT | RNN | 9.8 minutes | 16x V100 | DGX-2H | 18.11-py3 | Mixed | 64 | WMT16 English-German | V100-SXM3-32GB
PyTorch | GNMT | RNN | 2.8 minutes | 256x V100 | DGX-1V Saturn | 18.11-py3 | Mixed | 32 | WMT16 English-German | V100-SXM2-16GB
PyTorch | GNMT | RNN | 2.7 minutes | 256x V100 | DGX-2H Pod | 18.11-py3 | Mixed | 32 | WMT16 English-German | V100-SXM3-32GB
PyTorch | Transformer | Attention | 33 minutes | 8x V100 | DGX-1V | 18.11-py3 | Mixed | 5120 | WMT17 English-German | V100-SXM2-16GB
PyTorch | Transformer | Attention | 21.2 minutes | 16x V100 | DGX-2 | 18.11-py3 | Mixed | 10240 | WMT17 English-German | V100-SXM3-32GB
PyTorch | Transformer | Attention | 19.2 minutes | 16x V100 | DGX-2H | 18.11-py3 | Mixed | 10240 | WMT17 English-German | V100-SXM3-32GB
PyTorch | Transformer | Attention | 6.2 minutes | 192x V100 | DGX-1V Saturn | 18.11-py3 | Mixed | 2560 | WMT17 English-German | V100-SXM2-16GB
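
The time-to-solution figures can also be read as total GPU time. The sketch below multiplies GPU count by wall-clock minutes for the ResNet-50 v1.5 rows of the MLPerf table. "GPU-minutes" is an illustrative metric introduced here, not one reported by NVIDIA or MLPerf, and the runs differ in batch size and GPU variant, so treat the ratios as rough indications only.

```python
# ResNet-50 v1.5 (MXNet) rows from the MLPerf table: (server, GPU count, minutes).
RESNET50_RUNS = [
    ("DGX-1V", 8, 135.0),
    ("DGX-2", 16, 73.9),
    ("DGX-2H", 16, 70.0),
    ("DGX-2H Pod", 512, 7.4),
    ("DGX-1V Saturn", 640, 6.3),
]


def gpu_minutes(gpu_count, minutes):
    """Total GPU time consumed by a run: GPU count times wall-clock minutes."""
    return gpu_count * minutes


def relative_cost(run, baseline):
    """GPU-minutes of `run` divided by GPU-minutes of `baseline`."""
    return gpu_minutes(run[1], run[2]) / gpu_minutes(baseline[1], baseline[2])


if __name__ == "__main__":
    baseline = RESNET50_RUNS[0]  # the 8-GPU DGX-1V run
    for run in RESNET50_RUNS:
        server, gpus, minutes = run
        print(f"{server:14s} {gpus:4d} GPUs  {minutes:6.1f} min  "
              f"{gpu_minutes(gpus, minutes):7.1f} GPU-min  "
              f"{relative_cost(run, baseline):.2f}x baseline")
```

Under this metric, the large-scale runs trade a much shorter wall-clock time for more total GPU time than the single-server runs.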

V100 Training Performance

Framework | Network | Network Type | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | Inception V3 | CNN | 544 images/sec | 1x V100 | DGX-2 | 19.03-py3 | Mixed | 512 | ImageNet2012 | V100-SXM3-32GB
MXNet | Inception V3 | CNN | 4234 images/sec | 8x V100 | DGX-2 | 19.03-py3 | Mixed | 512 | ImageNet2012 | V100-SXM3-32GB
MXNet | ResNet 50 | CNN | 1442 images/sec | 1x V100 | DGX-2 | 19.02-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB
MXNet | ResNet 50 | CNN | 10530 images/sec | 8x V100 | DGX-2 | 19.02-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB
MXNet | ResNet-50 V1.5 | CNN | 1456 images/sec | 1x V100 | DGX-2 | 19.03-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB
MXNet | ResNet-50 V1.5 | CNN | 10925 images/sec | 8x V100 | DGX-2 | 19.03-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB
PyTorch | Inception V3 | CNN | 572 images/sec | 1x V100 | DGX-2 | 19.03-py3 | Mixed | 512 | ImageNet2012 | V100-SXM3-32GB
PyTorch | Inception V3 | CNN | 4156 images/sec | 8x V100 | DGX-1 | 19.03-py3 | Mixed | 512 | ImageNet2012 | V100-SXM2-32GB
PyTorch | NCF | CNN | 20982955 samples/sec | 1x V100 | DGX-2 | 19.03-py3 | Mixed | 1048576 | ImageNet2012 | V100-SXM3-32GB
PyTorch | NCF | CNN | 97202281 samples/sec | 8x V100 | DGX-2 | 19.03-py3 | Mixed | 1048576 | ImageNet2012 | V100-SXM3-32GB
PyTorch | ResNet 50 | CNN | 849 images/sec | 1x V100 | DGX-1 | 18.10-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-32GB
PyTorch | ResNet 50 | CNN | 6675 images/sec | 8x V100 | DGX-1 | 18.10-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-32GB
PyTorch | ResNet-50 V1.5 | CNN | 837 images/sec | 1x V100 | DGX-2 | 19.03-py3 | Mixed | 512 | ImageNet2012 | V100-SXM3-32GB
PyTorch | ResNet-50 V1.5 | CNN | 6175 images/sec | 8x V100 | DGX-2 | 19.03-py3 | Mixed | 512 | ImageNet2012 | V100-SXM3-32GB
PyTorch | SSD | CNN | 231 images/sec | 1x V100 | DGX-2 | 19.03-py3 | Mixed | 64 | ImageNet2012 | V100-SXM3-32GB
PyTorch | SSD | CNN | 1642 images/sec | 8x V100 | DGX-2 | 19.03-py3 | Mixed | 64 | ImageNet2012 | V100-SXM3-32GB
TensorFlow | GoogleNet | CNN | 1856 images/sec | 1x V100 | DGX-2 | 19.03-py3 | Mixed | 1024 | ImageNet2012 | V100-SXM3-32GB
TensorFlow | GoogleNet | CNN | 13076 images/sec | 8x V100 | DGX-2 | 19.03-py3 | Mixed | 1536 | ImageNet2012 | V100-SXM3-32GB
TensorFlow | Inception V3 | CNN | 574 images/sec | 1x V100 | DGX-2 | 19.03-py3 | Mixed | 384 | ImageNet2012 | V100-SXM3-32GB
TensorFlow | Inception V3 | CNN | 4012 images/sec | 8x V100 | DGX-2 | 19.03-py3 | Mixed | 384 | ImageNet2012 | V100-SXM3-32GB
TensorFlow | NCF | CNN | 14302190 samples/sec | 1x V100 | DGX-2 | 19.03-py3 | Mixed | 1048576 | ImageNet2012 | V100-SXM3-32GB
TensorFlow | NCF | CNN | 59415774 samples/sec | 8x V100 | DGX-2 | 19.03-py3 | Mixed | 1048576 | ImageNet2012 | V100-SXM3-32GB
TensorFlow | ResNet 50 | CNN | 921 images/sec | 1x V100 | DGX-2 | 19.02-py3 | Mixed | 512 | ImageNet2012 | V100-SXM3-32GB
TensorFlow | ResNet 50 | CNN | 6505 images/sec | 8x V100 | DGX-2 | 19.02-py3 | Mixed | 512 | ImageNet2012 | V100-SXM3-32GB
TensorFlow | ResNet-50 V1.5 | CNN | 848 images/sec | 1x V100 | DGX-2 | 19.03-py3 | Mixed | 512 | ImageNet2012 | V100-SXM3-32GB
TensorFlow | ResNet-50 V1.5 | CNN | 6015 images/sec | 8x V100 | DGX-2 | 19.03-py3 | Mixed | 512 | ImageNet2012 | V100-SXM3-32GB
PyTorch | GNMT V2 | RNN | 84737 total tokens/sec | 1x V100 | DGX-2 | 19.03-py3 | Mixed | 512 | WMT16 English-German | V100-SXM3-32GB
PyTorch | GNMT V2 | RNN | 598008 total tokens/sec | 8x V100 | DGX-2 | 19.03-py3 | Mixed | 512 | WMT16 English-German | V100-SXM3-32GB
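
Each network in the V100 table is listed at 1x and 8x V100, so multi-GPU scaling can be derived directly from the throughput column. The following is a minimal sketch using a few rows copied from the table. The function name and the "efficiency" definition (speedup divided by GPU count) are illustrative choices, not terms used by the source.

```python
def throughput_scaling(single_gpu, multi_gpu, num_gpus):
    """Return (speedup, efficiency) of a multi-GPU run relative to one GPU.

    Efficiency is speedup divided by GPU count; 1.0 would be perfectly linear.
    """
    speedup = multi_gpu / single_gpu
    return speedup, speedup / num_gpus


# (framework, network, 1x V100 throughput, 8x V100 throughput), from the table.
ROWS = [
    ("MXNet", "ResNet-50 V1.5", 1456, 10925),
    ("PyTorch", "ResNet-50 V1.5", 837, 6175),
    ("TensorFlow", "ResNet-50 V1.5", 848, 6015),
    ("PyTorch", "GNMT V2", 84737, 598008),
]

if __name__ == "__main__":
    for framework, network, one_gpu, eight_gpu in ROWS:
        speedup, efficiency = throughput_scaling(one_gpu, eight_gpu, 8)
        print(f"{framework:10s} {network:15s} "
              f"speedup {speedup:.2f}x  efficiency {efficiency:.1%}")
```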

T4 Training Performance

Framework | Network | Network Type | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | Inception V3 | CNN | 173 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | Mixed | 208 | ImageNet2012 | Tesla T4
MXNet | Inception V3 | CNN | 1372 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | Mixed | 208 | ImageNet2012 | Tesla T4
MXNet | ResNet 50 | CNN | 425 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.02-py3 | Mixed | 208 | ImageNet2012 | Tesla T4
MXNet | ResNet 50 | CNN | 3322 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.02-py3 | Mixed | 208 | ImageNet2012 | Tesla T4
MXNet | ResNet-50 V1.5 | CNN | 455 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | Mixed | 208 | ImageNet2012 | Tesla T4
MXNet | ResNet-50 V1.5 | CNN | 3590 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | Mixed | 208 | ImageNet2012 | Tesla T4
PyTorch | Inception V3 | CNN | 179 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | Mixed | 256 | ImageNet2012 | Tesla T4
PyTorch | Inception V3 | CNN | 1405 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | Mixed | 256 | ImageNet2012 | Tesla T4
PyTorch | NCF | CNN | 6622674 samples/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | Mixed | 1048576 | ImageNet2012 | Tesla T4
PyTorch | NCF | CNN | 25115319 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | Mixed | 1048576 | ImageNet2012 | Tesla T4
PyTorch | ResNet 50 | CNN | 250 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.01-py3 | Mixed | 256 | ImageNet2012 | Tesla T4
PyTorch | ResNet 50 | CNN | 2054 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.01-py3 | Mixed | 256 | ImageNet2012 | Tesla T4
PyTorch | ResNet-50 V1.5 | CNN | 262 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | Mixed | 256 | ImageNet2012 | Tesla T4
PyTorch | ResNet-50 V1.5 | CNN | 2082 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | Mixed | 256 | ImageNet2012 | Tesla T4
PyTorch | SSD | CNN | 73 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | Mixed | 64 | ImageNet2012 | Tesla T4
PyTorch | SSD | CNN | 587 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | Mixed | 64 | ImageNet2012 | Tesla T4
TensorFlow | GoogleNet | CNN | 577 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | Mixed | 512 | ImageNet2012 | Tesla T4
TensorFlow | GoogleNet | CNN | 4482 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | Mixed | 512 | ImageNet2012 | Tesla T4
TensorFlow | Inception V3 | CNN | 181 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | Mixed | 192 | ImageNet2012 | Tesla T4
TensorFlow | Inception V3 | CNN | 1368 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | Mixed | 192 | ImageNet2012 | Tesla T4
TensorFlow | ResNet 50 | CNN | 294 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.02-py3 | Mixed | 256 | ImageNet2012 | Tesla T4
TensorFlow | ResNet 50 | CNN | 2262 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.02-py3 | Mixed | 256 | ImageNet2012 | Tesla T4
TensorFlow | ResNet-50 V1.5 | CNN | 274 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | Mixed | 256 | ImageNet2012 | Tesla T4
TensorFlow | ResNet-50 V1.5 | CNN | 2170 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | Mixed | 256 | ImageNet2012 | Tesla T4
PyTorch | GNMT V2 | RNN | 25718 total tokens/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | Mixed | 256 | WMT16 English-German | Tesla T4
PyTorch | GNMT V2 | RNN | 183955 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | Mixed | 256 | WMT16 English-German | Tesla T4
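
A throughput figure becomes easier to interpret when converted to time per epoch. The sketch below does this for the MXNet ResNet-50 V1.5 rows on the Supermicro T4 server. The training-set size of 1,281,167 images is the commonly cited ImageNet2012 (ILSVRC-2012) figure and is an assumption here, since this page does not state it. The estimate also assumes the reported throughput is sustained for the whole epoch.

```python
# Assumption (not stated on this page): ImageNet2012 training-set size.
IMAGENET2012_TRAIN_IMAGES = 1_281_167


def epoch_seconds(images_per_sec, dataset_size=IMAGENET2012_TRAIN_IMAGES):
    """Estimated seconds for one pass over the dataset at a sustained throughput."""
    return dataset_size / images_per_sec


if __name__ == "__main__":
    # MXNet ResNet-50 V1.5 throughput from the table, 1-GPU and 8-GPU rows.
    for label, throughput in [("1 GPU", 455), ("8 GPUs", 3590)]:
        seconds = epoch_seconds(throughput)
        print(f"{label}: {seconds:.0f} s per epoch ({seconds / 60:.1f} min)")
```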

 

NVIDIA Deep Learning Inference Performance

NVIDIA® TensorRT™ running on NVIDIA GPUs enables the most efficient deep learning inference performance across multiple application areas and models. This versatility gives data scientists wide latitude to create the optimal low-latency solution. Visit NVIDIA GPU Cloud (NGC) to download any of these containers.

NVIDIA Tesla® V100 Tensor Core GPUs use mixed precision to combine high throughput with low latency across every type of neural network. Tesla P4 is an inference GPU designed for optimal power consumption and latency in ultra-efficient scale-out servers. Read the inference whitepaper to learn more about NVIDIA’s inference platform.

Measuring inference performance involves balancing many variables. PLASTER is an acronym that describes the key elements for measuring deep learning performance. Each letter identifies a factor (Programmability, Latency, Accuracy, Size of Model, Throughput, Energy Efficiency, Rate of Learning) that must be considered to arrive at the right set of tradeoffs and to produce a successful deep learning implementation. Refer to NVIDIA’s PLASTER whitepaper for more details.

Inference Image Classification on CNNs with TensorRT

ResNet-50 Throughput

DGX-2: 1x Tesla V100-SXM3-32GB, Platinum 8168 2.7 GHz | TensorRT 5.1 | Batch Size = 128 | 19.03-py3 | Precision: Mixed | Dataset: Synthetic
Supermicro SYS-4029GP-TRT T4: 1x Tesla T4, Gold 6140 2.3 GHz | TensorRT 5.1 | Batch Size = 128 | 19.03-py3 | Precision: INT8 | Dataset: Synthetic

 
 

ResNet-50 Latency

DGX-2: 1x Tesla V100-SXM3-32GB, Platinum 8168 2.7 GHz | TensorRT 5.1 | Batch Size = 1 | 19.03-py3 | Precision: INT8 | Dataset: Synthetic
Supermicro SYS-4029GP-TRT T4: 1x Tesla T4, Gold 6140 2.3 GHz | TensorRT 5.1 | Batch Size = 1 | 19.03-py3 | Precision: INT8 | Dataset: Synthetic

 
 

ResNet-50 Power Efficiency

DGX-2: 1x Tesla V100-SXM3-32GB, Platinum 8168 2.7 GHz | TensorRT 5.1 | Batch Size = 128 | 19.03-py3 | Precision: Mixed | Dataset: Synthetic
Supermicro SYS-4029GP-TRT T4: 1x Tesla T4, Gold 6140 2.3 GHz | TensorRT 5.1 | Batch Size = 128 | 19.03-py3 | Precision: INT8 | Dataset: Synthetic

 

Inference Performance

V100 Inference Performance

Network | Network Type | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | GPU Version
GoogleNet | CNN | 1 | 1719 images/sec | 13 images/sec/watt | 0.58 | 1x V100 | DGX-2 | 19.03-py3 | INT8 | Synthetic | V100-SXM3-32GB
GoogleNet | CNN | 2 | 2214 images/sec | 16 images/sec/watt | 0.9 | 1x V100 | DGX-2 | 19.03-py3 | INT8 | Synthetic | V100-SXM3-32GB
GoogleNet | CNN | 4 | 3210 images/sec | 22 images/sec/watt | 1.3 | 1x V100 | DGX-2 | 19.03-py3 | INT8 | Synthetic | V100-SXM3-32GB
GoogleNet | CNN | 8 | 5335 images/sec | 30 images/sec/watt | 1.5 | 1x V100 | DGX-2 | 19.03-py3 | INT8 | Synthetic | V100-SXM3-32GB
GoogleNet | CNN | 64 | 11807 images/sec | 37 images/sec/watt | 5.4 | 1x V100 | DGX-2 | 19.03-py3 | INT8 | Synthetic | V100-SXM3-32GB
GoogleNet | CNN | 84 | 12143 images/sec | 37 images/sec/watt | 6.9 | 1x V100 | DGX-2 | 19.03-py3 | INT8 | Synthetic | V100-SXM3-32GB
GoogleNet | CNN | 128 | 12774 images/sec | 38 images/sec/watt | 10 | 1x V100 | DGX-2 | 19.03-py3 | INT8 | Synthetic | V100-SXM3-32GB
MobileNet V2 | CNN | 1 | 1737 images/sec | 18 images/sec/watt | 0.58 | 1x V100 | DGX-1 | - | INT8 | Synthetic | V100-SXM2-16GB
MobileNet V2 | CNN | 2 | 2886 images/sec | 30 images/sec/watt | 0.69 | 1x V100 | DGX-1 | - | INT8 | Synthetic | V100-SXM2-16GB
MobileNet V2 | CNN | 4 | 5432 images/sec | 48 images/sec/watt | 0.74 | 1x V100 | DGX-1 | - | INT8 | Synthetic | V100-SXM2-16GB
MobileNet V2 | CNN | 8 | 8848 images/sec | 63 images/sec/watt | 0.9 | 1x V100 | DGX-1 | - | INT8 | Synthetic | V100-SXM2-16GB
MobileNet V2 | CNN | 32 | 17649 images/sec | 77 images/sec/watt | 1.8 | 1x V100 | DGX-1 | - | INT8 | Synthetic | V100-SXM2-16GB
MobileNet V2 | CNN | 64 | 21167 images/sec | 81 images/sec/watt | 3 | 1x V100 | DGX-1 | - | INT8 | Synthetic | V100-SXM2-16GB
MobileNet V2 | CNN | 128 | 23262 images/sec | 81 images/sec/watt | 5.5 | 1x V100 | DGX-1 | - | INT8 | Synthetic | V100-SXM2-16GB
ResNet-50 | CNN | 1 | 1165 images/sec | 7.2 images/sec/watt | 0.86 | 1x V100 | DGX-2 | 19.03-py3 | INT8 | Synthetic | V100-SXM3-32GB
ResNet-50 | CNN | 2 | 1608 images/sec | 9.2 images/sec/watt | 1.2 | 1x V100 | DGX-2 | 19.03-py3 | INT8 | Synthetic | V100-SXM3-32GB
ResNet-50 | CNN | 4 | 2028 images/sec | 14.0 images/sec/watt | 2 | 1x V100 | DGX-2 | 19.03-py3 | Mixed | Synthetic | V100-SXM3-32GB
ResNet-50 | CNN | 8 | 3459 images/sec | 18.0 images/sec/watt | 2.3 | 1x V100 | DGX-2 | 19.03-py3 | Mixed | Synthetic | V100-SXM3-32GB
ResNet-50 | CNN | 39 | 6043 images/sec | 18.0 images/sec/watt | 6.5 | 1x V100 | DGX-2 | 19.03-py3 | INT8 | Synthetic | V100-SXM3-32GB
ResNet-50 | CNN | 64 | 7064 images/sec | 22.0 images/sec/watt | 9.1 | 1x V100 | DGX-2 | 19.03-py3 | Mixed | Synthetic | V100-SXM3-32GB
ResNet-50 | CNN | 128 | 7844 images/sec | 23.0 images/sec/watt | 16 | 1x V100 | DGX-2 | 19.03-py3 | Mixed | Synthetic | V100-SXM3-32GB
VGG16 | CNN | 1 | 743 images/sec | 3.1 images/sec/watt | 1.4 | 1x V100 | DGX-2 | 19.02-py3 | Mixed | Synthetic | V100-SXM3-32GB
VGG16 | CNN | 2 | 1201 images/sec | 4.5 images/sec/watt | 1.7 | 1x V100 | DGX-2 | 19.02-py3 | Mixed | Synthetic | V100-SXM3-32GB
VGG16 | CNN | 4 | 1662 images/sec | 5.9 images/sec/watt | 2.4 | 1x V100 | DGX-2 | 19.02-py3 | Mixed | Synthetic | V100-SXM3-32GB
VGG16 | CNN | 8 | 2138 images/sec | 6.4 images/sec/watt | 3.7 | 1x V100 | DGX-2 | 19.02-py3 | Mixed | Synthetic | V100-SXM3-32GB
VGG16 | CNN | 32 | 2680 images/sec | 7.9 images/sec/watt | 12 | 1x V100 | DGX-2 | 19.02-py3 | Mixed | Synthetic | V100-SXM3-32GB
VGG16 | CNN | 64 | 2748 images/sec | 8.2 images/sec/watt | 23 | 1x V100 | DGX-2 | 19.02-py3 | Mixed | Synthetic | V100-SXM3-32GB
VGG16 | CNN | 128 | 2795 images/sec | 8.2 images/sec/watt | 46 | 1x V100 | DGX-2 | 19.02-py3 | Mixed | Synthetic | V100-SXM3-32GB

TensorRT 5.1, except TensorRT 5.0 for MobileNet V2 and VGG16
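
The throughput, efficiency, and latency columns in the V100 table are related, and the relationship can be checked with simple arithmetic. The sketch below assumes the Latency column is in milliseconds, which the batch-1 rows are consistent with (1 image at 1165 images/sec takes about 0.86 ms). The "implied power" value is throughput divided by efficiency; it is a derived estimate for illustration, not a figure reported in the table.

```python
def implied_latency_ms(batch_size, images_per_sec):
    """Milliseconds to process one batch if the reported throughput is sustained."""
    return batch_size / images_per_sec * 1000.0


def implied_power_watts(images_per_sec, images_per_sec_per_watt):
    """Power draw implied by dividing throughput by energy efficiency."""
    return images_per_sec / images_per_sec_per_watt


# ResNet-50 on 1x V100, copied from the table:
# (batch size, images/sec, images/sec/watt, reported latency)
RESNET50_V100 = [
    (1, 1165, 7.2, 0.86),
    (8, 3459, 18.0, 2.3),
    (128, 7844, 23.0, 16),
]

if __name__ == "__main__":
    for batch, throughput, efficiency, reported in RESNET50_V100:
        print(f"batch {batch:3d}: implied latency "
              f"{implied_latency_ms(batch, throughput):6.2f} ms "
              f"(reported {reported}), implied power "
              f"{implied_power_watts(throughput, efficiency):5.0f} W")
```

The implied and reported latencies agree to within rounding, which is a useful sanity check when reading individual rows.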

 

T4 Inference Performance

Network | Network Type | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | GPU Version
GoogleNet | CNN | 1 | 1735 images/sec | 27 images/sec/watt | 0.58 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | INT8 | Synthetic | Tesla T4
GoogleNet | CNN | 2 | 2402 images/sec | 37 images/sec/watt | 0.83 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | INT8 | Synthetic | Tesla T4
GoogleNet | CNN | 4 | 4049 images/sec | 62 images/sec/watt | 0.99 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | INT8 | Synthetic | Tesla T4
GoogleNet | CNN | 8 | 5684 images/sec | 82 images/sec/watt | 1.4 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | INT8 | Synthetic | Tesla T4
GoogleNet | CNN | 32 | 7302 images/sec | 105 images/sec/watt | 4.4 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | INT8 | Synthetic | Tesla T4
GoogleNet | CNN | 64 | 7435 images/sec | 107 images/sec/watt | 8.6 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | INT8 | Synthetic | Tesla T4
GoogleNet | CNN | 128 | 7546 images/sec | 108 images/sec/watt | 17 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | INT8 | Synthetic | Tesla T4
MobileNet V2 | CNN | 1 | 1766 images/sec | 34 images/sec/watt | 0.57 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | INT8 | Synthetic | Tesla T4
MobileNet V2 | CNN | 2 | 3235 images/sec | 55 images/sec/watt | 0.62 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | INT8 | Synthetic | Tesla T4
MobileNet V2 | CNN | 4 | 5262 images/sec | 79 images/sec/watt | 0.76 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | INT8 | Synthetic | Tesla T4
MobileNet V2 | CNN | 8 | 7251 images/sec | 106 images/sec/watt | 1.1 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | INT8 | Synthetic | Tesla T4
MobileNet V2 | CNN | 32 | 8869 images/sec | 129 images/sec/watt | 3.6 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | INT8 | Synthetic | Tesla T4
MobileNet V2 | CNN | 64 | 9003 images/sec | 131 images/sec/watt | 7.1 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | INT8 | Synthetic | Tesla T4
MobileNet V2 | CNN | 128 | 9059 images/sec | 131 images/sec/watt | 14 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | INT8 | Synthetic | Tesla T4
ResNet-50 | CNN | 1 | 1087 images/sec | 16 images/sec/watt | 0.9 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | INT8 | Synthetic | Tesla T4
ResNet-50 | CNN | 2 | 1735 images/sec | 26 images/sec/watt | 1.2 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | INT8 | Synthetic | Tesla T4
ResNet-50 | CNN | 4 | 2887 images/sec | 42 images/sec/watt | 1.4 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | INT8 | Synthetic | Tesla T4
ResNet-50 | CNN | 8 | 3766 images/sec | 54 images/sec/watt | 2.1 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | INT8 | Synthetic | Tesla T4
ResNet-50 | CNN | 32 | 4677 images/sec | 67 images/sec/watt | 6.8 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | INT8 | Synthetic | Tesla T4
ResNet-50 | CNN | 64 | 4815 images/sec | 69 images/sec/watt | 13 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | INT8 | Synthetic | Tesla T4
ResNet-50 | CNN | 128 | 4944 images/sec | 71 images/sec/watt | 26 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.03-py3 | INT8 | Synthetic | Tesla T4
VGG16 | CNN | 1 | 287 images/sec | 4.2 images/sec/watt | 3.5 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.02-py3 | INT8 | Synthetic | Tesla T4
VGG16 | CNN | 2 | 487 images/sec | 7.0 images/sec/watt | 4.1 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.02-py3 | INT8 | Synthetic | Tesla T4
VGG16 | CNN | 4 | 812 images/sec | 12.0 images/sec/watt | 4.9 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.02-py3 | INT8 | Synthetic | Tesla T4
VGG16 | CNN | 8 | 1236 images/sec | 18.0 images/sec/watt | 6.5 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.02-py3 | INT8 | Synthetic | Tesla T4
VGG16 | CNN | 32 | 1598 images/sec | 23.0 images/sec/watt | 20 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.02-py3 | INT8 | Synthetic | Tesla T4
VGG16 | CNN | 64 | 1594 images/sec | 23.0 images/sec/watt | 40 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.02-py3 | INT8 | Synthetic | Tesla T4
VGG16 | CNN | 128 | 1716 images/sec | 25.0 images/sec/watt | 75 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.02-py3 | INT8 | Synthetic | Tesla T4

TensorRT 5.1, except TensorRT 5.0 for MobileNet V2 and VGG16
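
Choosing a batch size for deployment usually means finding the highest throughput that still meets a latency target. The sketch below picks such a row from the ResNet-50 T4 results. The 7 ms and 2 ms budgets are hypothetical values chosen for illustration, the helper name is made up for this example, and the latency values are assumed to be in milliseconds.

```python
# ResNet-50 on 1x Tesla T4, copied from the table:
# (batch size, throughput in images/sec, latency in ms)
RESNET50_T4 = [
    (1, 1087, 0.9),
    (2, 1735, 1.2),
    (4, 2887, 1.4),
    (8, 3766, 2.1),
    (32, 4677, 6.8),
    (64, 4815, 13),
    (128, 4944, 26),
]


def best_under_budget(rows, budget_ms):
    """Return the row with the highest throughput whose latency fits the budget.

    Returns None when no row meets the budget.
    """
    feasible = [row for row in rows if row[2] <= budget_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda row: row[1])


if __name__ == "__main__":
    for budget in (7.0, 2.0):  # hypothetical latency targets in ms
        print(f"budget {budget} ms -> {best_under_budget(RESNET50_T4, budget)}")
```

In these rows, moving from batch size 32 to 128 raises throughput by only about 6% while latency roughly quadruples, so a tight latency target costs little throughput here.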

 

Last updated: April 9th, 2019