Deep Learning Training

NVIDIA’s complete solution stack, from GPUs to libraries to containers on NVIDIA GPU Cloud (NGC), allows data scientists to get up and running with deep learning quickly. NVIDIA® Tesla® V100 Tensor Core GPUs leverage mixed precision to accelerate deep learning training throughput across every framework and every type of neural network. NVIDIA captured the top spots in all six benchmarks it submitted to MLPerf, the first industry-wide AI benchmark, a testament to our GPU-accelerated platform approach.
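The mixed-precision training behind these results keeps most math in FP16 on Tensor Cores while maintaining FP32 master weights and applying loss scaling. The sketch below illustrates the idea using PyTorch's automatic mixed precision API (torch.cuda.amp); note that this API post-dates the 18.xx NGC containers benchmarked on this page, which relied on framework-specific tooling for the same purpose, and the model, data, and hyperparameters here are placeholders.

```python
# Minimal mixed-precision training sketch for a ResNet-50 (placeholder data).
# Uses torch.cuda.amp, which post-dates the 18.xx NGC containers in these tables;
# the underlying idea (FP16 math on Tensor Cores, FP32 master weights, loss scaling)
# is the same one those containers used.
import torch
import torchvision

model = torchvision.models.resnet50().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling keeps FP16 gradients representable

# Placeholder batch; a real run would iterate over an ImageNet DataLoader.
images = torch.randn(64, 3, 224, 224, device="cuda")
labels = torch.randint(0, 1000, (64,), device="cuda")

for step in range(10):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():      # forward pass in mixed precision
        loss = criterion(model(images), labels)
    scaler.scale(loss).backward()        # backward on the scaled loss
    scaler.step(optimizer)               # unscales gradients, then runs the optimizer
    scaler.update()                      # adjusts the loss scale for the next iteration
```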

NVIDIA Performance on MLPerf AI Benchmarks

ResNet-50 Time to Solution on V100

MXNet | Batch Size: refer to the MLPerf training table below | 18.11-py3 | Precision: Mixed | Dataset: ImageNet2012 | Convergence criteria: refer to MLPerf requirements

Training Image Classification on CNNs

ResNet-50 Throughput on NVIDIA Tesla V100

DGX-1: 8x Tesla V100 32GB, E5-2698 v4 2.2 GHz | Batch Size = 256 for PyTorch, 512 for all others | PyTorch uses 18.10-py3, all others use 18.09-py3 | Precision: Mixed | Dataset: ImageNet2012

 
 

ResNet-50 Throughput on NVIDIA Tesla T4

Supermicro SYS-4029GP-TRT T4: 8x Tesla T4, Gold 6140 2.3 GHz | Batch Size = 96 for MXNet, 256 for all others | 18.11-py3 | Precision: Mixed | Dataset: ImageNet2012

 

Training Performance

NVIDIA Performance on MLPerf AI Benchmarks

Framework | Network | Network Type | Time to Solution | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | ResNet-50 v1.5 | CNN | 135 minutes | 8x V100 | DGX-1V | 18.11-py3 | Mixed | 208 | ImageNet2012 | V100-SXM2-16GB
MXNet | ResNet-50 v1.5 | CNN | 73.9 minutes | 16x V100 | DGX-2 | 18.11-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB
MXNet | ResNet-50 v1.5 | CNN | 70 minutes | 16x V100 | DGX-2H | 18.11-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB
MXNet | ResNet-50 v1.5 | CNN | 6.3 minutes | 640x V100 | DGX-1V Saturn | 18.11-py3 | Mixed | 26 | ImageNet2012 | V100-SXM2-16GB
MXNet | ResNet-50 v1.5 | CNN | 7.4 minutes | 512x V100 | DGX-2H System | 18.11-py3 | Mixed | 32 | ImageNet2012 | V100-SXM3-32GB
PyTorch | SSD | CNN | 27 minutes | 8x V100 | DGX-1V | 18.11-py3 | Mixed | 152 | COCO2017 | V100-SXM2-16GB
PyTorch | SSD | CNN | 15.9 minutes | 16x V100 | DGX-2 | 18.11-py3 | Mixed | 128 | COCO2017 | V100-SXM3-32GB
PyTorch | SSD | CNN | 14.1 minutes | 16x V100 | DGX-2H | 18.11-py3 | Mixed | 128 | COCO2017 | V100-SXM3-32GB
PyTorch | SSD | CNN | 6.5 minutes | 64x V100 | DGX-1V Saturn | 18.11-py3 | Mixed | 32 | COCO2017 | V100-SXM2-16GB
PyTorch | SSD | CNN | 5.6 minutes | 64x V100 | DGX-2H System | 18.11-py3 | Mixed | 32 | COCO2017 | V100-SXM3-32GB
PyTorch | Mask R-CNN | CNN | 323 minutes | 8x V100 | DGX-1V | 18.11-py3 | Mixed | 4 | COCO2014 | V100-SXM2-16GB
PyTorch | Mask R-CNN | CNN | 176.3 minutes | 16x V100 | DGX-2 | 18.11-py3 | Mixed | 4 | COCO2014 | V100-SXM3-32GB
PyTorch | Mask R-CNN | CNN | 166.9 minutes | 16x V100 | DGX-2H | 18.11-py3 | Mixed | 4 | COCO2014 | V100-SXM3-32GB
PyTorch | Mask R-CNN | CNN | 81 minutes | 64x V100 | DGX-1V Saturn | 18.11-py3 | Mixed | 2 | COCO2014 | V100-SXM2-16GB
PyTorch | Mask R-CNN | CNN | 72.1 minutes | 64x V100 | DGX-2H System | 18.11-py3 | Mixed | 2 | COCO2014 | V100-SXM3-32GB
PyTorch | NCF | CNN | 0.47 minutes | 8x V100 | DGX-1V | 18.11-py3 | Mixed | 131072 | MovieLens 20 Million | V100-SXM2-16GB
PyTorch | NCF | CNN | 0.4 minutes | 16x V100 | DGX-2 | 18.11-py3 | Mixed | 65536 | MovieLens 20 Million | V100-SXM3-32GB
PyTorch | NCF | CNN | 0.4 minutes | 16x V100 | DGX-2H | 18.11-py3 | Mixed | 65536 | MovieLens 20 Million | V100-SXM3-32GB
PyTorch | GNMT | RNN | 18 minutes | 8x V100 | DGX-1V | 18.11-py3 | Mixed | 128 | WMT16 English-German | V100-SXM2-16GB
PyTorch | GNMT | RNN | 10.5 minutes | 16x V100 | DGX-2 | 18.11-py3 | Mixed | 64 | WMT16 English-German | V100-SXM3-32GB
PyTorch | GNMT | RNN | 9.8 minutes | 16x V100 | DGX-2H | 18.11-py3 | Mixed | 64 | WMT16 English-German | V100-SXM3-32GB
PyTorch | GNMT | RNN | 2.8 minutes | 256x V100 | DGX-1V Saturn | 18.11-py3 | Mixed | 32 | WMT16 English-German | V100-SXM2-16GB
PyTorch | GNMT | RNN | 2.7 minutes | 256x V100 | DGX-2H System | 18.11-py3 | Mixed | 32 | WMT16 English-German | V100-SXM3-32GB
PyTorch | Transformer | Attention | 33 minutes | 8x V100 | DGX-1V | 18.11-py3 | Mixed | 5120 | WMT17 English-German | V100-SXM2-16GB
PyTorch | Transformer | Attention | 21.2 minutes | 16x V100 | DGX-2 | 18.11-py3 | Mixed | 10240 | WMT17 English-German | V100-SXM3-32GB
PyTorch | Transformer | Attention | 19.2 minutes | 16x V100 | DGX-2H | 18.11-py3 | Mixed | 10240 | WMT17 English-German | V100-SXM3-32GB
PyTorch | Transformer | Attention | 6.2 minutes | 192x V100 | DGX-1V Saturn | 18.11-py3 | Mixed | 2560 | WMT17 English-German | V100-SXM2-16GB

V100 Training Performance

Framework | Network | Network Type | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | Inception V3 | CNN | 506 images/sec | 1x V100 | DGX-1 | 18.08-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-32GB
MXNet | Inception V3 | CNN | 3818 images/sec | 8x V100 | DGX-1 | 18.08-py3 | Mixed | 512 | ImageNet2012 | V100-SXM2-32GB
MXNet | ResNet-50 | CNN | 1232 images/sec | 1x V100 | DGX-1 | 18.09-py3 | Mixed | 512 | ImageNet2012 | V100-SXM2-32GB
MXNet | ResNet-50 | CNN | 9598 images/sec | 8x V100 | DGX-1 | 18.09-py3 | Mixed | 512 | ImageNet2012 | V100-SXM2-32GB
PyTorch | Inception V3 | CNN | 509 images/sec | 1x V100 | DGX-1 | 18.11-py3 | Mixed | 384 | ImageNet2012 | V100-SXM2-32GB
PyTorch | Inception V3 | CNN | 3909 images/sec | 8x V100 | DGX-1 | 18.10-py3 | Mixed | 384 | ImageNet2012 | V100-SXM2-32GB
PyTorch | ResNet-50 | CNN | 875 images/sec | 1x V100 | DGX-1 | 18.08-py3 | Mixed | 512 | ImageNet2012 | V100-SXM2-32GB
PyTorch | ResNet-50 | CNN | 6675 images/sec | 8x V100 | DGX-1 | 18.10-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-32GB
TensorFlow | GoogleNet | CNN | 1738 images/sec | 1x V100 | DGX-1 | 18.11-py3 | Mixed | 1024 | ImageNet2012 | V100-SXM2-32GB
TensorFlow | GoogleNet | CNN | 11645 images/sec | 8x V100 | DGX-1 | 18.11-py3 | Mixed | 1024 | ImageNet2012 | V100-SXM2-32GB
TensorFlow | Inception V3 | CNN | 519 images/sec | 1x V100 | DGX-1 | 18.11-py3 | Mixed | 384 | ImageNet2012 | V100-SXM2-32GB
TensorFlow | Inception V3 | CNN | 3971 images/sec | 8x V100 | DGX-1 | 18.11-py3 | Mixed | 384 | ImageNet2012 | V100-SXM2-32GB
TensorFlow | ResNet-50 | CNN | 870 images/sec | 1x V100 | DGX-1 | 18.09-py3 | Mixed | 512 | ImageNet2012 | V100-SXM2-32GB
TensorFlow | ResNet-50 | CNN | 6720 images/sec | 8x V100 | DGX-1 | 18.09-py3 | Mixed | 512 | ImageNet2012 | V100-SXM2-32GB
MXNet | Sockeye (OpenNMT) | RNN | 36088 total tokens/sec | 1x V100 | DGX-1 | 18.08-py3 | FP32 | 512 | WMT15 German-English | V100-SXM2-32GB
MXNet | Sockeye (OpenNMT) | RNN | 250396 total tokens/sec | 8x V100 | DGX-1 | 18.07-py3 | FP32 | 512 | WMT15 German-English | V100-SXM2-32GB
PyTorch | GNMT | RNN | 64763 total tokens/sec | 1x V100 | DGX-1 | 18.09-py3 | Mixed | 512 | WMT16 English-German | V100-SXM2-32GB
PyTorch | GNMT | RNN | 503936 total tokens/sec | 8x V100 | DGX-1 | 18.09-py3 | Mixed | 384 | WMT16 English-German | V100-SXM2-32GB

T4 Training Performance

Framework | Network | Network Type | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | Inception V3 | CNN | 160 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 18.11-py3 | Mixed | 192 | ImageNet2012 | Tesla T4
MXNet | Inception V3 | CNN | 1130 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 18.11-py3 | Mixed | 192 | ImageNet2012 | Tesla T4
MXNet | ResNet-50 | CNN | 416 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 18.11-py3 | Mixed | 96 | ImageNet2012 | Tesla T4
MXNet | ResNet-50 | CNN | 3212 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 18.11-py3 | Mixed | 96 | ImageNet2012 | Tesla T4
PyTorch | Inception V3 | CNN | 164 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 18.11-py3 | Mixed | 192 | ImageNet2012 | Tesla T4
PyTorch | Inception V3 | CNN | 1282 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 18.11-py3 | Mixed | 192 | ImageNet2012 | Tesla T4
PyTorch | ResNet-50 | CNN | 278 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 18.11-py3 | Mixed | 256 | ImageNet2012 | Tesla T4
PyTorch | ResNet-50 | CNN | 2179 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 18.11-py3 | Mixed | 256 | ImageNet2012 | Tesla T4
TensorFlow | GoogleNet | CNN | 554 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 18.11-py3 | Mixed | 512 | ImageNet2012 | Tesla T4
TensorFlow | GoogleNet | CNN | 4269 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 18.11-py3 | Mixed | 512 | ImageNet2012 | Tesla T4
TensorFlow | Inception V3 | CNN | 170 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 18.11-py3 | Mixed | 192 | ImageNet2012 | Tesla T4
TensorFlow | Inception V3 | CNN | 1026 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 18.11-py3 | Mixed | 128 | ImageNet2012 | Tesla T4
TensorFlow | ResNet-50 | CNN | 289 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 18.11-py3 | Mixed | 256 | ImageNet2012 | Tesla T4
TensorFlow | ResNet-50 | CNN | 2205 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 18.11-py3 | Mixed | 256 | ImageNet2012 | Tesla T4
MXNet | OpenNMT | RNN | 13650 total tokens/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 18.11-py3 | FP32 | 16384 | WMT15 German-English | Tesla T4
MXNet | OpenNMT | RNN | 97003 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 18.11-py3 | FP32 | 16384 | WMT15 German-English | Tesla T4
PyTorch | GNMT | RNN | 20139 total tokens/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 18.09-py3 | Mixed | 192 | WMT16 English-German | Tesla T4
PyTorch | GNMT | RNN | 50925 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 18.11-py3 | FP32 | 128 | WMT16 English-German | Tesla T4

 

NVIDIA Deep Learning Inference Performance

NVIDIA® TensorRT™ running on NVIDIA GPUs enables the most efficient deep learning inference performance across multiple application areas and models. This versatility gives data scientists wide latitude to create the optimal low-latency solution. Visit NVIDIA GPU Cloud (NGC) to download any of these containers.

NVIDIA Tesla® V100 Tensor Core GPUs leverage mixed precision to combine high throughput with low latency across every type of neural network. The Tesla T4 is an inference GPU designed for optimal power consumption and latency in ultra-efficient scale-out servers. Read the inference whitepaper to learn more about NVIDIA’s inference platform.
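The sketch below shows roughly how a reduced-precision TensorRT engine might be built from an ONNX model. It is illustrative only: the Python API shown follows TensorRT releases newer than the TensorRT 5.0 used for the measurements below (TensorRT 5 exposed a somewhat different builder interface), the model file name is a placeholder, and a real INT8 build additionally needs calibration data, which is omitted here.

```python
# Sketch: build a reduced-precision TensorRT engine from an ONNX model.
# The API follows recent TensorRT Python releases, not the TensorRT 5.0 used for the
# numbers below; "resnet50.onnx" is a placeholder, and a real INT8 build would also
# need a calibrator or explicit dynamic ranges (omitted here).
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("resnet50.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(f"ONNX parse failed: {parser.get_error(0)}")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)    # enable FP16 Tensor Core kernels
# config.set_flag(trt.BuilderFlag.INT8)  # INT8 path, given calibration data

serialized_engine = builder.build_serialized_network(network, config)
with open("resnet50.plan", "wb") as f:
    f.write(serialized_engine)           # deserialize with trt.Runtime at inference time
```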

Measuring inference performance involves balancing many variables. PLASTER is an acronym that describes the key elements for measuring deep learning inference performance. Each letter identifies a factor (Programmability, Latency, Accuracy, Size of Model, Throughput, Energy Efficiency, Rate of Learning) that must be considered to arrive at the right set of tradeoffs and to produce a successful deep learning implementation. Refer to NVIDIA’s PLASTER whitepaper for more details. Two of these factors, latency and throughput, are tied together by batch size in the tables that follow, as illustrated in the sketch below.
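At a fixed batch size, per-batch latency in milliseconds is roughly 1000 × batch size / throughput. The short check below applies that relationship to a few ResNet-50 values taken from the T4 inference table further down; it is a sanity check on the published numbers, not part of the benchmark methodology.

```python
# Sanity check on the inference tables below: at a fixed batch size,
# latency (ms) is roughly 1000 * batch_size / throughput.
# The sample values are T4 ResNet-50 INT8 rows from the table further down;
# published latencies are rounded, so agreement is approximate.
rows = [
    # (batch size, images/sec, reported latency in ms)
    (1,   961,   1.0),
    (8,   3427,  2.3),
    (128, 4198, 31.0),
]
for batch, throughput, reported_ms in rows:
    implied_ms = 1000.0 * batch / throughput
    print(f"batch {batch:>3}: implied {implied_ms:5.1f} ms, reported {reported_ms} ms")
```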

Inference Image Classification on CNNs with TensorRT

ResNet-50 Throughput

DGX-1: 1x Tesla V100-SXM2-16GB, E5-2698 v4 2.2 GHz | TensorRT 5.0 | Batch Size = 128 | 18.11-py3 | Precision: Mixed | Dataset: Synthetic
Supermicro SYS-4029GP-TRT T4: 1x Tesla T4, Gold 6140 2.3 GHz | TensorRT 5.0 | Batch Size = 128 | 18.10-py3 | Precision: INT8 | Dataset: Synthetic

 
 

ResNet-50 Latency

DGX-1: 1x Tesla V100-SXM2-16GB, E5-2698 v4 2.2 GHz | TensorRT 5.0 | Batch Size = 1 | 18.09-py3 | Precision: INT8 | Dataset: Synthetic
Supermicro SYS-4029GP-TRT T4: 1x Tesla T4, Gold 6140 2.3 GHz | TensorRT 5.0 | Batch Size = 1 | 18.10-py3 | Precision: INT8 | Dataset: Synthetic

 
 

ResNet-50 Power Efficiency

DGX-1: 1x Tesla V100-SXM2-16GB, E5-2698 v4 2.2 GHz | TensorRT 5.0 | Batch Size = 128 | 18.11-py3 | Precision: Mixed | Dataset: Synthetic
Supermicro SYS-4029GP-TRT T4: 1x Tesla T4, Gold 6140 2.3 GHz | TensorRT 5.0 | Batch Size = 128 | 18.10-py3 | Precision: INT8 | Dataset: Synthetic

 

Inference Performance

V100 Inference Performance

Network | Network Type | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | GPU Version
ResNet-50 | CNN | 1 | 1143 images/sec | 8.6 images/sec/watt | 0.88 | 1x V100 | DGX-1 | 18.09-py3 | INT8 | Synthetic | V100-SXM2-16GB
ResNet-50 | CNN | 2 | 1544 images/sec | 11 images/sec/watt | 1.3 | 1x V100 | DGX-1 | 18.11-py3 | INT8 | Synthetic | V100-SXM2-16GB
ResNet-50 | CNN | 4 | 1989 images/sec | 12 images/sec/watt | 2 | 1x V100 | DGX-1 | 18.11-py3 | INT8 | Synthetic | V100-SXM2-16GB
ResNet-50 | CNN | 8 | 3247 images/sec | 18 images/sec/watt | 2.5 | 1x V100 | DGX-1 | 18.10-py3 | INT8 | Synthetic | V100-SXM2-16GB
ResNet-50 | CNN | 41 | 5925 images/sec | 21 images/sec/watt | 6.9 | 1x V100 | DGX-1 | 18.10-py3 | Mixed | Synthetic | V100-SXM2-16GB
ResNet-50 | CNN | 64 | 6506 images/sec | 24 images/sec/watt | 9.8 | 1x V100 | DGX-1 | 18.11-py3 | Mixed | Synthetic | V100-SXM2-16GB
ResNet-50 | CNN | 128 | 6954 images/sec | 24 images/sec/watt | 18 | 1x V100 | DGX-1 | 18.11-py3 | Mixed | Synthetic | V100-SXM2-16GB
GoogleNet | CNN | 1 | 1622 images/sec | 15 images/sec/watt | 0.62 | 1x V100 | DGX-1 | 18.09-py3 | INT8 | Synthetic | V100-SXM2-16GB
GoogleNet | CNN | 2 | 2142 images/sec | 18 images/sec/watt | 0.93 | 1x V100 | DGX-1 | 18.10-py3 | INT8 | Synthetic | V100-SXM2-16GB
GoogleNet | CNN | 4 | 3174 images/sec | 27 images/sec/watt | 1.3 | 1x V100 | DGX-1 | 18.11-py3 | INT8 | Synthetic | V100-SXM2-16GB
GoogleNet | CNN | 8 | 5262 images/sec | 34 images/sec/watt | 1.5 | 1x V100 | DGX-1 | 18.11-py3 | INT8 | Synthetic | V100-SXM2-32GB
GoogleNet | CNN | 64 | 11454 images/sec | 42 images/sec/watt | 5.6 | 1x V100 | DGX-1 | 18.11-py3 | INT8 | Synthetic | V100-SXM2-16GB
GoogleNet | CNN | 82 | 11916 images/sec | 41 images/sec/watt | 6.9 | 1x V100 | DGX-1 | 18.09-py3 | INT8 | Synthetic | V100-SXM2-16GB
GoogleNet | CNN | 128 | 12350 images/sec | 42 images/sec/watt | 10 | 1x V100 | DGX-1 | 18.10-py3 | INT8 | Synthetic | V100-SXM2-16GB
NMT | RNN | 1 | 3961 tokens/sec | 32 tokens/sec/watt | 13 | 1x V100 | DGX-1 | 18.09-py3 | Mixed | WMT16 English-German | V100-SXM2-16GB
NMT | RNN | 2 | 5711 tokens/sec | 48 tokens/sec/watt | 18 | 1x V100 | DGX-1 | 18.09-py3 | Mixed | WMT16 English-German | V100-SXM2-16GB
NMT | RNN | 4 | 9380 tokens/sec | 79 tokens/sec/watt | 22 | 1x V100 | DGX-1 | 18.09-py3 | Mixed | WMT16 English-German | V100-SXM2-16GB
NMT | RNN | 8 | 14452 tokens/sec | 119 tokens/sec/watt | 28 | 1x V100 | DGX-1 | 18.09-py3 | Mixed | WMT16 English-German | V100-SXM2-16GB
NMT | RNN | 64 | 55497 tokens/sec | 487 tokens/sec/watt | 59 | 1x V100 | DGX-1 | 18.09-py3 | Mixed | WMT16 English-German | V100-SXM2-16GB
NMT | RNN | 128 | 75217 tokens/sec | 1068 tokens/sec/watt | 87 | 1x V100 | DGX-1 | 18.09-py3 | Mixed | WMT16 English-German | V100-SXM2-16GB

 

T4 Inference Performance

Network | Network Type | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | GPU Version
ResNet-50 | CNN | 1 | 961 images/sec | 14 images/sec/watt | 1 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 18.10-py3 | INT8 | Synthetic | Tesla T4
ResNet-50 | CNN | 2 | 1404 images/sec | 21 images/sec/watt | 1.4 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 18.10-py3 | INT8 | Synthetic | Tesla T4
ResNet-50 | CNN | 4 | 2590 images/sec | 38 images/sec/watt | 1.5 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 18.11-py3 | INT8 | Synthetic | Tesla T4
ResNet-50 | CNN | 8 | 3427 images/sec | 50 images/sec/watt | 2.3 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 18.10-py3 | INT8 | Synthetic | Tesla T4
ResNet-50 | CNN | 16 | 3888 images/sec | 56 images/sec/watt | 4.1 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 18.10-py3 | INT8 | Synthetic | Tesla T4
ResNet-50 | CNN | 29 | 4092 images/sec | 60 images/sec/watt | 7.1 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 18.10-py3 | INT8 | Synthetic | Tesla T4
ResNet-50 | CNN | 128 | 4198 images/sec | 60 images/sec/watt | 31 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 18.10-py3 | INT8 | Synthetic | Tesla T4
GoogleNet | CNN | 1 | 1571 images/sec | 23 images/sec/watt | 0.64 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 18.10-py3 | INT8 | Synthetic | Tesla T4
GoogleNet | CNN | 2 | 2124 images/sec | 31 images/sec/watt | 0.94 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 18.11-py3 | INT8 | Synthetic | Tesla T4
GoogleNet | CNN | 4 | 3533 images/sec | 52 images/sec/watt | 1.1 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 18.11-py3 | INT8 | Synthetic | Tesla T4
GoogleNet | CNN | 8 | 4738 images/sec | 69 images/sec/watt | 1.7 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 18.11-py3 | INT8 | Synthetic | Tesla T4
GoogleNet | CNN | 40 | 5746 images/sec | 83 images/sec/watt | 7 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 18.11-py3 | INT8 | Synthetic | Tesla T4
GoogleNet | CNN | 64 | 5860 images/sec | 85 images/sec/watt | 11 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 18.11-py3 | INT8 | Synthetic | Tesla T4
GoogleNet | CNN | 128 | 6290 images/sec | 91 images/sec/watt | 20 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 18.11-py3 | INT8 | Synthetic | Tesla T4
NMT | RNN | 1 | 2505 tokens/sec | 38 tokens/sec/watt | 21 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 18.11-py3 | Mixed | WMT16 English-German | Tesla T4
NMT | RNN | 2 | 3737 tokens/sec | 57 tokens/sec/watt | 28 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 18.11-py3 | Mixed | WMT16 English-German | Tesla T4
NMT | RNN | 4 | 6168 tokens/sec | 97 tokens/sec/watt | 33 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 18.10-py3 | Mixed | WMT16 English-German | Tesla T4
NMT | RNN | 8 | 10189 tokens/sec | 159 tokens/sec/watt | 40 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 18.11-py3 | Mixed | WMT16 English-German | Tesla T4
NMT | RNN | 64 | 29012 tokens/sec | 646 tokens/sec/watt | 113 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 18.10-py3 | Mixed | WMT16 English-German | Tesla T4
NMT | RNN | 128 | 34435 tokens/sec | 628 tokens/sec/watt | 191 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 18.10-py3 | Mixed | WMT16 English-German | Tesla T4

 

Last updated: Dec 12, 2018