Deep Learning Training

NVIDIA’s complete solution stack, from GPUs and libraries to containers on NVIDIA GPU Cloud (NGC), allows data scientists to get up and running with deep learning quickly. NVIDIA® Tesla® V100 Tensor Core GPUs leverage mixed precision to accelerate deep learning training throughput across every framework and every type of neural network.
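
Mixed precision keeps a master copy of the weights in FP32 while computing activations and gradients in FP16; gradients too small for FP16 are preserved by scaling the loss before the backward pass and unscaling before the weight update. A minimal NumPy sketch of this idea (the gradient value and `loss_scale` constant are illustrative, not taken from any framework):

```python
import numpy as np

# A tiny gradient from backprop (illustrative value, below FP16's
# smallest subnormal of ~6e-8).
grad = np.float32(1e-8)

# Naive FP16 cast: the gradient underflows to zero and the update is lost.
assert np.float16(grad) == 0.0

# Loss scaling: multiply the loss (and hence every gradient) by a large
# constant before the FP16 cast, then divide it back out in FP32 before
# applying the weight update to the FP32 master weights.
loss_scale = np.float32(1024.0)
scaled_fp16 = np.float16(grad * loss_scale)       # now representable in FP16
unscaled = np.float32(scaled_fp16) / loss_scale   # recovered in FP32
assert scaled_fp16 != 0.0
print(unscaled)  # close to the original 1e-8
```

In practice frameworks choose the scale dynamically, but the underflow-and-rescale mechanics are the same.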

Training Image Classification on CNNs

ResNet-50 Throughput on NVIDIA Tesla V100

8x Tesla V100 32GB, E5-2698 v4 2.2 GHz  |  Mixed Precision  |  PyTorch uses 18.10-py3, all others use 18.09-py3  |  batch size = 256 for PyTorch, 512 for all others  |  ImageNet2012

ResNet-50 Throughput on NVIDIA Tesla T4

4x Tesla T4, Gold 6140 2.3 GHz  |  Mixed Precision  |  18.10-py3  |  batch size = 128  |  ImageNet2012

CNN Training Performance

GPU: TESLA V100
FRAMEWORK | NETWORK | THROUGHPUT | METRIC | BATCH SIZE | GPU | CONTAINER | PRECISION
MXnet | Inception V3 | 3,818 | images/sec | 512 | 8x V100 | 18.08-py3 | Mixed
MXnet | Resnet50 | 9,598 | images/sec | 512 | 8x V100 | 18.09-py3 | Mixed
PyTorch | Inception V3 | 3,909 | images/sec | 384 | 8x V100 | 18.10-py3 | Mixed
PyTorch | Resnet50 | 6,675 | images/sec | 256 | 8x V100 | 18.10-py3 | Mixed
TensorFlow | GoogleNet | 11,476 | images/sec | 1,024 | 8x V100 | 18.09-py3 | Mixed
TensorFlow | Inception V3 | 3,965 | images/sec | 384 | 8x V100 | 18.10-py3 | Mixed
TensorFlow | Resnet50 | 6,720 | images/sec | 512 | 8x V100 | 18.09-py3 | Mixed

GPU Memory: 32 GB  |  CPU: E5-2698 v4 2.2 GHz  |  ImageNet2012

GPU: TESLA T4
FRAMEWORK | NETWORK | THROUGHPUT | METRIC | BATCH SIZE | GPU | CONTAINER | PRECISION
MXnet | Inception V3 | 528 | images/sec | 192 | 4x T4 | 18.10-py3 | Mixed
MXnet | Resnet50 | 1,564 | images/sec | 128 | 4x T4 | 18.10-py3 | Mixed
PyTorch | Inception V3 | 294 | images/sec | 64 | 4x T4 | 18.10-py3 | Mixed
PyTorch | Resnet50 | 439 | images/sec | 128 | 4x T4 | 18.10-py3 | Mixed
TensorFlow | GoogleNet | 1,208 | images/sec | 256 | 4x T4 | 18.10-py3 | Mixed
TensorFlow | Inception V3 | 296 | images/sec | 64 | 4x T4 | 18.10-py3 | Mixed
TensorFlow | Resnet50 | 456 | images/sec | 128 | 4x T4 | 18.10-py3 | Mixed

CPU: Gold 6140 2.3 GHz  |  ImageNet2012

RNN Training Performance

GPU: TESLA V100
FRAMEWORK | NETWORK | THROUGHPUT | METRIC | BATCH SIZE | GPU | CONTAINER | DATASET
MXnet | Sockeye (OpenNMT) | 250,396 | total tokens/sec | 512 | 8x V100 | 18.07-py3 | WMT15 German-English
PyTorch | GNMT | 503,936 | total tokens/sec | 384 | 8x V100 | 18.09-py3 | WMT16 English-German

GPU Memory: 32 GB  |  CPU: E5-2698 v4 2.2 GHz  |  MXnet uses FP32, PyTorch uses mixed precision

GPU: TESLA T4
FRAMEWORK | NETWORK | THROUGHPUT | METRIC | BATCH SIZE | GPU | CONTAINER | DATASET
MXnet | OpenNMT | 49,884 | total tokens/sec | 16,384 | 4x T4 | 18.10-py3 | WMT15 German-English
PyTorch | GNMT | 24,118 | total tokens/sec | 128 | 4x T4 | 18.10-py3 | WMT16 English-German

CPU: Gold 6140 2.3 GHz  |  FP32 precision

NVIDIA Deep Learning Inference Performance

NVIDIA® TensorRT™ running on NVIDIA GPUs enables the most efficient deep learning inference performance across multiple application areas and models, giving data scientists wide latitude to create optimal low-latency solutions. Visit NVIDIA GPU Cloud (NGC) to download any of these containers.

NVIDIA Tesla® V100 Tensor Core GPUs leverage mixed precision to combine high throughput with low latency across every type of neural network. The Tesla T4 is an inference GPU designed for optimal power consumption and latency in ultra-efficient scale-out servers. Read the inference whitepaper to learn more about NVIDIA’s inference platform.

Measuring inference performance involves balancing many variables. PLASTER is an acronym that captures the key elements of deep learning performance measurement: Programmability, Latency, Accuracy, Size of Model, Throughput, Energy Efficiency, and Rate of Learning. Each factor must be weighed to arrive at the right set of tradeoffs and produce a successful deep learning implementation. Refer to NVIDIA’s PLASTER whitepaper for more details.

Inference Image Classification on CNNs with TensorRT

ResNet-50 Throughput

1x Tesla V100-SXM2-16GB, E5-2698 v4 2.2 GHz  |  18.09-py3  |  TensorRT 5.0  |  Mixed precision, batch size = 128  |  Synthetic dataset
1x Tesla T4, Gold 6140 2.3 GHz  |  18.10-py3  |  TensorRT 5.0  |  INT8 precision, batch size = 128  |  Synthetic dataset

ResNet-50 Latency

1x Tesla V100-SXM2-16GB, E5-2698 v4 2.2 GHz  |  18.09-py3  |  TensorRT 5.0  |  INT8 precision, batch size = 1  |  Synthetic dataset
1x Tesla T4, Gold 6140 2.3 GHz  |  18.10-py3  |  TensorRT 5.0  |  INT8 precision, batch size = 1  |  Synthetic dataset

ResNet-50 Power Efficiency

1x Tesla V100-SXM2-16GB, E5-2698 v4 2.2 GHz  |  18.09-py3  |  TensorRT 5.0  |  Mixed precision, batch size = 128  |  Synthetic dataset
1x Tesla T4, Gold 6140 2.3 GHz  |  18.10-py3  |  TensorRT 5.0  |  INT8 precision, batch size = 128  |  Synthetic dataset

CNN Inference Performance

GPU: TESLA V100
NETWORK | BATCH SIZE | THROUGHPUT (images/sec) | EFFICIENCY (images/sec/watt) | LATENCY (ms) | PRECISION | CONTAINER
Resnet50 | 1 | 1,143 | 8.6 | 0.88 | INT8 | 18.09-py3
Resnet50 | 2 | 1,538 | 11 | 1.3 | INT8 | 18.10-py3
Resnet50 | 4 | 1,975 | 12 | 2 | INT8 | 18.10-py3
Resnet50 | 8 | 3,247 | 18 | 2.5 | INT8 | 18.10-py3
Resnet50 | 41 | 5,925 | 21 | 6.9 | Mixed | 18.10-py3
Resnet50 | 64 | 6,050 | 21 | 11 | INT8 | 18.10-py3
Resnet50 | 128 | 6,359 | 22 | 20 | Mixed | 18.09-py3
GoogleNet | 1 | 1,622 | 15 | 0.62 | INT8 | 18.09-py3
GoogleNet | 2 | 2,142 | 18 | 0.93 | INT8 | 18.10-py3
GoogleNet | 4 | 3,110 | 26 | 1.3 | INT8 | 18.09-py3
GoogleNet | 8 | 5,239 | 35 | 1.5 | INT8 | 18.09-py3
GoogleNet | 64 | 11,453 | 41 | 5.6 | INT8 | 18.10-py3
GoogleNet | 82 | 11,916 | 41 | 6.9 | INT8 | 18.09-py3
GoogleNet | 128 | 12,350 | 42 | 10 | INT8 | 18.10-py3

GPU Memory: 16 GB  |  CPU: E5-2698 v4 2.2 GHz  |  TensorRT 5.0  |  Synthetic dataset
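
For the batched entries above, throughput and latency are linked by throughput ≈ batch size / latency: a batch of 128 images returned in 20 ms corresponds to roughly 6,400 images/sec, consistent with the 6,359 images/sec reported. A quick sanity check in Python (the 15% tolerance is an assumption; measured figures include per-batch overhead and rounding):

```python
# Sanity-check reported throughput against batch size / latency for a few
# Tesla V100 ResNet-50 rows above: (batch size, images/sec, latency in ms).
rows = [
    (1, 1143, 0.88),
    (8, 3247, 2.5),
    (64, 6050, 11),
    (128, 6359, 20),
]

for batch, reported, latency_ms in rows:
    implied = batch / (latency_ms / 1000.0)  # images/sec implied by the latency
    # Allow ~15% slack (assumed tolerance) for rounding and overhead.
    assert abs(implied - reported) / reported < 0.15
    print(f"batch {batch:4d}: implied {implied:8.0f} vs reported {reported}")
```

The same relation explains the table's shape: throughput saturates at large batches while latency keeps growing almost linearly.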

CNN Inference Performance

GPU: TESLA T4
NETWORK | BATCH SIZE | THROUGHPUT (images/sec) | EFFICIENCY (images/sec/watt) | LATENCY (ms) | PRECISION | CONTAINER
Resnet50 | 1 | 961 | 14 | 1.0 | INT8 | 18.10-py3
Resnet50 | 2 | 1,404 | 21 | 1.4 | INT8 | 18.10-py3
Resnet50 | 4 | 2,553 | 37 | 1.6 | INT8 | 18.10-py3
Resnet50 | 8 | 3,427 | 50 | 2.3 | INT8 | 18.10-py3
Resnet50 | 16 | 3,888 | 56 | 4.1 | INT8 | 18.10-py3
Resnet50 | 29 | 4,092 | 60 | 7.1 | INT8 | 18.10-py3
Resnet50 | 128 | 4,198 | 60 | 31 | INT8 | 18.10-py3
GoogleNet | 1 | 1,571 | 23 | 0.64 | INT8 | 18.10-py3
GoogleNet | 2 | 2,026 | 29 | 0.99 | INT8 | 18.10-py3
GoogleNet | 4 | 3,421 | 50 | 1.2 | INT8 | Pre-Release
GoogleNet | 8 | 3,507 | 51 | 2.3 | INT8 | 18.10-py3
GoogleNet | 32 | 5,755 | 85 | 5.6 | INT8 | Pre-Release
GoogleNet | 64 | 5,966 | 88 | 11 | INT8 | Pre-Release
GoogleNet | 128 | 5,989 | 87 | 21 | INT8 | Pre-Release

CPU: Gold 6140 2.3 GHz  |  TensorRT 5.0  |  Synthetic dataset
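
Dividing throughput by efficiency gives the implied board power. For the Tesla T4 ResNet-50 row at batch 128, 4,198 images/sec at 60 images/sec/watt implies roughly 70 W, which matches the T4's 70 W power envelope (the 70 W figure comes from NVIDIA's published T4 specifications, not from this table):

```python
# Implied power draw = throughput / efficiency for Tesla T4 ResNet-50 rows
# above: (batch size, images/sec, images/sec/watt).
t4_rows = [
    (8, 3427, 50),
    (128, 4198, 60),
]

for batch, throughput, efficiency in t4_rows:
    watts = throughput / efficiency
    print(f"batch {batch:3d}: ~{watts:.0f} W implied")

# Even at modest batch sizes the implied draw sits near the T4's 70 W
# envelope, which is why its images/sec/watt figures lead the V100's.
```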

RNN Inference Performance

GPU: TESLA V100
NETWORK | BATCH SIZE | THROUGHPUT (total tokens/sec) | EFFICIENCY (tokens/sec/watt) | LATENCY (ms) | PRECISION | CONTAINER
NMT | 1 | 3,961 | 32 | 13 | Mixed | 18.09-py3
NMT | 2 | 5,711 | 48 | 18 | Mixed | 18.09-py3
NMT | 4 | 9,380 | 79 | 22 | Mixed | 18.09-py3
NMT | 8 | 14,452 | 119 | 28 | Mixed | 18.09-py3
NMT | 64 | 55,497 | 487 | 59 | Mixed | 18.09-py3
NMT | 128 | 75,217 | 1,068 | 87 | Mixed | 18.09-py3

GPU Memory: 16 GB  |  CPU: E5-2698 v4 2.2 GHz  |  TensorRT 5.0  |  WMT16 English-German

RNN Inference Performance

GPU: TESLA T4
NETWORK | BATCH SIZE | THROUGHPUT (total tokens/sec) | EFFICIENCY (tokens/sec/watt) | LATENCY (ms) | PRECISION | CONTAINER
NMT | 1 | 2,430 | 37 | 21 | Mixed | 18.10-py3
NMT | 2 | 3,701 | 57 | 28 | Mixed | 18.10-py3
NMT | 4 | 6,168 | 97 | 33 | Mixed | 18.10-py3
NMT | 8 | 10,101 | 161 | 41 | Mixed | 18.10-py3
NMT | 64 | 29,012 | 646 | 113 | Mixed | 18.10-py3
NMT | 128 | 34,435 | 628 | 191 | Mixed | 18.10-py3

CPU: Gold 6140 2.3 GHz  |  TensorRT 5.0  |  WMT16 English-German

Last updated: Nov 20, 2018