GPU Accelerated Deep Learning

The NVIDIA CUDA® Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. cuDNN provides highly tuned implementations for standard routines such as forward and backward convolution, pooling, normalization, and activation layers. cuDNN is part of the NVIDIA Deep Learning SDK.
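To make concrete what a "forward convolution" primitive computes, here is a minimal pure-Python sketch of valid-mode 2-D cross-correlation, the variant most deep learning frameworks call "convolution". The function name and structure are ours for illustration only; cuDNN's actual implementations are heavily tuned GPU kernels with a C API.

```python
def corr2d(x, w):
    """Valid-mode 2-D cross-correlation of input x with filter w.

    x and w are row-major lists of lists. The output has shape
    (H - KH + 1) x (W - KW + 1), i.e. the 'valid' output size.
    """
    H, W = len(x), len(x[0])
    KH, KW = len(w), len(w[0])
    out = []
    for i in range(H - KH + 1):
        row = []
        for j in range(W - KW + 1):
            # Slide the filter over the input and accumulate the dot product.
            s = 0.0
            for di in range(KH):
                for dj in range(KW):
                    s += x[i + di][j + dj] * w[di][dj]
            row.append(s)
        out.append(row)
    return out
```

For example, correlating a 3x3 input with a 2x2 filter yields a 2x2 output. In cuDNN, the same computation (and its data/filter gradients for the backward pass) is selected among several algorithms tuned per GPU architecture.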

Deep learning researchers and framework developers worldwide rely on cuDNN for high-performance GPU acceleration. It allows them to focus on training neural networks and developing software applications rather than spending time on low-level GPU performance tuning. cuDNN accelerates widely used deep learning frameworks, including Caffe, TensorFlow, Theano, Torch, and CNTK. See supported frameworks for more details.




cuDNN is freely available to members of the Accelerated Computing Developer Program


Data scientists and researchers can take advantage of cuDNN by downloading one of the deep learning frameworks it accelerates, or by using NVIDIA DIGITS. DIGITS lets you interactively manage data, train on multiple GPUs, and export the best-performing model for deployment without writing code.

What’s New in cuDNN 5

cuDNN 5.1

  • 2.7x faster training of networks with 3x3 convolutions, such as VGG

cuDNN 5.0

  • LSTM recurrent neural networks deliver up to 6x speedup in Torch.
  • Perform training up to 44% faster on a single Pascal GPU.
  • Accelerate networks with 3x3 convolutions, such as VGG, GoogLeNet, and ResNets.
  • Improve performance and reduce memory usage with FP16 routines on Pascal GPUs.

Benchmark: cuDNN 4 + K40 vs. cuDNN 5.1 + M40 on Torch; CPU: single-socket 16-core Intel Xeon E5-2698 v3 (Haswell) @ 2.3 GHz (3.6 GHz Turbo)

Visit the What’s New page to explore top features from previous releases of cuDNN.

Key Features

  • Forward and backward paths for many common layer types, such as pooling, LRN, LCN, batch normalization, ReLU, sigmoid, softmax, and tanh
  • Forward and backward convolution routines, including cross-correlation, designed for convolutional neural nets
  • Recurrent Neural Networks (LSTM/GRU/RNN) that deliver up to 6x speedup in Torch
  • Arbitrary dimension ordering, striding, and sub-regions for 4D tensors, enabling easy integration into any neural net implementation
  • Tensor transformation functions
  • Context-based API allows for easy multithreading

cuDNN is supported on Windows, Linux, and macOS systems with Pascal, Kepler, Maxwell, Tegra K1, or Tegra X1 GPUs.
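The "arbitrary dimension ordering and striding" mentioned in the feature list can be sketched as a flat-index computation: a 4D tensor coordinate maps to a memory offset through per-dimension strides. The helper names below are ours; cuDNN expresses the same idea through tensor descriptors that carry explicit strides.

```python
def nchw_strides(C, H, W):
    """Strides of a fully packed NCHW tensor with C channels and
    HxW spatial dimensions (innermost dimension W has stride 1)."""
    return (C * H * W, H * W, W, 1)

def flat_index(n, c, h, w, strides):
    """Map an (n, c, h, w) coordinate to a flat memory offset using
    explicit per-dimension strides."""
    sn, sc, sh, sw = strides
    return n * sn + c * sc + h * sh + w * sw
```

Because the strides are explicit rather than implied by a fixed layout, the same indexing works for other orderings (e.g. NHWC) and for sub-regions of a larger tensor, which is what lets cuDNN integrate into frameworks with differing memory layouts.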

AlexNet training throughput. CPU: 1x E5-2680 v3 (12-core, 2.5 GHz), 128 GB system memory, Ubuntu 14.04; M40 bar: 8x M40 GPUs in a node; P100 bar: 8x NVLink-enabled P100 GPUs.

We are amazed by the steady stream of improvements made to the NVIDIA Deep Learning SDK and the speedups that they deliver. This new version of the SDK significantly improves our convolution algorithms, and goes so far as to accelerate 3D convolution by a factor of 3x! On top of that, we are excited about their decision to provide tools for other models such as LSTM, RNN and GRU in this new version.

Frédéric Bastien, Team Lead - Software Infrastructure at MILA

Learn More