A Convolutional Neural Network (CNN) is a class of artificial neural network that uses convolutional layers to filter inputs for useful information. The convolution operation combines input data (a feature map) with a convolution kernel (a filter) to form a transformed feature map. The filters in the convolutional layers (conv layers) are learned parameters, adjusted during training to extract the information most useful for a specific task. In this way, convolutional networks adapt automatically to find the best features for the task. For example, a CNN would filter for information about the shape of an object when confronted with a general object recognition task, but would extract the color of a bird when faced with a bird recognition task. This reflects what the CNN learns from its training data: different classes of objects have different shapes, but different types of birds are more likely to differ in color than in shape.
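The convolution operation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not how a framework implements it: the function name `conv2d_valid`, the 4×4 input, and the hand-picked edge kernel are all assumptions made for the example (in a real CNN the kernel values would be learned). As in most deep learning libraries, the kernel is not flipped, so strictly speaking this computes a cross-correlation.

```python
def conv2d_valid(feature_map, kernel):
    """Slide `kernel` over `feature_map` with stride 1 and no padding.

    Each output value is the sum of element-wise products between the
    kernel and the input patch underneath it.
    """
    in_h, in_w = len(feature_map), len(feature_map[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = in_h - k_h + 1, in_w - k_w + 1
    output = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            total = 0.0
            for di in range(k_h):
                for dj in range(k_w):
                    total += feature_map[i + di][j + dj] * kernel[di][dj]
            output[i][j] = total
    return output


# Hypothetical 4x4 input: dark left half, bright right half.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked kernel that responds where intensity increases left to right.
vertical_edge_kernel = [
    [-1, 1],
    [-1, 1],
]
edge_map = conv2d_valid(image, vertical_edge_kernel)
```

The resulting 3×3 feature map is non-zero only in its middle column, which is where the vertical edge sits in the input.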

Applications of Convolutional Neural Networks include various image (image recognition, image classification, video labeling, text analysis) and speech (speech recognition, natural language processing, text classification) processing systems, along with state-of-the-art AI systems such as robots, virtual assistants, and self-driving cars.

Components of a Convolutional Neural Network

Convolutional networks are composed of an input layer, an output layer, and one or more hidden layers. A convolutional network differs from a regular neural network in that the neurons in its layers are arranged in three dimensions (width, height, and depth). This allows the CNN to transform a three-dimensional input volume into an output volume. The hidden layers are a combination of convolution layers, pooling layers, normalization layers, and fully connected layers. CNNs use multiple conv layers to filter input volumes to greater levels of abstraction.

CNNs improve their detection capability for unusually placed objects by using pooling layers for limited translation and rotation invariance. Pooling also allows for the use of more convolutional layers by reducing memory consumption. Normalization layers are used to normalize over local input regions by moving all inputs in a layer towards a mean of zero and a variance of one. Other regularization techniques such as batch normalization, where we normalize across the activations for the entire batch, or dropout, where we ignore randomly chosen neurons during the training process, can also be used. Fully connected layers have neurons that are functionally similar to those in convolutional layers (both compute dot products) but differ in that they are connected to all activations in the previous layer.
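As a rough sketch of two of the techniques mentioned above, the snippet below standardizes a list of values to approximately zero mean and unit variance, and applies dropout to a list of activations. The function names, the `eps` constant, and the choice of "inverted" dropout (rescaling the surviving activations so their expected value is unchanged) are assumptions for illustration, not details taken from this article.

```python
import math
import random


def standardize(values, eps=1e-5):
    """Shift and scale values to roughly zero mean and unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    # eps (an assumed small constant) avoids division by zero.
    return [(v - mean) / math.sqrt(variance + eps) for v in values]


def dropout(activations, drop_prob, rng):
    """Zero each activation with probability `drop_prob` (training only).

    Survivors are divided by the keep probability ("inverted dropout",
    an assumption here) so the expected activation stays the same.
    """
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


normalized = standardize([2.0, 4.0, 6.0, 8.0])
dropped = dropout([1.0] * 8, drop_prob=0.5, rng=random.Random(0))
```

At inference time dropout would simply be skipped, so every neuron contributes to the prediction.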

More recent CNNs use inception modules which use 1×1 convolutional kernels to reduce the memory consumption further while allowing for more efficient computation (and thus training). This makes CNNs suitable for a number of machine learning applications.

Figure 1: An input image of a traffic sign is filtered by four 5×5 convolutional kernels, which create four feature maps; these feature maps are subsampled by max pooling. The next layer applies ten 5×5 convolutional kernels to these subsampled images, and again the feature maps are pooled. The final layer is a fully connected layer where all generated features are combined and used in the classifier (essentially logistic regression). Image by Maurice Peemen.
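To make the layer sequence in Figure 1 concrete, the sketch below traces how the spatial size of the feature maps might shrink from layer to layer. The caption does not give the input resolution or the pooling window, so the 32×32 input and the 2×2 non-overlapping pooling used here are assumptions, as are the helper names; only the kernel size (5×5) and kernel counts (4 and 10) come from the caption.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial size after a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


def pool_output_size(input_size, window, stride=None):
    """Spatial size after pooling; stride defaults to the window size."""
    stride = window if stride is None else stride
    return (input_size - window) // stride + 1


input_size = 32   # assumed input resolution (not stated in the caption)
pool_window = 2   # assumed pooling window (not stated in the caption)

conv1 = conv_output_size(input_size, kernel_size=5)  # 4 feature maps
pool1 = pool_output_size(conv1, pool_window)
conv2 = conv_output_size(pool1, kernel_size=5)       # 10 feature maps
pool2 = pool_output_size(conv2, pool_window)

# Number of values handed to the fully connected layer.
flattened_features = 10 * pool2 * pool2
```

Under these assumptions the fully connected layer sees a few hundred features rather than the full input image, which is the abstraction-by-stages idea the figure illustrates.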


An activation function in a neural network applies a non-linear transformation to weighted input data. A popular activation function for CNNs is ReLU, the rectified linear function, which zeros out negative inputs and is represented as f(x) = max(0, x). The rectified linear function speeds up training while not compromising significantly on accuracy.
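The rectified linear function, f(x) = max(0, x), is simple enough to write directly. The sketch below also includes its gradient, which is 1 for positive inputs and 0 for negative ones; the function names are illustrative, and treating the gradient at exactly zero as 0 is a common convention assumed here.

```python
def relu(x):
    """Rectified linear unit: pass positive inputs, zero out negative ones."""
    return max(0.0, x)


def relu_grad(x):
    """Gradient of ReLU; the value at x == 0 is a convention (0 here)."""
    return 1.0 if x > 0 else 0.0


inputs = [-2.0, -0.5, 0.0, 1.5, 3.0]
activations = [relu(x) for x in inputs]
gradients = [relu_grad(x) for x in inputs]
```

Because the function involves only a comparison, both the forward and backward passes are cheap to compute.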


Inception modules in CNNs allow for deeper and larger conv layers while also speeding up computation. This is done by using 1×1 convolutions to reduce the number of feature maps. For example, 192 feature maps of size 28×28 can be reduced to 64 feature maps of size 28×28 through 64 1×1 convolutions. Because of the reduced size, these 1×1 convolutions can be followed by larger convolutions of size 3×3 and 5×5. In addition to 1×1 convolutions, max pooling may also be used to reduce dimensionality. At the output of an inception module, the outputs of all the large convolutions are concatenated into one big feature map, which is then fed into the next layer (or inception module).
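A quick count of multiplications shows why the 1×1 reduction helps. The sketch below compares applying a 5×5 convolution directly to 192 feature maps of size 28×28 against first reducing to 64 maps with 1×1 convolutions. The 192, 64, and 28×28 figures come from the example above; the choice of 32 output maps for the 5×5 convolution, the assumption that spatial size is preserved, and the helper name are illustrative.

```python
def conv_multiplies(height, width, in_channels, out_channels, kernel_size):
    """Multiplications for a stride-1 convolution that preserves height/width."""
    return height * width * out_channels * in_channels * kernel_size * kernel_size


HEIGHT = WIDTH = 28
OUT_CHANNELS_5X5 = 32  # assumed number of 5x5 filters, for illustration only

# 5x5 convolution applied directly to all 192 feature maps.
direct = conv_multiplies(HEIGHT, WIDTH, 192, OUT_CHANNELS_5X5, 5)

# 1x1 reduction from 192 to 64 maps, followed by the same 5x5 convolution.
reduce_1x1 = conv_multiplies(HEIGHT, WIDTH, 192, 64, 1)
after_reduce = conv_multiplies(HEIGHT, WIDTH, 64, OUT_CHANNELS_5X5, 5)
reduced = reduce_1x1 + after_reduce

savings_ratio = direct / reduced
```

Under these assumptions the reduced path needs well under half the multiplications of the direct path, even after paying for the extra 1×1 layer.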


Pooling is a procedure that reduces the input over a certain area to a single value (subsampling). In convolutional neural networks, this concentration of information provides similar information to outgoing connections with reduced memory consumption. Pooling provides basic invariance to rotations and translations and improves the object detection capability of convolutional networks. For example, a face in an image patch that is not in the center of the image but slightly translated can still be detected by the convolutional filters, because the pooling operation funnels the information into the right place. The larger the pooling area, the more information is condensed, which leads to slim networks that fit more easily into GPU memory. However, if the pooling area is too large, too much information is thrown away and predictive performance decreases.
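The following is a minimal sketch of max pooling, assuming non-overlapping square windows and an input whose height and width are divisible by the window size. The function name and the two 4×4 patches are made up for illustration; they show a single strong activation that moves by one pixel yet produces the same pooled output.

```python
def max_pool2d(feature_map, window=2):
    """Non-overlapping max pooling.

    Assumes the height and width of `feature_map` are divisible by `window`.
    """
    out_h = len(feature_map) // window
    out_w = len(feature_map[0]) // window
    return [
        [
            max(
                feature_map[i * window + di][j * window + dj]
                for di in range(window)
                for dj in range(window)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# The same strong activation, shifted by one pixel diagonally.
patch = [
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
shifted_patch = [
    [0, 0, 0, 0],
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
pooled = max_pool2d(patch)
pooled_shifted = max_pool2d(shifted_patch)
```

Both inputs pool to the same 2×2 map, which is the limited translation invariance described above. A shift that crosses a window boundary would change the output, so the invariance only holds for small displacements.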

CNN Training and Inference

Like multi-layer perceptrons and recurrent neural networks, convolutional neural networks can be trained using gradient-based optimization techniques. Stochastic, batch, or mini-batch gradient descent algorithms can be used to optimize the parameters of the neural network. Once the CNN has been trained, it can then be used for inference to predict outputs for a given input.
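The mini-batch gradient descent loop itself is independent of the model being trained. The sketch below uses a one-variable linear model as a stand-in, since its gradient can be written by hand; in a CNN the gradients would instead come from backpropagation through the layers. The function name, toy data, and hyperparameters (batch size, learning rate, epoch count) are all assumptions chosen for the example.

```python
import random


def minibatch_sgd(xs, ys, batch_size=4, learning_rate=0.2, epochs=500, seed=0):
    """Fit y = w * x + b by mini-batch gradient descent on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the data in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                error = (w * xs[i] + b) - ys[i]
                # Gradient of the mean squared error over this mini-batch.
                grad_w += 2.0 * error * xs[i] / len(batch)
                grad_b += 2.0 * error / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Toy, noise-free data generated from y = 2x + 1.
train_x = [i / 10 for i in range(10)]
train_y = [2.0 * x + 1.0 for x in train_x]

w, b = minibatch_sgd(train_x, train_y)

# "Inference": apply the trained parameters to a new input.
prediction = w * 0.25 + b
```

Setting `batch_size=1` gives stochastic gradient descent, and setting it to the dataset size gives batch gradient descent, so the three variants named above differ only in how many examples contribute to each update.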

Accelerating Convolutional Neural Networks using GPUs

Deep learning frameworks allow researchers to create and explore Convolutional Neural Networks (CNNs) and other Deep Neural Networks (DNNs) easily, while delivering the high speed needed for both experiments and industrial deployment. The NVIDIA Deep Learning SDK accelerates widely-used deep learning frameworks such as Caffe, CNTK, TensorFlow, Theano and Torch as well as many other machine learning applications. The deep learning frameworks run faster on GPUs and scale across multiple GPUs within a single node. To use the frameworks with GPUs for Convolutional Neural Network training and inference processes, NVIDIA provides cuDNN and TensorRT respectively. cuDNN and TensorRT provide highly tuned implementations for standard routines such as convolution, pooling, normalization, and activation layers.

Click here for a step-by-step installation and usage guide. A fast C++/CUDA implementation of convolutional neural networks can be found here.

Additional Resources

  1. “Deep Learning in a Nutshell: Core Concepts” Dettmers, Tim. Parallel For All. NVIDIA, 3 Nov 2015.
  2. “Understanding Convolution in Deep Learning” Dettmers, Tim. TD Blog, 26 Mar 2015.
  3. “Object Recognition with Neural Nets” Hinton, Geoffrey et al. Coursera, 5 Nov 2013.
  4. “Deep Learning” Nielsen, Michael. Neural Networks and Deep Learning online book, Dec 2017.
  5. “Convolutional Neural Networks for Visual Recognition” Li, Fei-Fei et al. Stanford University Courses, Spring 2017.
  6. “Going Deeper with Convolutions” Szegedy, Christian et al. CVPR, 2015.
  7. “Computer vision - pooling and subsampling” Larochelle, Hugo. Neural networks [9.5], 15 Nov 2013.