A Recurrent Neural Network (RNN) is a class of artificial neural network that has memory, in the form of feedback loops, that allows it to better recognize patterns in data. RNNs extend regular artificial neural networks by adding connections that feed the hidden layers of the network back into themselves; these are called recurrent connections. The recurrent connections give a recurrent network visibility of not just the current data sample it has been provided, but also its previous hidden state. A recurrent network with a feedback loop can be visualized as multiple copies of a neural network, with the output of one serving as an input to the next. Unlike traditional neural networks, recurrent nets use their understanding of past events to process the input vector rather than starting from scratch every time.
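The recurrence described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the tanh activation, the weight names (`W_xh`, `W_hh`, `b_h`), and the layer sizes are assumptions chosen for the example.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    """Run a simple tanh RNN over a sequence and return every hidden state."""
    h = np.zeros(W_hh.shape[0])          # initial hidden state
    states = []
    for x in xs:                         # one sequence element per time step
        # The new state depends on the current input AND the previous state.
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
D, H = 3, 4                              # illustrative input and hidden sizes
W_xh = rng.normal(scale=0.5, size=(H, D))
W_hh = rng.normal(scale=0.5, size=(H, H))
b_h = np.zeros(H)

seq_a = rng.normal(size=(5, D))
seq_b = seq_a.copy()
seq_b[0] += 1.0                          # change only the FIRST element

states_a = rnn_forward(seq_a, W_xh, W_hh, b_h)
states_b = rnn_forward(seq_b, W_xh, W_hh, b_h)

# Both sequences end with the same input, so any difference in the final
# hidden state comes from what the network "remembers" about earlier steps.
print(np.abs(states_a[-1] - states_b[-1]).max())
```

A feed-forward network given only the last element would produce identical outputs for both sequences; here the final states differ because the hidden state carries information forward.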

An RNN is particularly useful when a sequence of data is being processed to make a classification decision or regression estimate, but it can also be used on non-sequential data. Recurrent neural networks are typically used to solve tasks related to time series data. Applications of recurrent neural networks include natural language processing, speech recognition, machine translation, character-level language modeling, image classification, image captioning, stock prediction, and financial engineering. We can teach RNNs to learn and understand sequences of words. RNNs can also be used to generate sequences mimicking everything from Shakespeare to Linux source code to baby names.

Some types of recurrent neural networks have a memory that enables them to remember important events that happened many time steps in the past. What distinguishes sequence learning from other regression and classification tasks is the need for models, such as Long Short-Term Memory (LSTM) networks, that can learn temporal dependencies in the input data. This memory of past input is crucial for successful sequence learning.

To process sequential data (text, speech, video, etc.), we could feed the data vector to a regular neural network. This approach would, however, be limited by the fixed input vector size and by the possibility that important events in a sequence lie just outside the input window. Recurrent networks overcome these limitations.

We can feed recurrent nets data sequences of arbitrary length, one element of the sequence per time step. A video input to an RNN, for example, would be fed one frame at a time. Another example is binary addition, which could be done using either a regular feed-forward neural network or an RNN. For the feed-forward network, we would need to choose the maximum number of digits in each binary number in advance, and we would not be able to apply the knowledge learned about adding digits at the beginning of the vector to the digits at the end of the vector.
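The binary addition example can be made concrete with a hand-written recurrence. This is not a trained network; it is a sketch of the kind of per-step rule an RNN could learn, with the carry bit playing the role of the hidden state. The helper names (`add_step`, `recurrent_binary_add`, `to_bits`, `from_bits`) are hypothetical and exist only for this illustration.

```python
def add_step(a_bit, b_bit, carry):
    """One time step: consume one bit of each number and update the carry."""
    total = a_bit + b_bit + carry
    return total % 2, total // 2          # (output bit, new carry state)

def recurrent_binary_add(a_bits, b_bits):
    """Add two equal-length bit lists, least-significant bit first."""
    carry = 0                             # the state carried between steps
    out = []
    for a_bit, b_bit in zip(a_bits, b_bits):
        s, carry = add_step(a_bit, b_bit, carry)
        out.append(s)
    out.append(carry)                     # the final carry is the top bit
    return out

def to_bits(n, width):
    return [(n >> i) & 1 for i in range(width)]

def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))

# The same step function handles 4-bit and 12-bit inputs without any change.
print(from_bits(recurrent_binary_add(to_bits(5, 4), to_bits(6, 4))))          # 11
print(from_bits(recurrent_binary_add(to_bits(1000, 12), to_bits(2345, 12))))  # 3345
```

Because the same rule is applied at every position, nothing needs to be fixed in advance about the number of digits, and whatever is learned about adding one pair of digits applies to every other pair.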

Machine translation refers to the translation, using a machine, of a source sequence (sentence, paragraph, document) in one language to a corresponding target sequence or vector in another language. Since one source sentence can be translated in many different ways, the translation is essentially one-to-many, and the translation function is modeled as conditional rather than deterministic. In neural machine translation (NMT), we let a neural network learn how to do the translation from data rather than from a set of designed rules. Since we are dealing with sequential data where the context and order of words are important, the network of choice for NMT is a recurrent neural network. An NMT model can be augmented with a technique called attention, which helps the model focus on the important parts of the input and improves the prediction process.

Figure 1: NMT Model
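To give a feel for how attention lets a translation model focus on parts of the input, the sketch below computes soft attention weights for a single decoder step. It assumes dot-product scoring, which is only one of several common variants and is not specified by the text above; the array shapes and the names `encoder_states` and `decoder_state` are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                     # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def dot_product_attention(decoder_state, encoder_states):
    """Return (context vector, attention weights) for one decoder step."""
    scores = encoder_states @ decoder_state   # one score per source position
    weights = softmax(scores)                 # non-negative and sums to 1
    context = weights @ encoder_states        # weighted average of encoder states
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 8))  # 6 source positions, 8-dim states (assumed)
decoder_state = rng.normal(size=8)

context, weights = dot_product_attention(decoder_state, encoder_states)
print(weights.round(3))
```

Source positions with larger weights contribute more to the context vector that the decoder uses when predicting the next target word.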

Like multi-layer perceptrons and convolutional neural networks, recurrent neural networks can be trained using stochastic gradient descent (SGD), batch gradient descent, or mini-batch gradient descent. The only difference is in the back-propagation step that computes the weight updates for our slightly more complex network structure. After the prediction error is calculated in the forward pass through the network, the error gradient, starting at the last output neuron, is computed and back-propagated to the hidden units for that time step. This process is then repeated for each of the previous time steps in order. The gradients that back-propagate to the hidden units come from both the output neurons and the units in the hidden state one step ahead in the sequence. We call this process Backpropagation Through Time (BPTT).
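The BPTT procedure described above can be sketched for a small tanh RNN with a linear output at every time step. The squared-error loss, the absence of bias terms, and all names and sizes are assumptions made to keep the example short; a finite-difference check is included as a sanity test of the analytic gradient.

```python
import numpy as np

def forward(xs, ys, W_xh, W_hh, W_hy):
    """Forward pass; returns the summed squared-error loss and cached values."""
    h = np.zeros(W_hh.shape[0])
    hs, preds, loss = [h], [], 0.0        # hs[t] is the state BEFORE step t
    for x, y in zip(xs, ys):
        h = np.tanh(W_xh @ x + W_hh @ h)
        pred = W_hy @ h
        loss += 0.5 * np.sum((pred - y) ** 2)
        hs.append(h)
        preds.append(pred)
    return loss, hs, preds

def bptt(xs, ys, W_xh, W_hh, W_hy):
    """Gradients of the loss with respect to each weight matrix."""
    loss, hs, preds = forward(xs, ys, W_xh, W_hh, W_hy)
    dW_xh, dW_hh, dW_hy = (np.zeros_like(W) for W in (W_xh, W_hh, W_hy))
    dh_next = np.zeros(W_hh.shape[0])     # gradient arriving from step t + 1
    for t in reversed(range(len(xs))):    # walk backwards through time
        dpred = preds[t] - ys[t]          # error at this step's output
        dW_hy += np.outer(dpred, hs[t + 1])
        # Hidden gradient: from this step's output plus from the next time step.
        dh = W_hy.T @ dpred + dh_next
        dz = dh * (1.0 - hs[t + 1] ** 2)  # back through the tanh
        dW_xh += np.outer(dz, xs[t])
        dW_hh += np.outer(dz, hs[t])      # hs[t] is the previous hidden state
        dh_next = W_hh.T @ dz
    return loss, dW_xh, dW_hh, dW_hy

rng = np.random.default_rng(0)
xs, ys = rng.normal(size=(5, 3)), rng.normal(size=(5, 2))
W_xh = rng.normal(scale=0.5, size=(4, 3))
W_hh = rng.normal(scale=0.5, size=(4, 4))
W_hy = rng.normal(scale=0.5, size=(2, 4))

loss, dW_xh, dW_hh, dW_hy = bptt(xs, ys, W_xh, W_hh, W_hy)

# Finite-difference check on a single recurrent weight.
eps = 1e-6
W_plus, W_minus = W_hh.copy(), W_hh.copy()
W_plus[1, 2] += eps
W_minus[1, 2] -= eps
numeric = (forward(xs, ys, W_xh, W_plus, W_hy)[0]
           - forward(xs, ys, W_xh, W_minus, W_hy)[0]) / (2 * eps)
print(numeric, dW_hh[1, 2])
```

Note that the same weight matrices accumulate gradient contributions from every time step, which is the main structural difference from back-propagation in a feed-forward network.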

We can increase the number of neurons in the hidden layer, and we can stack multiple hidden layers to create a deep RNN architecture. Unfortunately, simple RNNs with many stacked layers can be brittle and difficult to train. This brittleness arises because the backpropagation of gradients within a neural network is a recursive multiplication process. This means that if the gradients are small they will shrink exponentially, and if they are large they will grow exponentially. These are called the "vanishing" and "exploding" gradient problems, respectively.
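The effect of recursive multiplication is easy to see with a scalar toy model. The sketch below assumes the gradient is scaled by the same constant factor at every step, which is a simplification; in a real network the factor is a Jacobian matrix that changes from step to step.

```python
def backpropagated_scale(factor, steps):
    """Size of a unit gradient after being multiplied by `factor` at each step."""
    grad = 1.0
    for _ in range(steps):
        grad *= factor
    return grad

for factor in (0.9, 1.0, 1.1):            # illustrative per-step factors
    print(factor, backpropagated_scale(factor, 100))
```

With these illustrative factors, 100 steps shrink a unit gradient to roughly 2.7e-5 when the factor is 0.9 and inflate it to roughly 1.4e4 when the factor is 1.1, even though both factors are close to 1.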

As detailed above, vanilla RNNs are difficult to train because the signal for a given input either decays or explodes as it cycles through the feedback loops.

LSTMs typically perform better than vanilla RNNs because they alleviate what is called the vanishing gradient problem, where the gradient gets smaller and smaller with each layer (or, in an unrolled RNN, each time step) until it is too small to affect the deepest layers.
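A single LSTM time step can be sketched as follows. This uses the commonly published formulation with input, forget, and output gates; the gate ordering inside the stacked weight matrix `W`, the single bias vector `b`, and the sizes are assumptions of this example and differ between libraries.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4*H, H + D) and b has shape (4*H,)."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    i = sigmoid(z[0:H])                   # input gate
    f = sigmoid(z[H:2 * H])               # forget gate
    o = sigmoid(z[2 * H:3 * H])           # output gate
    g = np.tanh(z[3 * H:4 * H])           # candidate cell update
    c = f * c_prev + i * g                # additive cell-state update
    h = o * np.tanh(c)                    # hidden state exposed to the next layer
    return h, c

rng = np.random.default_rng(0)
D, H = 3, 5                               # illustrative sizes
W = rng.normal(scale=0.1, size=(4 * H, H + D))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(7, D)):         # a 7-step input sequence
    h, c = lstm_step(x, h, c, W, b)
print(h.shape, c.shape)                   # (5,) (5,)
```

The additive update of the cell state `c` is the part usually credited with easing the vanishing gradient problem: when the forget gate is close to 1, information and gradients can pass along the cell state with little attenuation.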

Gated Recurrent Units (GRUs) are a simpler variation of LSTMs. They have fewer parameters, have no output gate, and combine the cell state with the hidden state. Consequently, GRUs are typically faster to train than LSTMs.
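For comparison, a GRU step and a rough parameter count can be sketched as below. The formulation follows one common convention (update gate `z`, reset gate `r`, a single bias per block); libraries differ in details such as how biases are split, so the counts are illustrative rather than exact for any particular implementation. The helper `gated_param_count` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU step; each W has shape (H, H + D) and each b has shape (H,)."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(W_z @ hx + b_z)           # update gate
    r = sigmoid(W_r @ hx + b_r)           # reset gate
    h_cand = np.tanh(W_h @ np.concatenate([r * h_prev, x]) + b_h)
    # A single state vector: there is no separate cell state or output gate.
    return (1.0 - z) * h_prev + z * h_cand

def gated_param_count(num_blocks, input_size, hidden_size):
    """Weights plus biases for `num_blocks` gate or candidate blocks."""
    return num_blocks * (hidden_size * (hidden_size + input_size) + hidden_size)

D, H = 10, 20                             # illustrative sizes
rng = np.random.default_rng(0)
Ws = [rng.normal(scale=0.1, size=(H, H + D)) for _ in range(3)]
bs = [np.zeros(H) for _ in range(3)]
h = gru_step(rng.normal(size=D), np.zeros(H), *Ws, *bs)

# A GRU has 3 blocks (update, reset, candidate); an LSTM has 4.
print(gated_param_count(3, D, H), gated_param_count(4, D, H))  # 1860 2480
```

Under these assumptions the GRU has three quarters of the LSTM's parameters for the same input and hidden sizes.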

Neural Turing Machines (NTMs) are recurrent neural networks coupled with external memory resources. NTMs can be thought of as extensions of NMT with a soft attention mechanism.

Bidirectional RNNs process the input with two recurrent nets: one reads the regular input sequence and the other reads the reversed input sequence. The outputs of the two networks are then concatenated.
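A bidirectional pass can be sketched with two independent simple RNNs. The tanh cell, the omission of biases, and the names `fwd` and `bwd` for the two parameter sets are assumptions made for this illustration.

```python
import numpy as np

def rnn_states(xs, W_xh, W_hh):
    """Hidden states of a simple tanh RNN for every step of `xs`."""
    h = np.zeros(W_hh.shape[0])
    out = []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h)
        out.append(h)
    return np.stack(out)

def bidirectional_states(xs, fwd_params, bwd_params):
    forward = rnn_states(xs, *fwd_params)
    # Run the second net on the reversed sequence, then flip its outputs back
    # so that row t of both arrays refers to the same input position.
    backward = rnn_states(xs[::-1], *bwd_params)[::-1]
    return np.concatenate([forward, backward], axis=1)

rng = np.random.default_rng(0)
T, D, H = 6, 3, 4                         # illustrative sizes
xs = rng.normal(size=(T, D))
fwd = (rng.normal(scale=0.5, size=(H, D)), rng.normal(scale=0.5, size=(H, H)))
bwd = (rng.normal(scale=0.5, size=(H, D)), rng.normal(scale=0.5, size=(H, H)))

out = bidirectional_states(xs, fwd, bwd)
print(out.shape)                          # (6, 8)
```

Each row of the result combines a summary of everything before that position (first `H` columns) with a summary of everything after it (last `H` columns).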

Compared to regular neural networks, recurrent neural networks have additional recurrent connections that enable them to remember previously processed information. These connections, however, make an RNN more computationally intensive to train. The parallel processing capabilities of GPUs can accelerate both the training and inference processes of RNNs. cuDNN is a GPU-accelerated library of primitives for deep neural networks that optimizes RNN performance on NVIDIA GPUs. TensorRT is a high-performance deep learning inference optimizer and runtime that delivers low-latency, high-throughput inference for deep learning applications. Both cuDNN and TensorRT are part of the NVIDIA Deep Learning SDK and support four RNN modes: simple RNN with ReLU activation function, simple RNN with tanh activation function, Gated Recurrent Units (GRU), and Long Short-Term Memory (LSTM).
