Long Short-Term Memory (LSTM)

Long short-term memory (LSTM) is a type of recurrent neural network (RNN) specially designed to prevent the network's output for a given input from either decaying or exploding as it cycles through the feedback loops. These feedback loops are what allow recurrent networks to recognize patterns in sequences better than networks without them. Memory of past input is critical for solving sequence learning tasks, and LSTM networks perform better than other RNN architectures on such tasks by alleviating what is called the vanishing gradient problem.

Because they can learn long-term dependencies, LSTMs are applicable to a number of sequence learning problems, including language modeling and translation, acoustic modeling of speech, speech synthesis, speech recognition, audio and video data analysis, handwriting recognition and generation, sequence prediction, and protein secondary structure prediction.

Long Short-Term Memory Architecture

The LSTM architecture consists of linear units with a self-connection that has a constant weight of 1.0. This allows a value (forward pass) or gradient (backward pass) that flows into this self-recurrent unit to be preserved and subsequently retrieved at the required time step. With a multiplier of one, the output or error of the previous time step is the same as that of the next time step. This self-recurrent unit, the memory cell, is capable of storing information that lies dozens of time steps in the past, which is very powerful for many tasks. For example, for text data, an LSTM unit can store information contained in the previous paragraph and apply it to a sentence in the current paragraph.
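The effect of the constant self-connection weight can be illustrated in a few lines of Python. This is only a minimal sketch: the function name `recirculate` and the comparison weights 0.9 and 1.1 are illustrative assumptions, chosen to contrast with the weight of 1.0 described above.

```python
def recirculate(value, weight, steps):
    """Pass a value through a linear self-connection `steps` times."""
    for _ in range(steps):
        value = weight * value  # linear unit: no squashing nonlinearity
    return value

# A stored value of 1.0 after 50 time steps, for three self-connection weights.
preserved = recirculate(1.0, 1.0, 50)  # weight of exactly 1.0: value is unchanged
decayed = recirculate(1.0, 0.9, 50)    # weight below 1.0: value shrinks toward 0
exploded = recirculate(1.0, 1.1, 50)   # weight above 1.0: value keeps growing

print(preserved, decayed, exploded)
```

Only the weight of exactly 1.0 returns the stored value intact, which is why the memory cell can hold information across many time steps.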

Figure 1: A Long Short-Term Memory (LSTM) unit. The LSTM unit has four input weights (from the data to the input and three gates) and four recurrent weights (from the output to the input and the three gates). Peepholes are extra connections between the memory cell and the gates, but they do not increase the performance by much and are often omitted for simplicity. Image by Klaus Greff and colleagues as published in LSTM: A Search Space Odyssey.

Bidirectional LSTMs train two LSTMs on the input sequence: one on the regular input sequence and the other on the reversed input sequence. This can improve LSTM network performance by allowing future data to provide context for past data in a time series. These LSTM networks can better address complex sequence learning problems than simple feed-forward networks.
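The bidirectional arrangement can be sketched as two passes over the same sequence whose per-step results are concatenated. The helper names `run_recurrent` and `bidirectional` are hypothetical, and a running sum stands in for a real LSTM step so that the result is easy to check by hand; in practice each direction would be a separate LSTM with its own weights.

```python
import numpy as np

def run_recurrent(inputs, step, state_size):
    """Apply `step` along the sequence and collect the state at every time step."""
    state = np.zeros(state_size)
    states = []
    for x in inputs:
        state = step(x, state)
        states.append(state)
    return np.stack(states)

def bidirectional(inputs, forward_step, backward_step, state_size):
    """Concatenate a left-to-right pass with a right-to-left pass."""
    fwd = run_recurrent(inputs, forward_step, state_size)
    # Run on the reversed sequence, then flip back so the time steps line up.
    bwd = run_recurrent(inputs[::-1], backward_step, state_size)[::-1]
    return np.concatenate([fwd, bwd], axis=1)

# Hypothetical stand-in for an LSTM step: a running sum of the inputs.
def running_sum(x, state):
    return state + x

sequence = np.array([[1.0], [2.0], [3.0]])
outputs = bidirectional(sequence, running_sum, running_sum, state_size=1)
print(outputs)
```

For each time step, the first column summarizes the inputs seen so far and the second column summarizes the inputs still to come, which is the sense in which future data provides context for past data.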

Vanishing Gradient Problem in LSTMs

A simple LSTM model has only a single hidden LSTM layer, while a stacked LSTM model (needed for advanced applications) has multiple hidden LSTM layers. A common problem in deep networks is the “vanishing gradient” problem, where the gradient gets smaller and smaller with each layer until it is too small to affect the deepest layers. In a recurrent network the same effect occurs across time steps, because the network unrolled over a sequence is as deep as the sequence is long. With the memory cell in LSTMs, we have continuous gradient flow (errors maintain their value), which largely mitigates the vanishing gradient problem and enables learning from sequences that are hundreds of time steps long.
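The difference in gradient flow can be made concrete with a scalar example. The sketch below assumes a one-unit tanh recurrent network with an arbitrary recurrent weight of 0.9 and a constant input of 0.5, and compares it with a memory cell whose self-connection weight is 1.0; the function names are illustrative only.

```python
import math

def tanh_rnn_gradient(weight, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(weight * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x in inputs:
        h = math.tanh(weight * h + x)
        grad *= weight * (1.0 - h * h)  # chain rule: tanh'(a) = 1 - tanh(a)^2
    return grad

def memory_cell_gradient(inputs):
    """Return d c_T / d c_0 for the linear cell c_t = 1.0 * c_{t-1} + x_t."""
    grad = 1.0
    for _ in inputs:
        grad *= 1.0  # the self-connection contributes a factor of exactly 1
    return grad

inputs = [0.5] * 100  # 100 time steps of a constant, arbitrary input
rnn_grad = tanh_rnn_gradient(0.9, inputs)
cell_grad = memory_cell_gradient(inputs)
print(rnn_grad, cell_grad)
```

After 100 steps the gradient of the tanh network is vanishingly small, while the gradient through the memory cell is still 1. In a full LSTM the forget gate scales this factor, so the gradient is preserved only to the extent that the gate stays close to 1.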

Memory Gates in LSTMs

There are instances when we would want to throw away information in the memory cell, or cell state, and replace it with newer, more relevant information. At the same time, we do not want to confuse the rest of the recurrent net by releasing unnecessary information into the network. To solve this problem, the LSTM unit has a forget gate which deletes the information in the self-recurrent unit, making room for a new memory. It does so without releasing the information into the network, avoiding possible confusion. The forget gate does this by multiplying the value of the memory cell by a number between 0 (delete) and 1 (keep everything). The exact value is determined by the current input and the LSTM unit output of the previous time step.

At other times, the memory cell contains a value that needs to be preserved for many time steps. To do this, the LSTM model adds another gate, the input gate or write gate, which can be closed so that no new information flows into the memory cell (see Figure 1). This way, the data in the memory cell is protected until it is needed.

Another gate manipulates the output from the memory cell by multiplying the output of the memory cell by a number between 0 (no outputs) and 1 (preserve output) (see Figure 1). This output gate may be useful if multiple memories compete against each other.
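The three gates described above can be combined into a single time step. The following NumPy sketch follows the common peephole-free formulation, with four input weight matrices and four recurrent weight matrices as in Figure 1. The names `lstm_step`, `W`, `R` and `b`, the use of tanh for the block input and the cell output, the random weights, and the bias offsets of 50 are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, R, b):
    """One LSTM time step without peepholes.

    W, R and b are dicts keyed by 'z' (block input), 'i' (input gate),
    'f' (forget gate) and 'o' (output gate).
    """
    pre = {k: W[k] @ x + R[k] @ h_prev + b[k] for k in "zifo"}
    z = np.tanh(pre["z"])   # candidate values to write into the cell
    i = sigmoid(pre["i"])   # input gate: 0 = block the write, 1 = allow it
    f = sigmoid(pre["f"])   # forget gate: 0 = delete, 1 = keep everything
    o = sigmoid(pre["o"])   # output gate: 0 = no output, 1 = preserve output
    c = f * c_prev + i * z  # new memory cell value
    h = o * np.tanh(c)      # gated output of the unit
    return h, c

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
W = {k: rng.normal(size=(n_hidden, n_in)) for k in "zifo"}
R = {k: rng.normal(size=(n_hidden, n_hidden)) for k in "zifo"}
b = {k: np.zeros(n_hidden) for k in "zifo"}

# Force the gates through large biases: forget gate open, input gate closed,
# so the memory cell should be protected from the new input.
b["f"] += 50.0
b["i"] -= 50.0

c0 = np.array([0.5, -1.0, 2.0, 0.0])
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hidden), c0, W, R, b)
print(c)
```

With the forget gate saturated near 1 and the input gate near 0, the cell value after the step matches the value before it, regardless of the input. In a trained network these gate values are not fixed by hand but computed from the current input and the previous output.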

Accelerating Long Short-Term Memory using GPUs

The parallel processing capabilities of GPUs can accelerate LSTM training and inference. GPUs are the de facto standard for LSTM usage and deliver a 6x speedup during training and 140x higher throughput during inference when compared to CPU implementations. cuDNN is a GPU-accelerated deep neural network library that supports training of LSTM recurrent neural networks for sequence learning. TensorRT is a deep learning model optimizer and runtime that supports inference of LSTM recurrent neural networks on GPUs. Both cuDNN and TensorRT are part of the NVIDIA Deep Learning SDK.

Additional Resources

  1. “Deep Learning in a Nutshell: Sequence Learning” Dettmers, Tim. Parallel For All. NVIDIA, 7 Mar 2016.
  2. “LSTM: A Search Space Odyssey” Greff, Klaus et al. Transactions on Neural Networks and Learning Systems. IEEE, 10 Oct 2017.
  3. “Long Short-Term Memory” Hochreiter, Sepp and Schmidhuber, Jürgen. Neural Computation, 15 Nov 1997.
  4. “Optimizing Recurrent Neural Networks in cuDNN 5” Appleyard, Jeremy. Parallel For All. NVIDIA, 6 Apr 2016.
  5. “Understanding LSTM Networks” Olah, Christopher. Colah’s Blog, 27 Aug 2015.