As we backpropagate through more and more RNN steps, the gradient becomes weaker and weaker (the vanishing gradient problem).
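A toy numeric sketch of why this happens (my own illustration, not from the source): backprop through the recurrence multiplies the gradient by roughly the same recurrent Jacobian at every step, so if its norm is below 1 the gradient shrinks geometrically.

```python
import numpy as np

# Assumed toy setup: a recurrent weight matrix W with spectral norm 0.5.
# Backprop through t steps multiplies the gradient by W.T each step,
# so its norm decays like 0.5**t.
W = 0.5 * np.eye(4)           # recurrent Jacobian stand-in (assumption)
grad = np.ones(4)             # gradient arriving at the last step
for t in range(50):
    grad = W.T @ grad         # one backprop step through the recurrence
print(np.linalg.norm(grad))   # geometrically tiny after 50 steps
```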

To solve this: LSTM! Long Short-Term Memory, whose cell state works like a computer's RAM.

For each step t, there is a hidden state h_t and a cell state c_t:

  • both are vectors of the same length n
  • the cell state stores long-term information
  • the LSTM can insert, delete or read information from the cell state
  • Whether information is kept or not is managed by three gates, each a vector of length n whose elements lie between 0 (closed) and 1 (open), computed dynamically at each step
    • Forget gate: controls what is kept vs. forgotten from the previous cell state
    • Input gate: controls which parts of the new candidate content are written to the cell state
    • Output gate: controls which parts of the cell state are read out into the hidden state
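The gating mechanism above can be sketched as a single LSTM step in NumPy. This is a minimal illustration, not a library implementation; the parameter names (Wf, Wi, Wo, Wc, and biases) are my own assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step. p is a dict of weights/biases (assumed names)."""
    z = np.concatenate([h_prev, x])
    f = sigmoid(p["Wf"] @ z + p["bf"])        # forget gate: what to erase from the cell
    i = sigmoid(p["Wi"] @ z + p["bi"])        # input gate: what new content to write
    o = sigmoid(p["Wo"] @ z + p["bo"])        # output gate: what to expose to h_t
    c_tilde = np.tanh(p["Wc"] @ z + p["bc"])  # candidate cell content
    c = f * c_prev + i * c_tilde              # update the long-term cell state
    h = o * np.tanh(c)                        # hidden state is a gated read of the cell
    return h, c

# Tiny usage example with random weights (n = 3 hidden units, 2 inputs)
rng = np.random.default_rng(0)
n, d = 3, 2
p = {k: rng.standard_normal((n, n + d)) * 0.1 for k in ("Wf", "Wi", "Wo", "Wc")}
p.update({k: np.zeros(n) for k in ("bf", "bi", "bo", "bc")})
h, c = lstm_step(rng.standard_normal(d), np.zeros(n), np.zeros(n), p)
```

Note how all three gates use a sigmoid, so each element is between 0 and 1, matching the "closed/open" intuition above.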