As we backpropagate through more and more RNN time steps, the gradient becomes weaker and weaker (the vanishing gradient problem).
To solve this: the LSTM! Long Short-Term Memory, whose cell state works something like a computer's RAM.
For each step t, there is a hidden state h_t and a cell state c_t
- both are vectors of the same length n
- the cell state stores long-term information
- the LSTM can insert, delete, or read information from the cell state
Which information is kept or erased is managed by three gates, each a vector of length n whose elements lie between 0 and 1 (0 = fully closed, 1 = fully open, values in between are partially open):
- Forget gate: controls what is kept vs. forgotten from the previous cell state
- Input gate: controls which parts of the new candidate content get written to the cell state
- Output gate: controls which parts of the cell state are read out into the hidden state
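
The three gates above can be sketched as one LSTM step in NumPy. This is a minimal illustration, not a production implementation; the stacking of all four weight blocks into single matrices W, U, b is an assumed convention for compactness:

```python
import numpy as np

def sigmoid(x):
    # squashes each element into (0, 1): a "soft" gate value
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b, n):
    """One LSTM time step.

    W (4n x d), U (4n x n), b (4n,) stack the parameters of the
    forget gate, input gate, output gate, and candidate content
    (assumed layout, for compactness).
    """
    z = W @ x_t + U @ h_prev + b
    f = sigmoid(z[0:n])            # forget gate: what to erase from c_prev
    i = sigmoid(z[n:2*n])          # input gate: what new content to write
    o = sigmoid(z[2*n:3*n])        # output gate: what to read out of the cell
    c_tilde = np.tanh(z[3*n:4*n])  # candidate new cell content
    c_t = f * c_prev + i * c_tilde # cell state: long-term memory
    h_t = o * np.tanh(c_t)         # hidden state: what this step exposes
    return h_t, c_t
```

Note that c_t is a sum of the (gated) old cell state and the (gated) new content; this additive path is what lets gradients flow back over many steps without vanishing as quickly as in a plain RNN.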