- Language modeling: predicting what word comes next
- LM is a benchmark task that helps us measure our progress on predicting language use
- Pre-DL: n-grams
- The standard evaluation metric is perplexity (see the sketch after this list)
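A minimal sketch of how perplexity could be computed from per-token probabilities; the `probs` array here is a made-up model output, not from the notes:

```python
import numpy as np

# Hypothetical probabilities the model assigns to the correct next word
# at each position of a held-out sequence.
probs = np.array([0.20, 0.05, 0.40, 0.10])

# Perplexity is the exponential of the average negative log-likelihood,
# i.e. the geometric mean of the inverse probabilities. Lower is better.
perplexity = np.exp(-np.mean(np.log(probs)))
print(perplexity)
```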
The core concepts here are:
- embed each word in a fixed-size window and concatenate the embeddings
- pass the concatenation through a hidden layer
- apply a softmax to get a distribution over the next word (see the sketch below)
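A minimal NumPy sketch of the fixed-window neural LM described above; the vocabulary size, window size, dimensions, and word indices are assumed values for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, window, hidden = 10_000, 50, 4, 128    # assumed sizes

E = rng.normal(size=(V, d))                  # word embedding matrix
W = rng.normal(size=(window * d, hidden))    # hidden-layer weights
U = rng.normal(size=(hidden, V))             # output-layer weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_next(word_ids):
    """Fixed-window LM: embed the last `window` words, concatenate them,
    pass through one hidden layer, softmax over the vocabulary."""
    x = E[word_ids].reshape(-1)              # concatenated embeddings
    h = np.tanh(x @ W)                       # hidden layer
    return softmax(h @ U)                    # distribution over the next word

dist = predict_next([12, 7, 381, 5])         # hypothetical word indices
print(dist.argmax(), dist.sum())
```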
RNNs are a family of neural architectures: at every timestep a new hidden state is computed from the previous hidden state and the current input, reusing the same weights at each step.
The hidden state from the preceding step is carried forward and used to compute all following states and their output distributions.
The total loss is the sum of the per-timestep losses (see the sketch below).
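A minimal NumPy sketch of a vanilla RNN language model unrolled over a short sequence, with the per-timestep cross-entropy losses summed into one total loss; all sizes and the toy token sequence are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, hidden = 1_000, 32, 64                     # assumed sizes

E   = rng.normal(size=(V, d)) * 0.01             # embeddings
W_h = rng.normal(size=(hidden, hidden)) * 0.01   # hidden-to-hidden, reused every step
W_e = rng.normal(size=(d, hidden)) * 0.01        # input-to-hidden, reused every step
U   = rng.normal(size=(hidden, V)) * 0.01        # hidden-to-output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

tokens = [3, 17, 42, 8, 99]                      # hypothetical word indices
h = np.zeros(hidden)                             # initial hidden state
total_loss = 0.0

# Unroll: each new hidden state depends on the previous one,
# and the same W_h, W_e, U are applied at every timestep.
for t in range(len(tokens) - 1):
    x = E[tokens[t]]
    h = np.tanh(h @ W_h + x @ W_e)
    y_hat = softmax(h @ U)
    total_loss += -np.log(y_hat[tokens[t + 1]])  # cross-entropy at step t

print(total_loss / (len(tokens) - 1))            # average per-step loss
```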
The gradient with respect to W_h is obtained by summing over all timesteps (or the relevant cells):
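The equation itself is not written out in these notes; one standard way to express it (an assumed formulation, where J^(t) is the loss at timestep t and the subscript (i) marks the contribution through the copy of W_h applied at step i) is:

```latex
\frac{\partial J^{(t)}}{\partial W_h}
  \;=\; \sum_{i=1}^{t}
    \left. \frac{\partial J^{(t)}}{\partial W_h} \right|_{(i)}
```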
This is also called backpropagation through time (BPTT).