• Language modeling: predicting what word comes next
  • LM is a benchmark task that helps us measure our progress on predicting language use
    • Pre-DL: n-grams
  • The standard evaluation metric is perplexity

The core concepts here, for the fixed-window neural LM, are (see the sketch after the list):

  1. concatenate the word embeddings inside a fixed-size window
  2. pass the concatenation through a hidden layer
  3. apply a softmax to get a distribution over the next word
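
A minimal sketch of those three steps (all sizes, weight names, and word ids below are illustrative assumptions, not from the notes):

```python
import numpy as np

# Hypothetical dimensions for illustration only.
vocab_size, embed_dim, window, hidden_dim = 10_000, 100, 4, 256
rng = np.random.default_rng(0)

E  = rng.normal(scale=0.01, size=(vocab_size, embed_dim))          # embedding matrix
W  = rng.normal(scale=0.01, size=(hidden_dim, window * embed_dim)) # hidden-layer weights
b1 = np.zeros(hidden_dim)
U  = rng.normal(scale=0.01, size=(vocab_size, hidden_dim))         # output-layer weights
b2 = np.zeros(vocab_size)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def fixed_window_lm(context_ids):
    """Distribution over the next word, given a fixed window of word ids."""
    e = E[context_ids].reshape(-1)   # 1. look up and concatenate window embeddings
    h = np.tanh(W @ e + b1)          # 2. one hidden layer with a nonlinearity
    return softmax(U @ h + b2)       # 3. softmax over the vocabulary

probs = fixed_window_lm(np.array([12, 85, 7, 3001]))  # four arbitrary word ids
print(probs.shape, probs.sum())                       # (10000,) ~1.0
```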

RNNs are a family of neural architectures: at every timestep a new hidden state is computed from the previous hidden state and the current input, and the same weights are reused at every step.
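
A sketch of that recurrence, assuming a simple tanh RNN; the names W_h and W_e stand for the hidden-to-hidden and input-to-hidden weights referred to below:

```python
import numpy as np

# Hypothetical dimensions for illustration only.
embed_dim, hidden_dim = 100, 256
rng = np.random.default_rng(0)

W_h = rng.normal(scale=0.01, size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights
W_e = rng.normal(scale=0.01, size=(hidden_dim, embed_dim))   # input-to-hidden weights
b   = np.zeros(hidden_dim)

def rnn_step(h_prev, e_t):
    """One timestep: the same W_h, W_e, b are applied at every position."""
    return np.tanh(W_h @ h_prev + W_e @ e_t + b)

h = np.zeros(hidden_dim)                     # initial hidden state h^(0)
for e_t in rng.normal(size=(5, embed_dim)):  # five dummy input embeddings
    h = rnn_step(h, e_t)                     # h^(t) depends on h^(t-1) and the new input
```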

The probability of a whole sequence is the product of the per-step conditional probabilities: each word is predicted given all the words that precede it.
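
Written out as the standard chain-rule decomposition:

P(x^(1), ..., x^(T)) = ∏_{t=1}^{T} P(x^(t) | x^(t-1), ..., x^(1))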

The overall training loss is the sum (in practice, the average) of the per-timestep losses.
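
Concretely, with a cross-entropy loss J^(t)(θ) at each step t (predicting the next word x^(t+1)), one common way to write it is:

J(θ) = (1/T) ∑_{t=1}^{T} J^(t)(θ),   where   J^(t)(θ) = -log ŷ^(t)[x^(t+1)]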

The gradient with respect to W_h is obtained by summing the contributions from every timestep at which W_h is applied:
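
In one common notation, where |_(i) marks the contribution of the copy of W_h used at timestep i:

∂J^(t)/∂W_h = ∑_{i=1}^{t} ∂J^(t)/∂W_h |_(i)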

This is called backpropagation through time (BPTT): ordinary backpropagation applied to the network unrolled over timesteps, with gradients accumulated across those timesteps.