- Language modeling: predicting what word comes next
- LM is a benchmark task that helps us measure our progress on predicting language use
- Pre-DL: n-grams
- The standard evaluation metric is perplexity (see the sketch after this list)
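A minimal sketch of how perplexity could be computed from per-token probabilities; the `probs` array here is a made-up model output, not from the notes:

```python
import numpy as np

# Hypothetical probabilities the model assigns to the correct next word
# at each position of a held-out sequence.
probs = np.array([0.20, 0.05, 0.40, 0.10])

# Perplexity is the exponential of the average negative log-likelihood,
# i.e. the geometric mean of the inverse probabilities. Lower is better.
perplexity = np.exp(-np.mean(np.log(probs)))
print(perplexity)
```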
The core concepts here are:
- embed each word in a fixed-size window and concatenate the embeddings
- pass the concatenation through a hidden layer
- apply a softmax to get a distribution over the next word (see the sketch below)
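A minimal NumPy sketch of the fixed-window neural LM described above; the vocabulary size, window size, dimensions, and word indices are assumed values for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, window, hidden = 10_000, 50, 4, 128    # assumed sizes

E = rng.normal(size=(V, d))                  # word embedding matrix
W = rng.normal(size=(window * d, hidden))    # hidden-layer weights
U = rng.normal(size=(hidden, V))             # output-layer weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_next(word_ids):
    """Fixed-window LM: embed the last `window` words, concatenate them,
    pass through one hidden layer, softmax over the vocabulary."""
    x = E[word_ids].reshape(-1)              # concatenated embeddings
    h = np.tanh(x @ W)                       # hidden layer
    return softmax(h @ U)                    # distribution over the next word

dist = predict_next([12, 7, 381, 5])         # hypothetical word indices
print(dist.argmax(), dist.sum())
```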
RNNs are a family of neural architectures: at every timestep a new hidden state is computed from the previous hidden state and the current input, reusing the same weights at each step.
The hidden state from the preceding step is carried forward and used to compute all following states and their output distributions.
The total loss is the sum of the per-timestep losses (see the sketch below).
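A minimal NumPy sketch of a vanilla RNN language model unrolled over a short sequence, with the per-timestep cross-entropy losses summed into one total loss; all sizes and the toy token sequence are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, hidden = 1_000, 32, 64                     # assumed sizes

E   = rng.normal(size=(V, d)) * 0.01             # embeddings
W_h = rng.normal(size=(hidden, hidden)) * 0.01   # hidden-to-hidden, reused every step
W_e = rng.normal(size=(d, hidden)) * 0.01        # input-to-hidden, reused every step
U   = rng.normal(size=(hidden, V)) * 0.01        # hidden-to-output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

tokens = [3, 17, 42, 8, 99]                      # hypothetical word indices
h = np.zeros(hidden)                             # initial hidden state
total_loss = 0.0

# Unroll: each new hidden state depends on the previous one,
# and the same W_h, W_e, U are applied at every timestep.
for t in range(len(tokens) - 1):
    x = E[tokens[t]]
    h = np.tanh(h @ W_h + x @ W_e)
    y_hat = softmax(h @ U)
    total_loss += -np.log(y_hat[tokens[t + 1]])  # cross-entropy at step t

print(total_loss / (len(tokens) - 1))            # average per-step loss
```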
The gradient with respect to W_h is obtained by summing over all timesteps (or the relevant cells):
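The equation itself is not written out in these notes; one standard way to express it (an assumed formulation, where J^(t) is the loss at timestep t and the subscript (i) marks the contribution through the copy of W_h applied at step i) is:

```latex
\frac{\partial J^{(t)}}{\partial W_h}
  \;=\; \sum_{i=1}^{t}
    \left. \frac{\partial J^{(t)}}{\partial W_h} \right|_{(i)}
```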
This is also called backpropagation through time (BPTT).