$J(\theta)$ is the loss function; its gradient is $\nabla_\theta J(\theta)$. The algorithm is called gradient descent.

The gradient points in the direction of the steepest increase of the function, so moving in the opposite direction decreases the loss fastest.

The idea:

  • for the current value of $\theta$, calculate the gradient of $J(\theta)$;
  • then take a small step in the direction of the negative gradient.

Therefore, each step updates the parameters as

$$\theta^{\text{new}} = \theta^{\text{old}} - \alpha \, \nabla_\theta J(\theta)$$

where $\alpha$ is the step size (learning rate).
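As a concrete sketch, here is plain gradient descent in Python; the quadratic toy loss and every name in it are illustrative assumptions, not part of these notes:

```python
import numpy as np

def gradient_descent(grad_J, theta, alpha=0.1, n_steps=100):
    """Plain gradient descent: step against the full gradient each iteration."""
    for _ in range(n_steps):
        theta = theta - alpha * grad_J(theta)  # theta_new = theta_old - alpha * gradient
    return theta

# Toy loss (an assumption for illustration): J(theta) = ||theta||^2,
# whose gradient is 2 * theta, so the minimum is at the origin.
theta = gradient_descent(lambda t: 2 * t, theta=np.array([3.0, -4.0]))
print(theta)  # converges toward [0, 0]
```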

The problem is that $J(\theta)$ is a function of all the windows in the corpus, so computing its exact gradient requires a full pass over the data; we cannot wait that long before making every single update.

We can use stochastic gradient descent (SGD) instead: repeatedly sample a single window (or a small batch of windows), compute the gradient on that sample alone, and update the parameters after each sample, as sketched below.
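A minimal sketch of the stochastic update, assuming a per-example gradient function `grad_J_i`; the scalar toy loss and all names here are illustrative assumptions:

```python
import random

def sgd(grad_J_i, theta, data, alpha=0.01, n_steps=20_000):
    """Stochastic gradient descent: per step, estimate the gradient from one
    randomly sampled example instead of the entire corpus."""
    for _ in range(n_steps):
        example = random.choice(data)                     # sample one window/example
        theta = theta - alpha * grad_J_i(theta, example)  # noisy but cheap update
    return theta

# Toy per-example loss (an assumption): J_i(theta) = (theta - x_i)^2,
# so grad_J_i = 2 * (theta - x_i) and the minimizer is the mean of the data.
data = [1.0, 2.0, 3.0, 4.0]
print(sgd(lambda t, x: 2 * (t - x), theta=0.0, data=data))  # near 2.5
```

Each update is a noisy estimate of the true gradient, but it is so cheap that many noisy steps beat one exact step over the whole corpus.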

Word2vec itself comes in two model variants:

  • Skip-gram (SG): predict the outside (context) words given the center word.
  • CBOW (continuous bag of words): predict the center word from the bag of outside words.
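To make the difference concrete, here is a small sketch of the training pairs each variant would extract from one window; the function name and example sentence are made up for illustration:

```python
def training_pairs(tokens, center_idx, window=2):
    """Build the (input, target) pairs each variant trains on for one position."""
    context = [tokens[i]
               for i in range(max(0, center_idx - window),
                              min(len(tokens), center_idx + window + 1))
               if i != center_idx]
    center = tokens[center_idx]
    skipgram = [(center, o) for o in context]  # SG: center -> each outside word
    cbow = [(context, center)]                 # CBOW: all outside words -> center
    return skipgram, cbow

sg, cb = training_pairs("the quick brown fox jumps".split(), center_idx=2)
print(sg)  # [('brown', 'the'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps')]
print(cb)  # [(['the', 'quick', 'fox', 'jumps'], 'brown')]
```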