$J(\theta)$ is the loss function; its gradient is $\nabla_\theta J(\theta)$. The algorithm we use to minimize it is called gradient descent.
The gradient points in the direction of the steepest increase of the function, so moving in the opposite direction decreases the loss fastest.
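For a concrete (purely illustrative) example: if $J(\theta) = \theta^2$, the gradient is $2\theta$; at $\theta = 3$ it equals $+6$, pointing toward larger loss, so stepping in the negative direction (decreasing $\theta$) reduces $J$.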
The idea:
- for the current value of $\theta$, calculate the gradient of $J(\theta)$,
- then take a small step in the direction of the negative gradient.
Therefore, the update rule is:
$$\theta^{\text{new}} = \theta^{\text{old}} - \alpha \nabla_\theta J(\theta)$$
where $\alpha$ is the step size (learning rate).
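As a minimal sketch of this update rule (the toy loss, function name, and learning rate are illustrative assumptions, not from the notes), applied to $J(\theta) = \theta^2$:

```python
# Minimal gradient descent sketch on the toy loss J(theta) = theta^2,
# whose gradient is 2*theta. alpha and the step count are illustrative.
def gradient_descent(theta, alpha=0.1, steps=50):
    for _ in range(steps):
        grad = 2 * theta              # gradient of J(theta) = theta^2
        theta = theta - alpha * grad  # small step against the gradient
    return theta

print(gradient_descent(theta=3.0))  # converges toward the minimum at theta = 0
```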
The problem is that $J(\theta)$ is a function of all windows in the corpus, so computing its exact gradient requires a full pass over the data; we cannot wait that long before making each single update.
Instead, we can use stochastic gradient descent (SGD): repeatedly sample a single window (or a small mini-batch), compute the gradient on just that sample, and update $\theta$ immediately.
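A hedged sketch of the idea (the quadratic per-example loss is a stand-in for the real per-window word2vec loss; all names here are illustrative):

```python
import random

# SGD sketch: instead of a full pass over the corpus per update, sample one
# example and step on its gradient alone. Here each example x contributes
# the toy loss (theta - x)^2, whose gradient is 2*(theta - x).
def sgd(theta, data, alpha=0.05, steps=1000):
    for _ in range(steps):
        x = random.choice(data)    # sample a single training example
        grad = 2 * (theta - x)     # per-example gradient only
        theta -= alpha * grad      # cheap update; no full-corpus gradient
    return theta

data = [1.0, 2.0, 3.0, 4.0]
print(sgd(theta=0.0, data=data))   # noisily approaches the mean, 2.5
```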
Two model variants:
- Skip-gram (SG): predict the outside (context) words given the center word.
- CBOW (continuous bag of words): predict the center word from the bag of outside words.
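To make the difference concrete, here is an illustrative sketch (not word2vec's actual implementation) of how each variant frames its training instances; the window size and example sentence are assumptions:

```python
# Build training instances for both variants from one tokenized sentence.
def training_instances(tokens, window=2):
    sg, cbow = [], []
    for i, center in enumerate(tokens):
        outside = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        sg.extend((center, o) for o in outside)  # SG: center -> each outside word
        cbow.append((outside, center))           # CBOW: bag of outside words -> center
    return sg, cbow

sg, cbow = training_instances("we saw a cute grey cat".split())
print(sg[:2])   # [('we', 'saw'), ('we', 'a')]
print(cbow[0])  # (['saw', 'a'], 'we')
```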