$J(\theta)$ is the loss function; its gradient is $\nabla_\theta J(\theta)$. The algorithm is called gradient descent.

The gradient points in the direction of the steepest increase of the function, so moving in the opposite direction decreases the loss fastest.

The idea:

  • for the current value of $\theta$, calculate the gradient of $J(\theta)$;
  • then take a small step in the direction of the negative gradient.

Therefore, each step updates the parameters as

$$\theta^{\text{new}} = \theta^{\text{old}} - \alpha \, \nabla_\theta J(\theta)$$

where $\alpha$ is the step size (learning rate).
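As a concrete sketch, here is plain gradient descent in Python; the quadratic toy loss and every name in it are illustrative assumptions, not part of these notes:

```python
import numpy as np

def gradient_descent(grad_J, theta, alpha=0.1, n_steps=100):
    """Plain gradient descent: step against the full gradient each iteration."""
    for _ in range(n_steps):
        theta = theta - alpha * grad_J(theta)  # theta_new = theta_old - alpha * gradient
    return theta

# Toy loss (an assumption for illustration): J(theta) = ||theta||^2,
# whose gradient is 2 * theta, so the minimum is at the origin.
theta = gradient_descent(lambda t: 2 * t, theta=np.array([3.0, -4.0]))
print(theta)  # converges toward [0, 0]
```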

The problem is that $J(\theta)$ is a function of all the windows in the corpus, so computing its exact gradient requires a full pass over the data; we cannot wait that long before making every single update.

We can use stochastic gradient descent (SGD) instead: repeatedly sample a single window (or a small batch of windows), compute the gradient on that sample alone, and update the parameters after each sample, as sketched below.
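A minimal sketch of the stochastic update, assuming a per-example gradient function `grad_J_i`; the scalar toy loss and all names here are illustrative assumptions:

```python
import random

def sgd(grad_J_i, theta, data, alpha=0.01, n_steps=20_000):
    """Stochastic gradient descent: per step, estimate the gradient from one
    randomly sampled example instead of the entire corpus."""
    for _ in range(n_steps):
        example = random.choice(data)                     # sample one window/example
        theta = theta - alpha * grad_J_i(theta, example)  # noisy but cheap update
    return theta

# Toy per-example loss (an assumption): J_i(theta) = (theta - x_i)^2,
# so grad_J_i = 2 * (theta - x_i) and the minimizer is the mean of the data.
data = [1.0, 2.0, 3.0, 4.0]
print(sgd(lambda t, x: 2 * (t - x), theta=0.0, data=data))  # near 2.5
```

Each update is a noisy estimate of the true gradient, but it is so cheap that many noisy steps beat one exact step over the whole corpus.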

Word2vec itself comes in two model variants:

  • Skip-gram (SG): predict the outside (context) words given the center word.
  • CBOW (continuous bag of words): predict the center word from the bag of outside words.
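To make the difference concrete, here is a small sketch of the training pairs each variant would extract from one window; the function name and example sentence are made up for illustration:

```python
def training_pairs(tokens, center_idx, window=2):
    """Build the (input, target) pairs each variant trains on for one position."""
    context = [tokens[i]
               for i in range(max(0, center_idx - window),
                              min(len(tokens), center_idx + window + 1))
               if i != center_idx]
    center = tokens[center_idx]
    skipgram = [(center, o) for o in context]  # SG: center -> each outside word
    cbow = [(context, center)]                 # CBOW: all outside words -> center
    return skipgram, cbow

sg, cb = training_pairs("the quick brown fox jumps".split(), center_idx=2)
print(sg)  # [('brown', 'the'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps')]
print(cb)  # [(['the', 'quick', 'fox', 'jumps'], 'brown')]
```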