n-grams represent sequences of n consecutive words, and help computers model how words are associated with each other.

Instead of conditioning on the full history of a word, we can approximate it with only the previous n-1 words.
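
As a sketch in standard notation (the symbols w_i and n are introduced here for illustration and are not part of the notes), the approximation replaces the full history of a word with its last n-1 words:

```latex
P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
```

For a bigram model (n = 2) the right-hand side reduces to P(w_i | w_{i-1}).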

Formal rules

C(w) = how many times the word w appears in the text
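
With counts defined this way, the usual maximum-likelihood estimate of a bigram probability can be written as below. This is a sketch in assumed notation: w_{i-1} is the previous word, w_i the current word, and C applied to two words is the count of that pair.

```latex
P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})}
```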

Exercise

  • Build a Bigram Model

I am Sam, Sam I am, Sam I like

  1. add the start and end symbols to each sentence: <s> I am Sam </s>, <s> Sam I like </s>
  2. split each sentence into bigrams: (<s>, I)(I, am)(am, Sam)(Sam, </s>)...
  3. count how many times each bigram occurs: (<s>, Sam) = 3 times
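  4. count the frequency of each word (any word), e.g. count(like) = 3, and divide each bigram count by the count of its first word, e.g. 3/5
    1. where 3 = the number of times the pair was seen
    2. where 5 = the frequency of Sam, the word being conditioned on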
  5. Repeat until every bigram of the sentence being scored has a probability
    1. P(Sam|<s>)
    2. P(do|Sam)
    3. multiply all the probabilities
  6. Apply Laplace (add-one) smoothing
    1. Add the vocabulary size V to the bottom part of the fraction, and 1 to the upper part
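
The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: the function names (`tokenize`, `count_ngrams`, `bigram_prob`, `sentence_prob`) are made up for this example, the corpus is only the three sentences listed in the exercise (so the counts it produces are for that three-sentence corpus only), and the vocabulary size V is assumed to include the <s> and </s> symbols, which is one convention among several.

```python
from collections import Counter

# Assumption: the corpus is just the three sentences listed in the exercise.
SENTENCES = ["I am Sam", "Sam I am", "Sam I like"]


def tokenize(sentence):
    """Step 1: wrap the sentence in start/end symbols and split on spaces."""
    return ["<s>"] + sentence.split() + ["</s>"]


def count_ngrams(sentences):
    """Steps 2-4: count every bigram and every single token."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams, laplace=False):
    """P(word | prev) = C(prev, word) / C(prev), optionally add-one smoothed."""
    if laplace:
        # Assumption: V counts distinct tokens, including <s> and </s>.
        vocab_size = len(unigrams)
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    if unigrams[prev] == 0:
        return 0.0  # unseen conditioning word: no estimate without smoothing
    return bigrams[(prev, word)] / unigrams[prev]


def sentence_prob(sentence, unigrams, bigrams, laplace=False):
    """Step 5: multiply the probabilities of all bigrams in the sentence."""
    tokens = tokenize(sentence)
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word, unigrams, bigrams, laplace)
    return prob


unigrams, bigrams = count_ngrams(SENTENCES)

if __name__ == "__main__":
    print(bigrams[("<s>", "Sam")])
    print(bigram_prob("<s>", "Sam", unigrams, bigrams))
    print(bigram_prob("Sam", "do", unigrams, bigrams))
    print(bigram_prob("Sam", "do", unigrams, bigrams, laplace=True))
    print(sentence_prob("Sam I am", unigrams, bigrams))
```

Without smoothing, a pair that never occurs in this small corpus, such as (Sam, do), gets probability 0 and makes the whole sentence product 0. With `laplace=True` the same pair gets a small non-zero probability.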