An n-gram is a sequence of n consecutive words; n-gram models help computers capture how words are associated with each other.
Instead of conditioning on the whole history of a word, we can approximate it with the previous n-1 words.
Formal rules
C(w) = how many times the word w appears in the text
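As a sketch of how the count C is used, assuming the bigram case (n = 2): the history of each word is reduced to the single previous word, and each conditional probability is estimated from counts. The symbols w_i (the i-th word, with w_0 = <s>), m (sentence length) and the pair count C(w_{i-1} w_i) are notation introduced here for illustration.

```latex
P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-1}),
\qquad
P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})}
```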
Exercise
- Build a Bigram Model
I am Sam, Sam I am, Sam I like
- put the start and the ending symbols:
<s> I am Sam </s>, <s> Sam I like </s>
- divide each sentence into bigrams:
(<s>, I)(I, am)(am, Sam)(Sam, </s>)...
- count the duplicates (how many times each bigram occurs):
(<s>, Sam) = 3 times
- count the frequency of each word (any word):
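count(like) = 3
- compute the probability of each bigram as C(w1, w2) / C(w1), e.g. 3 / 5
- where 3 = times that I've seen the couple
- where 5 is the frequency of the first word of the couple, Sam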
- Repeat until I get all segments of a certain sentence:
P(Sam|<s>)P(do|Sam)
- multiply all the probabilities
- Apply Laplace (add-one) smoothing
- Add the vocabulary size V to the bottom part of the fraction, and 1 to the upper part
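The steps of the exercise can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names (`pad`, `train`, `bigram_prob`, `sentence_prob`) are hypothetical, sentences are split on whitespace, and the vocabulary size V is taken to be the number of distinct tokens including `<s>` and `</s>`, which is a simplifying assumption (conventions differ on whether `<s>` is counted).

```python
from collections import Counter

# Toy corpus from the exercise; whitespace tokenisation is an assumption.
corpus = ["I am Sam", "Sam I am", "Sam I like"]

def pad(sentence):
    """Wrap a sentence with the start and end symbols."""
    return ["<s>"] + sentence.split() + ["</s>"]

def train(sentences):
    """Count single words and adjacent pairs over the padded sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = pad(sentence)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams, laplace=False):
    """Estimate P(word | prev), optionally with add-one (Laplace) smoothing."""
    pair, history = bigrams[(prev, word)], unigrams[prev]
    if laplace:
        vocab_size = len(unigrams)  # V: distinct tokens (assumed to include <s>, </s>)
        return (pair + 1) / (history + vocab_size)
    return pair / history if history else 0.0

def sentence_prob(sentence, unigrams, bigrams, laplace=False):
    """Multiply the bigram probabilities of every adjacent pair in the sentence."""
    tokens = pad(sentence)
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word, unigrams, bigrams, laplace)
    return prob

unigrams, bigrams = train(corpus)
print(bigram_prob("<s>", "Sam", unigrams, bigrams))
print(sentence_prob("Sam I am", unigrams, bigrams))
print(sentence_prob("I like Sam", unigrams, bigrams))                # unseen pair
print(sentence_prob("I like Sam", unigrams, bigrams, laplace=True))  # smoothed
```

The counts are computed from these three sentences only, so they need not match the illustrative figures used in the steps above. The last two calls show why smoothing is applied: without it, a single unseen pair such as (like, Sam) drives the whole sentence probability to zero.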