N-Grams

N-grams represent sequences of words and help computers model how words are associated with each other.

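Instead of conditioning on the full history of n-1 preceding words, we approximate that history with only the most recent words (for a bigram, just the previous word).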

Formal rules

Bigrams: $P(W_n \mid W_1 \dots W_{n-1}) \approx P(W_n \mid W_{n-1})$

Word sequence: $P(W_1^n) = P(W_1 \dots W_n)$

Chain rule: $P(W_1^n) = \prod_{k=1}^{n} P(W_k \mid W_1^{k-1})$

Bigram approximation: $P(W_n \mid W_{n-1}) = \frac{C(W_{n-1} W_n)}{C(W_{n-1})}$

N-gram approximation: $P(W_n \mid W_{n-N+1} \dots W_{n-1}) = \frac{C(W_{n-N+1} \dots W_{n-1} W_n)}{C(W_{n-N+1} \dots W_{n-1})}$
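
Combining the chain rule with the bigram approximation gives the form used to score a whole sentence. As a worked step (taking $W_0$ to be the start symbol <s>, which is an assumption made here for the boundary case):

$$
P(W_1^n) = \prod_{k=1}^{n} P(W_k \mid W_1^{k-1}) \approx \prod_{k=1}^{n} P(W_k \mid W_{k-1})
$$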

$C(\cdot)$ = count: the number of times a word, or a sequence of words, appears in the text
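
The count ratio in the N-gram approximation can be sketched in Python. This is a minimal illustration rather than a reference implementation: the helper names `ngram_counts` and `ngram_prob`, the whitespace tokenization, and the toy token list are all assumptions made for the example.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens, i.e. C(w_1 ... w_n)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_prob(tokens, history, word):
    """Estimate P(word | history) as C(history + word) / C(history)."""
    history = tuple(history)
    numerator = ngram_counts(tokens, len(history) + 1)[history + (word,)]
    denominator = ngram_counts(tokens, len(history))[history]
    # An unseen history has no estimate; return 0.0 instead of dividing by zero.
    return numerator / denominator if denominator else 0.0

# Toy token stream (hypothetical): two padded sentences joined together.
tokens = "<s> I am Sam </s> <s> Sam I am </s>".split()

# Bigram estimate: C(<s> Sam) / C(<s>) = 1 / 2
print(ngram_prob(tokens, ["<s>"], "Sam"))
# Trigram estimate: C(<s> Sam I) / C(<s> Sam) = 1 / 1
print(ngram_prob(tokens, ["<s>", "Sam"], "I"))
```

The length of `history` selects the order of the model: one word of history gives a bigram estimate, two words a trigram estimate, and so on.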

Example corpus: I am Sam, Sam I am, Sam I like

Add the start and end symbols to each sentence: <s> I am Sam </s>, <s> Sam I like </s>
Split each sentence into bigrams: (<s>, I)(I, am)(am, Sam)(Sam, </s>)...
Count how many times each bigram occurs: count(<s>, Sam) = 3
Count the frequency of each individual word: count(like) = 3
$P(w_2 \mid w_1) = \frac{\text{count}(w_1, w_2)}{\text{count}(w_1)} \to P(\text{Sam} \mid \text{<s>}) = \frac{3}{5} = 0.6$
1. where 3 = number of times the pair (<s>, Sam) has been seen
2. where 5 = number of times the first word of the pair, <s>, occurs
Repeat for every bigram of the sentence being scored
1. P(Sam|<s>)
2. P(do|Sam)
3. multiply all the probabilities
Apply Laplace (add-one) smoothing
1. Add $|V|$ (the vocabulary size, here 6) to the denominator and 1 to the numerator $\to$ $P(\text{Sam} \mid \text{<s>}) = \frac{3 + 1}{5 + 6} = \frac{4}{11}$
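
The procedure above can be sketched end to end in Python. This is a minimal sketch under stated assumptions: the function names (`train_bigrams`, `bigram_prob`, `sentence_prob`) are hypothetical, sentences are tokenized on whitespace, and the vocabulary is taken to include <s> and </s>, which gives $|V| = 6$ for the three example sentences. Counted on those three sentences alone, count(<s>, Sam) = 2 and count(<s>) = 3, so the printed estimates will not match the 3/5 quoted in the steps.

```python
from collections import Counter

def train_bigrams(sentences):
    """Pad each sentence with <s> and </s>, then count unigrams and bigrams."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(w1, w2, unigrams, bigrams, laplace=False):
    """P(w2 | w1) = count(w1, w2) / count(w1), optionally add-one smoothed."""
    pair, first = bigrams[(w1, w2)], unigrams[w1]
    if laplace:
        # Add 1 to the numerator and |V| to the denominator.
        return (pair + 1) / (first + len(unigrams))
    return pair / first if first else 0.0

def sentence_prob(sentence, unigrams, bigrams, laplace=False):
    """Multiply the bigram probabilities of every adjacent pair of tokens."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for w1, w2 in zip(tokens, tokens[1:]):
        prob *= bigram_prob(w1, w2, unigrams, bigrams, laplace)
    return prob

corpus = ["I am Sam", "Sam I am", "Sam I like"]
unigrams, bigrams = train_bigrams(corpus)

print(bigram_prob("<s>", "Sam", unigrams, bigrams))                # 2 / 3
print(bigram_prob("<s>", "Sam", unigrams, bigrams, laplace=True))  # (2 + 1) / (3 + 6)
print(sentence_prob("Sam I am", unigrams, bigrams))                # product of four bigram estimates
```

Without smoothing, a single unseen bigram makes the whole product zero; with `laplace=True` every bigram gets a small non-zero probability, which is the motivation for the last step.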