BPE Byte Pair Encoding is an algorithm to segment words
Related: Word Segmentation
Algorithm
function byte_pair_encoding(strings S, number of merges K): V vocab
V <-- all unique chars in S # initial set of tokens is characters
for i = 1 to k: # merge tokens till k times
t1, tr # most frequent pair of adjacent tokens in S
tnew <- t1 + t2 # make new tokens by concatening
V <-- V + tnew # update vocabulary
ret V
Example
First, we need the “K” passes.
(hug, 10) (pug, 5) (pun, 12) (bun, 4) (hugs, 5)
V = h,u,g,p,n,b,s
with k = 2
1.
corpus
(h,u,g 10) (p,u,g 5) (p,u,n 12) (b,u,n 4) (h,u,g,s 5)
most frequent couple? ug --> cost 25
so I update V
V += {ug}
V = h,u,g,p,n,b,s,ug
2.
corpus
(h,ug 10) (p,ug 5) (p,u,n 12) (b,u,n 4) (h,ug,s 5)
most frequent couple? un
cost 16
corpus
(h,ug 10) (p,ug 5) (p,un 12) (b,un 4) (h,ug,s 5)
V += {un}
V = h,u,g,p,n,b,s,ug,un
3.
most frequent couple? hug
cost 15
corpus
(hug 10) (p,ug 5) (p,un 12) (b,un 4) (hug,s 5)
V += {hug}
V = h,u,g,p,n,b,s,ug,un,hug