BPE Byte Pair Encoding is an algorithm to segment words

Related: Word Segmentation

Algorithm

function byte_pair_encoding(strings S, number of merges K): V vocab
	V <-- all unique chars in S    # initial set of tokens is characters
	for i = 1 to k:        # merge tokens till k times
		t1, tr             # most frequent pair of adjacent tokens in S
		tnew <- t1 + t2    # make new tokens by concatening
		V <-- V + tnew     # update vocabulary
	ret V

Example

First, we need the “K” passes.

(hug, 10) (pug, 5) (pun, 12) (bun, 4) (hugs, 5)
V = h,u,g,p,n,b,s
with k = 2

1.
corpus
(h,u,g 10) (p,u,g 5) (p,u,n 12) (b,u,n 4) (h,u,g,s 5)

most frequent couple? ug --> cost 25

so I update V

V += {ug}
V = h,u,g,p,n,b,s,ug

2.
corpus
(h,ug 10) (p,ug 5) (p,u,n 12) (b,u,n 4) (h,ug,s 5)

most frequent couple? un
cost 16
corpus
(h,ug 10) (p,ug 5) (p,un 12) (b,un 4) (h,ug,s 5)
V += {un}
V = h,u,g,p,n,b,s,ug,un

3.
most frequent couple? hug
cost 15
corpus
(hug 10) (p,ug 5) (p,un 12) (b,un 4) (hug,s 5)
V += {hug}
V = h,u,g,p,n,b,s,ug,un,hug