Minimum Edit Distance is an algorithm to find how similar are two strings.
It measures the minimum cost for the application of the editing operations to be performed in order to align X to Y.
Edit operations are: insert, delete and substitute
Initialization: create a matrix with rows (n=|x|) and m+1 cols (in other words, a row and a column more than the length of the two words, with all zeros)
For each cell, compare the adjacent cells (up, left, diagonal left) + 1: take the minimum and write it down in the cell. If x and y are the same though, if the letters are the same then write 0, else copy cost.
def min_edit_dist(s1, s2):
m, n = len(s1), len(s2)
dp = [[0]*(n+1) for _ in range(m+1)]
for i in range(m+1): dp[i][0] = i
for j in range(n+1): dp[0][j] = j
for i in range(1, m+1):
for j in range(1, n+1):
if s1[i-1] == s2[j-1]:
dp[i][j] = dp[i-1][j-1]
else:
dp[i][j] = 1 + min(dp[i][j-1], dp[i-1][j], dp[i-1][j-1])
return dp[m][n]Applications
- hey vs hello
| ” | h | e | l | l | o | |
|---|---|---|---|---|---|---|
| ” | 0 | 1 | 2 | 3 | 4 | 5 |
| h | 1 | 0 | 1 | 2 | 3 | 4 |
| e | 2 | 1 | 0 | 1 | 2 | 3 |
| y | 3 | 2 | 1 | 1 | 2 | 3 |
- mela vs mamma
| ” | m | a | m | m | a | |
|---|---|---|---|---|---|---|
| ” | 0 | 1 | 2 | 3 | 4 | 5 |
| m | 1 | 0 | 1 | 2 | 3 | 4 |
| e | 2 | 1 | 1 | 2 | 3 | 4 |
| l | 3 | 2 | 2 | 2 | 3 | 4 |
| a | 4 | 3 | 2 | 3 | 3 | 3 |
Note that sometimes, the letters are the same, example a with a. If yes, you should copy the top left corner value (and cost = 0), else cost = 1.