Edit Distance in Python

104,148

Solution 1

The thing you are looking at is called an edit distance and here is a nice explanation on wiki. There are a lot of ways how to define a distance between the two words and the one that you want is called Levenshtein distance and here is a DP (dynamic programming) implementation in python.

def levenshteinDistance(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1

    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2+1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]

And a couple of more implementations are here.

Solution 2

difflib in the standard library has various utilities for sequence matching, including the get_close_matches method that you could use. It uses an algorithm adapted from Ratcliff and Obershelp.

From the docs

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']

Solution 3

Here is my version for Levenshtein distance

def edit_distance(s1, s2):
    m=len(s1)+1
    n=len(s2)+1

    tbl = {}
    for i in range(m): tbl[i,0]=i
    for j in range(n): tbl[0,j]=j
    for i in range(1, m):
        for j in range(1, n):
            cost = 0 if s1[i-1] == s2[j-1] else 1
            tbl[i,j] = min(tbl[i, j-1]+1, tbl[i-1, j]+1, tbl[i-1, j-1]+cost)

    return tbl[i,j]

print(edit_distance("Helloworld", "HalloWorld"))

Solution 4

#this calculates edit distance not levenstein edit distance
word1="rice"

word2="ice"

len_1=len(word1)

len_2=len(word2)

x =[[0]*(len_2+1) for _ in range(len_1+1)]#the matrix whose last element ->edit distance

for i in range(0,len_1+1): #initialization of base case values

    x[i][0]=i
for j in range(0,len_2+1):

    x[0][j]=j
for i in range (1,len_1+1):

    for j in range(1,len_2+1):

        if word1[i-1]==word2[j-1]:
            x[i][j] = x[i-1][j-1] 

        else :
            x[i][j]= min(x[i][j-1],x[i-1][j],x[i-1][j-1])+1

print x[i][j]

Solution 5

I would recommend not creating this kind of code on your own. There are libraries for that.

For instance the Levenshtein library.


In [2]: Levenshtein.distance("foo", "foobar")
Out[2]: 3

In [3]: Levenshtein.distance("barfoo", "foobar")
Out[3]: 6

In [4]: Levenshtein.distance("Buroucrazy", "Bureaucracy")
Out[4]: 3

In [5]: Levenshtein.distance("Misisipi", "Mississippi")
Out[5]: 3

In [6]: Levenshtein.distance("Misisipi", "Misty Mountains")
Out[6]: 11

In [7]: Levenshtein.distance("Buroucrazy", "Born Crazy")
Out[7]: 4

Share:
104,148
Mel
Author by

Mel

Updated on July 09, 2022

Comments

  • Mel
    Mel almost 2 years

    I'm programming a spellcheck program in Python. I have a list of valid words (the dictionary) and I need to output a list of words from this dictionary that have an edit distance of 2 from a given invalid word.

    I know I need to start by generating a list with an edit distance of one from the invalid word(and then run that again on all the generated words). I have three methods, inserts(...), deletions(...) and changes(...) that should output a list of words with an edit distance of 1, where inserts outputs all valid words with one more letter than the given word, deletions outputs all valid words with one less letter, and changes outputs all valid words with one different letter.

    I've checked a bunch of places but I can't seem to find an algorithm that describes this process. All the ideas I've come up with involve looping through the dictionary list multiple times, which would be extremely time consuming. If anyone could offer some insight, I'd be extremely grateful.

  • Nikana Reklawyks
    Nikana Reklawyks over 6 years
    DP standing for dynamic programming.
  • Artur Krajewski
    Artur Krajewski over 5 years
    Your code is incorrect. Logic is unsound and return tbl[i,j] should be return tbl[n,m].
  • alancalvitti
    alancalvitti over 5 years
    @Salvador Dali Shouldn't adjacent transposition return a distance of 1? The above function gives levenshteinDistance("abc","bac") --> 2
  • graffe
    graffe about 5 years
    @alancalvitti The only operations are insert, delete and substitute. So in you example you need to delete the a and then reinsert it between the b and c. Two operations.
  • Outcast
    Outcast over 4 years
    Hello :), do you have by chance a version of it (or of another algorithm) where you can specify an upper bound at the levenshtein distance? (You may check also my post for more details: stackoverflow.com/questions/59686989/…)
  • Russ
    Russ over 3 years
    Running this for '123' and '12345' produces a required_edits of length 1, while the edit distance is 2.
  • Glenn Slayden
    Glenn Slayden over 2 years
    You are referencing a variable s (i.e., s[i-1]!=s[j-1]) which is not defined in your code.
  • J.Michael
    J.Michael over 2 years
    Thanks made the correction! Also noted that in Python True + n = n+1, and False + n =n so no need to explicitly convert to type 'int'
  • MERose
    MERose over 2 years
    ...except that this is not the edit/Levenshtein distance.
  • ryanjdillon
    ryanjdillon over 2 years
    @MERose My answer is aimed at giving a solution to the problem that the question presents for other looking to solve the same problem-- and not to answer it per what appears to be a homework assignment's specifications.
  • dantiston
    dantiston over 2 years
    This is amazing!