Python: find closest string (from a list) to another string

96,715

Solution 1

Use difflib.get_close_matches.

>>> words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']
>>> difflib.get_close_matches('Hello', words)
['hello', 'Hallo', 'hallo']

Please look at the documentation, because the function returns 3 or less closest matches by default.

Solution 2

There is an awesome article with a complete source code (21 lines) provided by Peter Norvig on spelling correction.

http://norvig.com/spell-correct.html

The idea is to build all possible edits of your word,

hello - helo   - deletes    
hello - helol  - transpose    
hello - hallo  - replaces    
hello - heallo - inserts    


def edits1(word):
   splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   deletes    = [a + b[1:] for a, b in splits if b]
   transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
   replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
   inserts    = [a + c + b     for a, b in splits for c in alphabet]
   return set(deletes + transposes + replaces + inserts)

Now, look up each of these edits in your list.

Peter's article is a great read and worth reading.

Solution 3

I was looking at this answer for getting a closest match from a list or possible alternatives of

difflib.get_close_matches

I found Amjith's answer based on Peter Norwig's post and thought it might be a good replacement. Before I realised it might not be quite much suited for my use case, I had made a class out of it. So this is a version of spell where you can use it for different set of words. Maybe best use can be for population name matching.

import re
from collections import Counter

def words(text): return re.findall(r'\w+', text.lower())

# WORDS = Counter(words(open('big.txt').read()))


class WordMatcher:

    def __init__(self, big_text):
        self.WORDS=Counter(words(big_text))
        self.N = sum(self.WORDS.values())

    def P(self, word):
        "Probability of `word`."
        return self.WORDS[word] / self.N

    def correction(self, word):
        "Most probable spelling correction for word."
        return max(self.candidates(word), key=self.P)

    def candidates(self, word):
        "Generate possible spelling corrections for word."
        return (self.known([word]) or self.known(self.edits1(word)) or self.known(self.edits2(word)) or [word])

    def known(self, words):
        "The subset of `words` that appear in the dictionary of WORDS."
        return set(w for w in words if w in self.WORDS)

    def edits1(self, word):
        "All edits that are one edit away from `word`."
        letters    = 'abcdefghijklmnopqrstuvwxyz'
        splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
        deletes    = [L + R[1:]               for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
        replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
        inserts    = [L + c + R               for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def edits2(self, word):
        "All edits that are two edits away from `word`."
        return (e2 for e1 in self.edits1(word) for e2 in self.edits1(e1))

Solution 4

Create a sorted list of your words and use the bisect module to identify the point in the sorted list where your word would fit according to the sorting order. Based on that position you can give the k nearest neighbours above and below to find the 2k closest words.

Share:
96,715

Related videos on Youtube

Laura
Author by

Laura

Updated on July 05, 2022

Comments

  • Laura
    Laura almost 2 years

    Let's say I have a string "Hello" and a list

    words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo','question', 'Hallo', 'format']
    

    How can I find the n words that are the closest to "Hello" and present in the list words ?

    In this case, we would have ['hello', 'hallo', 'Hallo', 'hi', 'format'...]

    So the strategy is to sort the list words from the closest word to the furthest.

    I thought about something like this

    word = 'Hello'
    for i, item in enumerate(words):
        if lower(item) > lower(word):
          ...
    

    but it's very slow in large lists.

    UPDATE difflib works but it's very slow also. (words list has 630000+ words inside (sorted and one per line)). So checking the list takes 5 to 7 seconds for every search for closest word!

    • tripleee
      tripleee about 12 years
      Maybe you are looking for something like editing distance or Levinshtein distance?
    • Gareth Latty
      Gareth Latty about 12 years
      Exactly, this largely depends on what your definition of 'closest' is.
    • Peter Wood
      Peter Wood about 12 years
      Are the 630,000 words sorted? Are they in a file, one word per line?
    • Nick Johnson
      Nick Johnson about 12 years
      How do you intend to define 'closest'? In your sample code, you're using a lexicographic comparison, but that ranks 'hermitage' as a better match for 'hello' than 'jello' is.
    • Nick
      Nick over 8 years
      Did you find an efficient solution for 6M+ dictionary items? I'm stocked here as well.
  • Niklas B.
    Niklas B. about 12 years
    Just a quick FYI: difflib.get_close_matches("Hallo", words, len(words), 0) would give all matches :)
  • Oleh Prypin
    Oleh Prypin about 12 years
    Are you sure this will work? I don't think lexicographical order is what OP wants...
  • Maksym Polshcha
    Maksym Polshcha about 12 years
    The Levenshtein difference can be used as well. There is a good python implementation pypi.python.org/pypi/python-Levenshtein
  • user1308520
    user1308520 about 12 years
    Considering the code snippet from the question, such a simple solution could be all that is required.
  • Laura
    Laura about 12 years
    difflib works but it's very slow also. (words list has 630000+ words inside). So checking the list takes 5 to 7 seconds for every search for closest word!
  • Peter Wood
    Peter Wood about 12 years
    @Laura There is a difference between being slow and taking a long time. 7 seconds might be a long time but it might not be slow.
  • laurasia
    laurasia over 11 years
    Thanks, that is a great find. I am writing an integrated dictionary and this might be useful.
  • laurasia
    laurasia over 11 years
    This won't work if possible spelling mistakes are to be considered, especially if mistakes are made at the beginning of the word. But a good solution for correct words indeed.
  • toto_tico
    toto_tico almost 6 years
    @Oleh, I was wondering if you happen to know if there is an alternative of difflib.get_close_matches that return the list indexes, not the strings. I posted another question here
  • stud_eco
    stud_eco about 4 years
    Hi @NiklasB, in this case how can I get only one closest word
  • fuzzyTew
    fuzzyTew over 3 years
    needs implementation of 'close'
  • MAC
    MAC over 3 years
    Can someone point out to algorithm behind difflib? I am curious to read about it.
  • Charalamm
    Charalamm over 3 years
    I am putting a string from the list but it isn't returning something. Could it be because I am using non-english strings/characters? I have tried also lowering the cutoff
  • Pablo
    Pablo over 3 years
    This is just a datastructure to hold the words, it explains nothing about computing similarity.
  • hafiz031
    hafiz031 about 3 years
    If I am not mistaken, this should not work. For example: zoom and boom have much similar spelling, but they will be situated at the two opposite ends of your sorted list. How do you match them?