String similarity metrics in Python

49,316

Solution 1

There's a great resource for string similarity metrics at the University of Sheffield. It has a list of various metrics (beyond just Levenshtein) and has open-source implementations of them. Looks like many of them should be easy to adapt into Python.

http://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html

Here's a bit of the list:

  • Hamming distance
  • Levenshtein distance
  • Needleman-Wunch distance or Sellers Algorithm
  • and many more...

Solution 2

I realize it's not the same thing, but this is close enough:

>>> import difflib
>>> a = 'Hello, All you people'
>>> b = 'hello, all You peopl'
>>> seq=difflib.SequenceMatcher(a=a.lower(), b=b.lower())
>>> seq.ratio()
0.97560975609756095

You can make this as a function

def similar(seq1, seq2):
    return difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio() > 0.9

>>> similar(a, b)
True
>>> similar('Hello, world', 'Hi, world')
False

Solution 3

This snippet will calculate the difflib, Levenshtein, Sørensen, and Jaccard similarity values for two strings. In the snippet below, I was iterating over a tsv in which the strings of interest occupied columns [3] and [4] of the tsv. (pip install python-Levenshtein and pip install distance):

import codecs, difflib, Levenshtein, distance

with codecs.open("titles.tsv","r","utf-8") as f:
    title_list = f.read().split("\n")[:-1]

    for row in title_list:

        sr      = row.lower().split("\t")

        diffl   = difflib.SequenceMatcher(None, sr[3], sr[4]).ratio()
        lev     = Levenshtein.ratio(sr[3], sr[4]) 
        sor     = 1 - distance.sorensen(sr[3], sr[4])
        jac     = 1 - distance.jaccard(sr[3], sr[4])

        print diffl, lev, sor, jac

Solution 4

I would use Levenshtein distance, or the so-called Damerau distance (which takes transpositions into account) rather than the difflib stuff for two reasons (1) "fast enough" (dynamic programming algo) and "whoooosh" (bit-bashing) C code is available and (2) well-understood behaviour e.g. Levenshtein satisfies the triangle inequality and thus can be used in e.g. a Burkhard-Keller tree.

Threshold: you should treat as "positive" only those cases where distance < (1 - X) * max(len(string1), len(string2)) and adjust X (the similarity factor) to suit yourself. One way of choosing X is to get a sample of matches, calculate X for each, ignore cases where X < say 0.8 or 0.9, then sort the remainder in descending order of X and eye-ball them and insert the correct result and calculate some cost-of-mistakes measure for various levels of X.

N.B. Your ape/apple example has distance 2, so X is 0.6 ... I would only use a threshold as low as 0.75 if I were desperately looking for something and had a high false-negative penalty

Solution 5

Is that what you mean?

>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']

look at http://docs.python.org/library/difflib.html#difflib.get_close_matches

Share:
49,316
agiliq
Author by

agiliq

We are agiliq. We build amazing web apps. @agiliqdotcomgooglefacebooklinkedin

Updated on July 09, 2022

Comments

  • agiliq
    agiliq almost 2 years

    I want to find string similarity between two strings. This page has examples of some of them. Python has an implemnetation of Levenshtein algorithm. Is there a better algorithm, (and hopefully a python library), under these contraints.

    1. I want to do fuzzy matches between strings. eg matches('Hello, All you people', 'hello, all You peopl') should return True
    2. False negatives are acceptable, False positives, except in extremely rare cases are not.
    3. This is done in a non realtime setting, so speed is not (much) of concern.
    4. [Edit] I am comparing multi word strings.

    Would something other than Levenshtein distance(or Levenshtein ratio) be a better algorithm for my case?

  • agiliq
    agiliq over 14 years
    Thank you. This will probably give me some good ideas, but not what I am looking for get_close_matches('appel', ['ape', 'peach', 'puppy']) gets me ape. I can set the cutoff, but from some quick experiments, false-positives are far too common.
  • Feyzi Bagirov
    Feyzi Bagirov almost 7 years
    I am getting "IndexError: list index out of range" error when running this. Why am I getting it?
  • duhaime
    duhaime almost 7 years
    @FeyziBagirov can you post a github gist with your script and input?