How to find best fuzzy match for a string in a large string database

15,215

Solution 1

This paper seems to describe exactly what you want.

Lucene (http://lucene.apache.org/) also implements Levenshtein edit distance.

Solution 2

You didn't mention your database system, but for PostrgreSQL you could use the following contrib module: trgm - Trigram matching for PostgreSQL

The pg_trgm contrib module provides functions and index classes for determining the similarity of text based on trigram matching.

Solution 3

If your database supports it, you should use full-text search. Otherwise, you can use an indexer like lucene and its various implementations.

Share:
15,215
guillermooo
Author by

guillermooo

Time you enjoyed wasting is not wasted time.

Updated on June 11, 2022

Comments

  • guillermooo
    guillermooo about 2 years

    I have a database of strings (arbitrary length) which holds more than one million items (potentially more).

    I need to compare a user-provided string against the whole database and retrieve an identical string if it exists or otherwise return the closest fuzzy match(es) (60% similarity or better). The search time should ideally be under one second.

    My idea is to use edit distance for comparing each db string to the search string after narrowing down the candidates from the db based on their length.

    However, as I will need to perform this operation very often, I'm thinking about building an index of the db strings to keep in memory and query the index, not the db directly.

    Any ideas on how to approach this problem differently or how to build the in-memory index?

  • fmunshi
    fmunshi almost 15 years
    The first link appears to have gone. :-/
  • Peter Burns
    Peter Burns over 14 years
    I emailed a contact, to see if we can track down zarawesome and fix this link. Unfortunately no direct email was provided, so..
  • zaratustra
    zaratustra over 14 years
    Sorry, yeah, I don't remember what the paper was about. I suggest you search for "Levenshtein edit distance" and see what comes up.
  • reinierpost
    reinierpost over 14 years
    Soundex cannot see through many misspellings or other variants. It works well on names but not on arbitrary strings.
  • Oddthinking
    Oddthinking over 14 years
    Interesting. I didn't know it was focussed on names. I knew NYIIS was. (en.wikipedia.org/wiki/…)
  • krishnakumarp
    krishnakumarp over 8 years
    The paper is "Fast String Correction with Levenshtein-Automata" by Klaus U. Schulz , Stoyan Mihov. CiteSeerx link is given below. citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.26.2940