Find closest match to input string in a list of strings

20,179

Solution 1

Edit distance

Edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other.

Levenshtein distance

Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.

Fast, memory efficient Levenshtein algorithm

C# Levenshtein

using System;

/// <summary>
/// Contains approximate string matching
/// </summary>
static class LevenshteinDistance
{
    /// <summary>
    /// Compute the distance between two strings.
    /// </summary>
    public static int Compute(string s, string t)
    {
    int n = s.Length;
    int m = t.Length;
    int[,] d = new int[n + 1, m + 1];

    // Step 1
    if (n == 0)
    {
        return m;
    }

    if (m == 0)
    {
        return n;
    }

    // Step 2
    for (int i = 0; i <= n; d[i, 0] = i++)
    {
    }

    for (int j = 0; j <= m; d[0, j] = j++)
    {
    }

    // Step 3
    for (int i = 1; i <= n; i++)
    {
        //Step 4
        for (int j = 1; j <= m; j++)
        {
        // Step 5
        int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

        // Step 6
        d[i, j] = Math.Min(
            Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
            d[i - 1, j - 1] + cost);
        }
    }
    // Step 7
    return d[n, m];
    }
}

class Program
{
    static void Main()
    {
    Console.WriteLine(LevenshteinDistance.Compute("aunt", "ant"));
    Console.WriteLine(LevenshteinDistance.Compute("Sam", "Samantha"));
    Console.WriteLine(LevenshteinDistance.Compute("flomax", "volmax"));
    }
}

Solution 2

.NET does not supply anything out of the box - you need to implement a an Edit Distance algorithm yourself. For example, you can use Levenshtein Distance, like this:

// This code is an implementation of the pseudocode from the Wikipedia,
// showing a naive implementation.
// You should research an algorithm with better space complexity.
public static int LevenshteinDistance(string s, string t) {
    int n = s.Length;
    int m = t.Length;
    int[,] d = new int[n + 1, m + 1];
    if (n == 0) {
        return m;
    }
    if (m == 0) {
        return n;
    }
    for (int i = 0; i <= n; d[i, 0] = i++)
        ;
    for (int j = 0; j <= m; d[0, j] = j++)
       ;
    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
            d[i, j] = Math.Min(
                Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                d[i - 1, j - 1] + cost);
        }
    }
    return d[n, m];
}

Call LevenshteinDistance(targetString, possible[i]) for each i, then pick the string possible[i] for which LevenshteinDistance returns the smallest value.

Share:
20,179

Related videos on Youtube

Admin
Author by

Admin

Updated on July 09, 2022

Comments

  • Admin
    Admin almost 2 years

    I have problems finding an implementation of closest match strings for .net

    I would like to match a list of strings, example:

    input string: "Publiczna Szkoła Podstawowa im. Bolesława Chrobrego w Wąsoszu"

    List of strings:

    Publiczna Szkoła Podstawowa im. B. Chrobrego w Wąsoszu

    Szkoła Podstawowa Specjalna

    Szkoła Podstawowa im.Henryka Sienkiewicza w Wąsoszu

    Szkoła Podstawowa im. Romualda Traugutta w Wąsoszu Górnym

    This would clearly need to be matched with "Publiczna Szkoła Podstawowa im. B. Chrobrego w Wąsoszu".

    What algorithms are there available for .net?

  • mike
    mike about 5 years
    Very fast! Your code's results validated 100% against my implementation, but yours ran 2x faster than what I came up with.
  • Ali123
    Ali123 almost 3 years
    Does this work with comparing sentences with spaces and special characters?
  • Ali123
    Ali123 almost 3 years
    This just finds the closest in length to the given string. If i pass part of the word to a list of strings, it will choose the one that is closest in length rather than the one that has the actual string.
  • Sergey Kalinichenko
    Sergey Kalinichenko almost 3 years
    @Ali123 Wrong: this algorithm finds the closest string by edit distance, not by length. If you do not understand what this means, there is a link at the top of the answer to help you.
  • Ali123
    Ali123 almost 3 years
    I tried it with an option that has the actual keyword that i am searching for but it ignored it and chose the shortest sentence. See this example where i used the fuzzySearch plugin which gave the highest value that i am looking for while the function you kindly provided choose the shortest string. dotnetfiddle.net/dy5rm5
  • Sergey Kalinichenko
    Sergey Kalinichenko almost 3 years
    @Ali123 Let's do it by hand: to go from "transcribe" to "ecm" you need to delete nine characters (t, r, a, n, s, c, r, i, and b) and then insert two more characters at the end (c and m), for a total score of 9+2 = 11. To go from "open form for ecm" to just "ecm" you need to delete everything before "ecm", which comes to 14 characters in total. 11 is lower than 14, so in terms of the edit distance "transcribe" wins.
  • Ali123
    Ali123 almost 3 years
    i understand that. Your function works perfectly, but it's not suitable for my needs. is "ECM" is closer to "Open form for ECM" than "transcribe" phonetically, which is what i needed in my case. Levenshtein Distance might be sutiable for auto correction and other use cases. Thanks again.