Find closest match to input string in a list of strings
Solution 1
Edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other.
Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
Fast, memory efficient Levenshtein algorithm
using System;
/// <summary>
/// Contains approximate string matching
/// </summary>
static class LevenshteinDistance
{
/// <summary>
/// Compute the distance between two strings.
/// </summary>
public static int Compute(string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];
// Step 1
if (n == 0)
{
return m;
}
if (m == 0)
{
return n;
}
// Step 2
for (int i = 0; i <= n; d[i, 0] = i++)
{
}
for (int j = 0; j <= m; d[0, j] = j++)
{
}
// Step 3
for (int i = 1; i <= n; i++)
{
//Step 4
for (int j = 1; j <= m; j++)
{
// Step 5
int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
// Step 6
d[i, j] = Math.Min(
Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
// Step 7
return d[n, m];
}
}
class Program
{
static void Main()
{
Console.WriteLine(LevenshteinDistance.Compute("aunt", "ant"));
Console.WriteLine(LevenshteinDistance.Compute("Sam", "Samantha"));
Console.WriteLine(LevenshteinDistance.Compute("flomax", "volmax"));
}
}
Solution 2
.NET does not supply anything out of the box - you need to implement a an Edit Distance algorithm yourself. For example, you can use Levenshtein Distance, like this:
// This code is an implementation of the pseudocode from the Wikipedia,
// showing a naive implementation.
// You should research an algorithm with better space complexity.
public static int LevenshteinDistance(string s, string t) {
int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];
if (n == 0) {
return m;
}
if (m == 0) {
return n;
}
for (int i = 0; i <= n; d[i, 0] = i++)
;
for (int j = 0; j <= m; d[0, j] = j++)
;
for (int i = 1; i <= n; i++) {
for (int j = 1; j <= m; j++) {
int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
d[i, j] = Math.Min(
Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
return d[n, m];
}
Call LevenshteinDistance(targetString, possible[i])
for each i
, then pick the string possible[i]
for which LevenshteinDistance
returns the smallest value.
Related videos on Youtube
Admin
Updated on July 09, 2022Comments
-
Admin almost 2 years
I have problems finding an implementation of closest match strings for .net
I would like to match a list of strings, example:
input string: "Publiczna Szkoła Podstawowa im. Bolesława Chrobrego w Wąsoszu"
List of strings:
Publiczna Szkoła Podstawowa im. B. Chrobrego w Wąsoszu
Szkoła Podstawowa Specjalna
Szkoła Podstawowa im.Henryka Sienkiewicza w Wąsoszu
Szkoła Podstawowa im. Romualda Traugutta w Wąsoszu Górnym
This would clearly need to be matched with "Publiczna Szkoła Podstawowa im. B. Chrobrego w Wąsoszu".
What algorithms are there available for .net?
-
mike about 5 yearsVery fast! Your code's results validated 100% against my implementation, but yours ran 2x faster than what I came up with.
-
Ali123 almost 3 yearsDoes this work with comparing sentences with spaces and special characters?
-
Ali123 almost 3 yearsThis just finds the closest in length to the given string. If i pass part of the word to a list of strings, it will choose the one that is closest in length rather than the one that has the actual string.
-
Sergey Kalinichenko almost 3 years@Ali123 Wrong: this algorithm finds the closest string by edit distance, not by length. If you do not understand what this means, there is a link at the top of the answer to help you.
-
Ali123 almost 3 yearsI tried it with an option that has the actual keyword that i am searching for but it ignored it and chose the shortest sentence. See this example where i used the fuzzySearch plugin which gave the highest value that i am looking for while the function you kindly provided choose the shortest string. dotnetfiddle.net/dy5rm5
-
Sergey Kalinichenko almost 3 years@Ali123 Let's do it by hand: to go from "transcribe" to "ecm" you need to delete nine characters (t, r, a, n, s, c, r, i, and b) and then insert two more characters at the end (c and m), for a total score of 9+2 = 11. To go from "open form for ecm" to just "ecm" you need to delete everything before "ecm", which comes to 14 characters in total. 11 is lower than 14, so in terms of the edit distance "transcribe" wins.
-
Ali123 almost 3 yearsi understand that. Your function works perfectly, but it's not suitable for my needs. is "ECM" is closer to "Open form for ECM" than "transcribe" phonetically, which is what i needed in my case. Levenshtein Distance might be sutiable for auto correction and other use cases. Thanks again.