Fuzzy matching using T-SQL

186,962

Solution 1

I've found that the stuff SQL Server gives you to do fuzzy matching is pretty clunky. I've had really good luck with my own CLR functions using the Levenshtein distance algorithm and some weighting. Using that algorithm, I've then made a UDF called GetSimilarityScore that takes two strings and returns a score between 0.0 and 1.0. The closer to 1.0 the match is, the better. Then, query with a threshold of >=0.8 or so to get the most likely matches. Something like this:

if object_id('tempdb..#similar') is not null drop table #similar
select a.id, (
    select top 1 x.id
   from MyTable x
   where x.id <> a.id
   order by dbo.GetSimilarityScore(a.MyField, x.MyField) desc
) as MostSimilarId
into #similar
from MyTable a

select *, dbo.GetSimilarityScore(a.MyField, c.MyField)
from MyTable a
join #similar b on a.id = b.id
join MyTable c on b.MostSimilarId = c.id

Just don't do it with really large tables. It's a slow process.

Here's the CLR UDFs:

''' <summary>
''' Compute the distance between two strings.
''' </summary>
''' <param name="s1">The first of the two strings.</param>
''' <param name="s2">The second of the two strings.</param>
''' <returns>The Levenshtein cost.</returns>
<Microsoft.SqlServer.Server.SqlFunction()> _
Public Shared Function ComputeLevenstheinDistance(ByVal string1 As SqlString, ByVal string2 As SqlString) As SqlInt32
    If string1.IsNull OrElse string2.IsNull Then Return SqlInt32.Null
    Dim s1 As String = string1.Value
    Dim s2 As String = string2.Value

    Dim n As Integer = s1.Length
    Dim m As Integer = s2.Length
    Dim d As Integer(,) = New Integer(n, m) {}

    ' Step 1
    If n = 0 Then Return m
    If m = 0 Then Return n

    ' Step 2
    For i As Integer = 0 To n
        d(i, 0) = i
    Next

    For j As Integer = 0 To m
        d(0, j) = j
    Next

    ' Step 3
    For i As Integer = 1 To n
        'Step 4
        For j As Integer = 1 To m
            ' Step 5
            Dim cost As Integer = If((s2(j - 1) = s1(i - 1)), 0, 1)

            ' Step 6
            d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
        Next
    Next
    ' Step 7
    Return d(n, m)
End Function

''' <summary>
''' Returns a score between 0.0-1.0 indicating how closely two strings match.  1.0 is a 100%
''' T-SQL equality match, and the score goes down from there towards 0.0 for less similar strings.
''' </summary>
<Microsoft.SqlServer.Server.SqlFunction()> _
Public Shared Function GetSimilarityScore(string1 As SqlString, string2 As SqlString) As SqlDouble
    If string1.IsNull OrElse string2.IsNull Then Return SqlInt32.Null

    Dim s1 As String = string1.Value.ToUpper().TrimEnd(" "c)
    Dim s2 As String = string2.Value.ToUpper().TrimEnd(" "c)
    If s1 = s2 Then Return 1.0F ' At this point, T-SQL would consider them the same, so I will too

    Dim flatLevScore As Double = InternalGetSimilarityScore(s1, s2)

    Dim letterS1 As String = GetLetterSimilarityString(s1)
    Dim letterS2 As String = GetLetterSimilarityString(s2)
    Dim letterScore As Double = InternalGetSimilarityScore(letterS1, letterS2)

    'Dim wordS1 As String = GetWordSimilarityString(s1)
    'Dim wordS2 As String = GetWordSimilarityString(s2)
    'Dim wordScore As Double = InternalGetSimilarityScore(wordS1, wordS2)

    If flatLevScore = 1.0F AndAlso letterScore = 1.0F Then Return 1.0F
    If flatLevScore = 0.0F AndAlso letterScore = 0.0F Then Return 0.0F

    ' Return weighted result
    Return (flatLevScore * 0.2F) + (letterScore * 0.8F)
End Function

Private Shared Function InternalGetSimilarityScore(s1 As String, s2 As String) As Double
    Dim dist As SqlInt32 = ComputeLevenstheinDistance(s1, s2)
    Dim maxLen As Integer = If(s1.Length > s2.Length, s1.Length, s2.Length)
    If maxLen = 0 Then Return 1.0F
    Return 1.0F - Convert.ToDouble(dist.Value) / Convert.ToDouble(maxLen)
End Function

''' <summary>
''' Sorts all the alpha numeric characters in the string in alphabetical order
''' and removes everything else.
''' </summary>
Private Shared Function GetLetterSimilarityString(s1 As String) As String
    Dim allChars = If(s1, "").ToUpper().ToCharArray()
    Array.Sort(allChars)
    Dim result As New StringBuilder()
    For Each ch As Char In allChars
        If Char.IsLetterOrDigit(ch) Then
            result.Append(ch)
        End If
    Next
    Return result.ToString()
End Function

''' <summary>
''' Removes all non-alpha numeric characters and then sorts
''' the words in alphabetical order.
''' </summary>
Private Shared Function GetWordSimilarityString(s1 As String) As String
    Dim words As New List(Of String)()
    Dim curWord As StringBuilder = Nothing
    For Each ch As Char In If(s1, "").ToUpper()
        If Char.IsLetterOrDigit(ch) Then
            If curWord Is Nothing Then
                curWord = New StringBuilder()
            End If
            curWord.Append(ch)
        Else
            If curWord IsNot Nothing Then
                words.Add(curWord.ToString())
                curWord = Nothing
            End If
        End If
    Next
    If curWord IsNot Nothing Then
        words.Add(curWord.ToString())
    End If

    words.Sort(StringComparer.OrdinalIgnoreCase)
    Return String.Join(" ", words.ToArray())
End Function

Solution 2

In addition to the other good info here, you might want to consider using the Double Metaphone phonetic algorithm which is generally considered to be better than SOUNDEX.

Tim Pfeiffer details an implementation in SQL in his article Double Metaphone Sounds Great Convert the C++ Double Metaphone algorithm to T-SQL (originally in SQL Mag & then in SQL Server Pro).

That will assist in matching names with slight misspellings, e.g., Carl vs. Karl.

Update: The actual downloadable code seems to be gone, but here's an implementation found on a github repo that appears to have cloned the original code

Solution 3

Since the first release of Master Data Services, you've got access to more advanced fuzzy logic algorithms than what SOUNDEX implements. So provided that you've got MDS installed, you'll be able to find a function called Similarity() in the mdq schema (MDS database).

More info on how it works: http://blog.hoegaerden.be/2011/02/05/finding-similar-strings-with-fuzzy-logic-functions-built-into-mds/

Solution 4

I would use SQL Server Full Text Indexing, which will allow you to do searches and return things that not only contain the word but also may have a misspelling.

Solution 5

I personally use a CLR implementation of the Jaro-Winkler algorithm which seems to work pretty well - it struggles a bit with strings longer than about 15 characters and doesn't like matching email addresses but otherwise is quite good - full implementation guide can be found here

Here is a sample of what that looks like in t-sql

SELECT * 
FROM myTable1 as t1 
INNER JOIN myTable2 as t2 
ON dbo.StringDistance(t1.MyField, t2.MyField) > 0.85

An advantage of this approach over soundex is that this can more easily handle typos, because if the typo produces different sound then SOUNDEX will not match it correctly.

For example, if you typed “ytpe” instead of “type” then SOUNDEX will not give you a correct match. The codes returned for SOUNDEX(‘ytpe’) and SOUNDEX(‘type’) are the following:

ytpe Y310

type T100

If I use Jaro-Winkler StringDistance(‘ytpe’,’type’) here is what I get 0.91(6), which means it is a good match. Remember 1 – is exact match, so 0.91 is very close to it.

If you are unable to use CLR functions for whatever reasons, maybe you could try running the data through an SSIS package (using the fuzzy transformation lookup) - detailed here

Share:
186,962

Related videos on Youtube

Frederik
Author by

Frederik

Updated on July 05, 2022

Comments

  • Frederik
    Frederik almost 2 years

    I have a table Persons with personaldata and so on. There are lots of columns but the once of interest here are: addressindex, lastname and firstname where addressindex is a unique address drilled down to the door of the apartment. So if I have 'like below' two persons with the lastname and one the firstnames are the same they are most likely duplicates.

    I need a way to list these duplicates.

    tabledata:
    
    personid     1
    firstname    "Carl"
    lastname     "Anderson"
    addressindex 1
    
    personid     2
    firstname    "Carl Peter"
    lastname     "Anderson"
    addressindex 1
    

    I know how do this if I were to match exactly on all columns but I need fuzzy match to do the trick with (from the above example) a result like:

    Row     personid      addressindex     lastname     firstname
    1       2             1                Anderson     Carl Peter
    2       1             1                Anderson     Carl
    .....
    

    Any hints on how to solve this in a good way?

    • dburges
      dburges about 15 years
      BTW way, it is quite likely it is NOT the same person in the case given. Fathers and sons do live together at times you know.
    • Tomalak
      Tomalak about 15 years
      This is always the problem with semi-clever address evaluation algorithms. You can make an assumption, but you can never be sure.
    • Frederik
      Frederik about 15 years
      Good point although The descition is another issue based on the result of the fuzzy match.
    • Valentino Vranken
      Valentino Vranken over 12 years
      @justSteve: I added a new answer proposing a more up-to-date possibility.
  • Frederik
    Frederik about 15 years
    Thand, i have considered it bit se use standard edition and full text search isn't an option here.
  • Russ Bradberry
    Russ Bradberry about 15 years
    Full Text Search is available in all editions of SQL Server 2005 and 2008
  • Rana Hamza
    Rana Hamza over 12 years
    Not having access to MDS and thanking that I'm not working with Big Data - this looks like a great fit. Highly appreciate the details.
  • Andreas
    Andreas over 10 years
    SQL Server Fulltext indexing does NOT support fuzzy search. When the searched text contains misspellings, it will fail!
  • D'Arcy Rittich
    D'Arcy Rittich about 10 years
    @Lèsemajesté Link fixed.
  • Mat
    Mat almost 10 years
    Full Text Indexing - via the CONTAINS() function - supports fuzzy in the sense of 'multiple terms in the vicinity of one another'. It does not support fuzzy in the sense of 'slightly different spellings'.
  • codingbadger
    codingbadger over 6 years
    That link is now dead... again
  • Philip
    Philip over 3 years
    Hi @mattmc3, how do I get that CLR UDF into SQL server?
  • D'Arcy Rittich
    D'Arcy Rittich about 3 years
    @codingbadger Works for me. It's also available in the Wayback Machine: doubleMetaphone.sql