Matching similar city names in SQL

12,390

Solution 1

If you implement the Levenshtein Distance algorithm as a user-defined function, it will return the number of operations that need to be performed on string_1 so that it becomes string_2. You can then compare the result of the Levenshtein Distance function against a fixed threshold, or against a percentage length of string_1 or string_2.

You would simply use it as follows:

WHERE LD(city_1, city_2) < 4;

Using Full-Text Search may be another option, especially since an implementation of Levenshtein Distance would require a full table scan. This decision may depend on how frequently you intend to do this comparison.

You may want to check out the following Levenshtein Distance implementation for SQL Server:

Solution 2

You could use Soundex to compare two strings that are spelt different but have a similar pronounciation.

It depends how they are misspelt. If it is just typos, probably use Levenshtein Distance that Daniel Vassallo recommends. If it is misspellings by people who weren't sure how the city was spelt, use Soundex.

Maybe use both!

Solution 3

The SoundEx function would be the best option for such scenarios, but only works when the vowels in a word are incorrect or absent. If the consonants mismatch, it would not work. Another approach to do this would be to write a simple logic of defining the appropriate mismatch limit between two words; though would not a give 100% accuracy, might solve the purpose. A simple scalar valued function which uses the SoundeEx function - internally, should be sufficient enough.

Solution 4

The best solution is to use SOUNDEX. I tried some test: It matches Waterland, Witerland but not Wiperland. I think this should fulfill your requirements. SOUNDEX converts an alpha string to a four-character code to find similar-sounding words or names.

select * from HotelSourceMap where SOUNDEX([city]) = SOUNDEX('Waterland')

==> Match

select * from HotelSourceMap where SOUNDEX([city]) = SOUNDEX('Witerland')

==> Match

select * from HotelSourceMap where SOUNDEX([city]) = SOUNDEX('Wiperland')

==> No Match

Share:
12,390

Related videos on Youtube

Dr. Rajesh Rolen
Author by

Dr. Rajesh Rolen

Over 15 years of successful experience in development of multi-tier applications and system integration solutions as application architect, project leader, technical lead, and software engineer. Very good understanding of application analysis and design concepts. Strong ability to apply proven design patterns to the system design and employ extreme programming and SCRUM techniques to the robust implementations. PhD (Computer Science &amp; Engineering), MCA, BCA, MCTS, MCP, SCJP. Experience of working in the complete Software development life cycle involving SRS, Software architecture design, database design, Code Reviews, development, and documentation. Capable to delve into the new leading Technologies. Ability to work well in both a team environment and individual environment. Ability to train the team. Areas of Expertise Strong in Architecture design, Design Principles, Design Patterns and OOPs concepts. Capable of developing cross-platform code and managing project. Web applications and web services development using ASP.NET MVC with C#.NET/ VB.NET. Client-Server based applications developed using C#.NET and VB.NET Strong in Business requirement analysis and functional specification design and documentation. Through knowledge of Object Oriented Analysis and Design OOAD and N - Tier Architecture Strong in front-end GUI development using ASP.Net, HTML, JavaScript, JQuery. Strong in backend database development including designing and administering databases, - writing stored procedures, SQL and triggers for SQL Server, Oracle, and My-SQL databases. Strong Analytical Skills. Strong oral and written communication skills Download CV

Updated on June 04, 2022

Comments

  • Dr. Rajesh Rolen
    Dr. Rajesh Rolen almost 2 years

    I have a table "City" which contains city names, and I have a another table which I just created and contains cities from different sources. When I run a query to match the cities between the two tables I find about 5000 mismatches.

    So please give some queries which I can use to match cities (because sometimes users enter city names with one or two character different)... I have created a query which is working fine but I need such a query to match more.

    Please suggest me what to do in such a situation.

    SELECT distinct hsm.countryname,co.countryname,hsm.city,co.city
    FROM   HotelSourceMap AS hsm
    INNER  JOIN 
        (  SELECT c.*,cu.countryName
           FROM   city c
           INNER  JOIN  country cu ON c.countryid= cu.countryId
        ) co
    ON (charindex(co.city,hsm.city) > 0 AND hsm.countryid = co.countryid) AND
        hsm.cityid is null
    
    • Dr. Rajesh Rolen
      Dr. Rajesh Rolen over 14 years
      hi hubens nice to see u after so many days.
  • Dr. Rajesh Rolen
    Dr. Rajesh Rolen over 14 years
    i have created function Levenshtein as u said but its giving me error when i am using it Cannot find either column "dbo" or the user-defined function or aggregate "dbo.MIN3", or the name is ambiguous. i am using it in this way (as u said) SELECT hsm.* FROM HotelSourceMap hsm, city c WHERE dbo.LEVENSHTEIN(hsm.city, c.city) < 4
  • Daniel Vassallo
    Daniel Vassallo over 14 years
    Yes, I forgot to mention that. You need to define the MIN3 function. Use this small script: tek-tips.com/viewthread.cfm?qid=1194707. Call it MIN3 instead of fnMin3.
  • Dr. Rajesh Rolen
    Dr. Rajesh Rolen over 14 years
    now tell me which one i need to call fnmin3 or Levenshtein.. please explain clearly
  • Dr. Rajesh Rolen
    Dr. Rajesh Rolen over 14 years
    its not matching correctly if i try < 4 ..... so now i am tring for <2 ...thanks a lot...
  • Dr. Rajesh Rolen
    Dr. Rajesh Rolen over 14 years
    How much time it will take to finish?
  • Daniel Vassallo
    Daniel Vassallo over 14 years
    I'm glad it helped. You may need to tweak the threshold a bit, as you noted. You may also want to consider using a variable threshold based on the average length of the two cities, so that a long city name can have more typos than a short one.
  • Sharique
    Sharique almost 14 years
    soundex works well for single word but not for multi-word name like select SOUNDEX('new york'),SOUNDEX('new delhi')
  • Dr. Rajesh Rolen
    Dr. Rajesh Rolen over 12 years
    @Mongus: Thanks, really very helpful for me.