Matching similar city names in SQL
Solution 1
If you implement the Levenshtein Distance algorithm as a user-defined function, it will return the number of operations that need to be performed on string_1 so that it becomes string_2. You can then compare the result of the Levenshtein Distance function against a fixed threshold, or against a percentage length of string_1 or string_2.
You would simply use it as follows:
WHERE LD(city_1, city_2) < 4;
Using Full-Text Search may be another option, especially since an implementation of Levenshtein Distance would require a full table scan. This decision may depend on how frequently you intend to do this comparison.
You may want to check out the following Levenshtein Distance implementation for SQL Server:
Solution 2
You could use Soundex to compare two strings that are spelt different but have a similar pronounciation.
It depends how they are misspelt. If it is just typos, probably use Levenshtein Distance that Daniel Vassallo recommends. If it is misspellings by people who weren't sure how the city was spelt, use Soundex.
Maybe use both!
Solution 3
The SoundEx function would be the best option for such scenarios, but only works when the vowels in a word are incorrect or absent. If the consonants mismatch, it would not work. Another approach to do this would be to write a simple logic of defining the appropriate mismatch limit between two words; though would not a give 100% accuracy, might solve the purpose. A simple scalar valued function which uses the SoundeEx function - internally, should be sufficient enough.
Solution 4
The best solution is to use SOUNDEX. I tried some test: It matches Waterland, Witerland but not Wiperland. I think this should fulfill your requirements. SOUNDEX converts an alpha string to a four-character code to find similar-sounding words or names.
select * from HotelSourceMap where SOUNDEX([city]) = SOUNDEX('Waterland')
==> Match
select * from HotelSourceMap where SOUNDEX([city]) = SOUNDEX('Witerland')
==> Match
select * from HotelSourceMap where SOUNDEX([city]) = SOUNDEX('Wiperland')
==> No Match
Related videos on Youtube
Dr. Rajesh Rolen
Over 15 years of successful experience in development of multi-tier applications and system integration solutions as application architect, project leader, technical lead, and software engineer. Very good understanding of application analysis and design concepts. Strong ability to apply proven design patterns to the system design and employ extreme programming and SCRUM techniques to the robust implementations. PhD (Computer Science & Engineering), MCA, BCA, MCTS, MCP, SCJP. Experience of working in the complete Software development life cycle involving SRS, Software architecture design, database design, Code Reviews, development, and documentation. Capable to delve into the new leading Technologies. Ability to work well in both a team environment and individual environment. Ability to train the team. Areas of Expertise Strong in Architecture design, Design Principles, Design Patterns and OOPs concepts. Capable of developing cross-platform code and managing project. Web applications and web services development using ASP.NET MVC with C#.NET/ VB.NET. Client-Server based applications developed using C#.NET and VB.NET Strong in Business requirement analysis and functional specification design and documentation. Through knowledge of Object Oriented Analysis and Design OOAD and N - Tier Architecture Strong in front-end GUI development using ASP.Net, HTML, JavaScript, JQuery. Strong in backend database development including designing and administering databases, - writing stored procedures, SQL and triggers for SQL Server, Oracle, and My-SQL databases. Strong Analytical Skills. Strong oral and written communication skills Download CV
Updated on June 04, 2022Comments
-
Dr. Rajesh Rolen almost 2 years
I have a table "City" which contains city names, and I have a another table which I just created and contains cities from different sources. When I run a query to match the cities between the two tables I find about 5000 mismatches.
So please give some queries which I can use to match cities (because sometimes users enter city names with one or two character different)... I have created a query which is working fine but I need such a query to match more.
Please suggest me what to do in such a situation.
SELECT distinct hsm.countryname,co.countryname,hsm.city,co.city FROM HotelSourceMap AS hsm INNER JOIN ( SELECT c.*,cu.countryName FROM city c INNER JOIN country cu ON c.countryid= cu.countryId ) co ON (charindex(co.city,hsm.city) > 0 AND hsm.countryid = co.countryid) AND hsm.cityid is null
-
Dr. Rajesh Rolen over 14 yearshi hubens nice to see u after so many days.
-
-
Dr. Rajesh Rolen over 14 yearsi have created function Levenshtein as u said but its giving me error when i am using it Cannot find either column "dbo" or the user-defined function or aggregate "dbo.MIN3", or the name is ambiguous. i am using it in this way (as u said) SELECT hsm.* FROM HotelSourceMap hsm, city c WHERE dbo.LEVENSHTEIN(hsm.city, c.city) < 4
-
Daniel Vassallo over 14 yearsYes, I forgot to mention that. You need to define the MIN3 function. Use this small script: tek-tips.com/viewthread.cfm?qid=1194707. Call it MIN3 instead of fnMin3.
-
Dr. Rajesh Rolen over 14 yearsnow tell me which one i need to call fnmin3 or Levenshtein.. please explain clearly
-
Dr. Rajesh Rolen over 14 yearsits not matching correctly if i try < 4 ..... so now i am tring for <2 ...thanks a lot...
-
Dr. Rajesh Rolen over 14 yearsHow much time it will take to finish?
-
Daniel Vassallo over 14 yearsI'm glad it helped. You may need to tweak the threshold a bit, as you noted. You may also want to consider using a variable threshold based on the average length of the two cities, so that a long city name can have more typos than a short one.
-
Sharique almost 14 yearssoundex works well for single word but not for multi-word name like select SOUNDEX('new york'),SOUNDEX('new delhi')
-
Dr. Rajesh Rolen over 12 years@Mongus: Thanks, really very helpful for me.