Implementation of Levenshtein distance for mysql/fuzzy search?
Solution 1
In order to efficiently search using levenshtein distance, you need an efficient, specialised index, such as a bk-tree. Unfortunately, no database system I know of, including MySQL, implements bk-tree indexes. This is further complicated if you're looking for full-text search, instead of just a single term per row. Off-hand, I can't think of any way that you could do full-text indexing in a manner that allows for searching based on levenshtein distance.
Solution 2
There is a mysql UDF implementation of Levenshtein Distance function
https://github.com/jmcejuela/Levenshtein-MySQL-UDF
It is implemented in C and has better performance than the "MySQL Levenshtein distance query" mentioned by schnaader
Solution 3
An implementation for the damerau-levenshtein distance can be found here: Damerau-Levenshtein algorithm: Levenshtein with transpositions The improvement over pure Levenshtein distance is that the swapping of characters is considered. I found it in the comments of schnaader's link, thanks!
Solution 4
The function given for levenshtein <= 1 above is not right -- it gives incorrect results for e.g., "bed" and "bid".
I modified the "MySQL Levenshtein distance query" given above, in the first answer, to accept a "limit" that will speed it up a little. Basically, if you only care about Levenshtein <= 1, set the limit to "2" and the function will return the exact levenshtein distance if it is 0 or 1; or a 2 if the exact levenshtein distance is 2 or greater.
This mod makes it 15% to 50% faster -- the longer your search word, the bigger the advantage (because the algorithm can bail earlier.) For instance, on a search against 200,000 words to find all matches within distance 1 of the word "giggle," the original takes 3 min 47 sec on my laptop, versus 1:39 for the "limit" version. Of course, these are both too slow for any real-time use.
Code:
DELIMITER $$
CREATE FUNCTION levenshtein_limit_n( s1 VARCHAR(255), s2 VARCHAR(255), n INT)
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost, c_min INT;
DECLARE s1_char CHAR;
-- max strlen=255
DECLARE cv0, cv1 VARBINARY(256);
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0, c_min = 0;
IF s1 = s2 THEN
RETURN 0;
ELSEIF s1_len = 0 THEN
RETURN s2_len;
ELSEIF s2_len = 0 THEN
RETURN s1_len;
ELSE
WHILE j <= s2_len DO
SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
END WHILE;
WHILE i <= s1_len and c_min < n DO -- if actual levenshtein dist >= limit, don't bother computing it
SET s1_char = SUBSTRING(s1, i, 1), c = i, c_min = i, cv0 = UNHEX(HEX(i)), j = 1;
WHILE j <= s2_len DO
SET c = c + 1;
IF s1_char = SUBSTRING(s2, j, 1) THEN
SET cost = 0; ELSE SET cost = 1;
END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN
SET c = c_temp;
END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
IF c < c_min THEN
SET c_min = c;
END IF;
END WHILE;
SET cv1 = cv0, i = i + 1;
END WHILE;
END IF;
IF i <= s1_len THEN -- we didn't finish, limit exceeded
SET c = c_min; -- actual distance is >= c_min (i.e., the smallest value in the last computed row of the matrix)
END IF;
RETURN c;
END$$
Solution 5
Based on Chella's answer and Ryan Ginstrom's article, a fuzzy search could be implemented as so:
DELIMITER $$
CREATE FUNCTION fuzzy_substring( s1 VARCHAR(255), s2 VARCHAR(255) )
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR;
-- max strlen=255
DECLARE cv0, cv1 VARBINARY(256);
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF s1 = s2 THEN
RETURN 0;
ELSEIF s1_len = 0 THEN
RETURN s2_len;
ELSEIF s2_len = 0 THEN
RETURN s1_len;
ELSE
WHILE j <= s2_len DO
SET cv1 = CONCAT(cv1, UNHEX(HEX(0))), j = j + 1;
END WHILE;
WHILE i <= s1_len DO
SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
WHILE j <= s2_len DO
SET c = c + 1;
IF s1_char = SUBSTRING(s2, j, 1) THEN
SET cost = 0; ELSE SET cost = 1;
END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN
SET c = c_temp;
END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
END WHILE;
SET cv1 = cv0, i = i + 1;
END WHILE;
END IF;
SET j = 1;
WHILE j <= s2_len DO
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10);
IF c > c_temp THEN
SET c = c_temp;
END IF;
SET j = j + 1;
END WHILE;
RETURN c;
END$$
DELIMITER ;
Andrew Clark
I have been developing professionally for over 10 years. I spent a fair amount of time as a full stack developer before focusing in on product and front end development.
Updated on July 09, 2022Comments
-
Andrew Clark almost 2 years
I would like to be able to search a table as follows for smith as get everything that it within 1 variance.
Data:
O'Brien Smithe Dolan Smuth Wong Smoth Gunther Smiht
I have looked into using Levenshtein distance does anyone know how to implement this with it?
-
Andrew Clark about 15 yearsunfortunately this result in it being 10% slower. I have however implemented the string length, he proposes using string at max or smaller, I have implemented a compare on only string +/- 1 length.
-
max over 11 yearsSorry for the noob question but when I copy this to a text file
leven
, and then run\. leven
, I get multiple errors from MySQL 5:ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server... near '' at line 4
. -
srayner almost 9 yearsIs this fast enough for real time use, when searching say 200,000 records?
-
Hongzheng almost 9 yearsI am not sure what do you mean by real-time. On a test box with two Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz CPUs and 64G memory, the following queries finished in 0.30 seconds. 'select min(levenshtein(country, 'GC')) from countries;'. The countries table has a column country of 2 chars. And the table contains 1M rows +
-
Mark Fisher over 7 yearsDo you mean at least 1?
-
AbcAeffchen over 7 years@MarkFisher No. it returns 1 (true) if the distance is lower or equal to 1.
-
talsibony over 5 years@Hongzheng its only on 2 letters try benchmarking with higher numbers
-
Pablo Pazos over 3 years@talsibony why not trying yourself?