Implementation of Levenshtein distance for mysql/fuzzy search?

47,490

Solution 1

In order to efficiently search using levenshtein distance, you need an efficient, specialised index, such as a bk-tree. Unfortunately, no database system I know of, including MySQL, implements bk-tree indexes. This is further complicated if you're looking for full-text search, instead of just a single term per row. Off-hand, I can't think of any way that you could do full-text indexing in a manner that allows for searching based on levenshtein distance.

Solution 2

There is a mysql UDF implementation of Levenshtein Distance function

https://github.com/jmcejuela/Levenshtein-MySQL-UDF

It is implemented in C and has better performance than the "MySQL Levenshtein distance query" mentioned by schnaader

Solution 3

An implementation for the damerau-levenshtein distance can be found here: Damerau-Levenshtein algorithm: Levenshtein with transpositions The improvement over pure Levenshtein distance is that the swapping of characters is considered. I found it in the comments of schnaader's link, thanks!

Solution 4

The function given for levenshtein <= 1 above is not right -- it gives incorrect results for e.g., "bed" and "bid".

I modified the "MySQL Levenshtein distance query" given above, in the first answer, to accept a "limit" that will speed it up a little. Basically, if you only care about Levenshtein <= 1, set the limit to "2" and the function will return the exact levenshtein distance if it is 0 or 1; or a 2 if the exact levenshtein distance is 2 or greater.

This mod makes it 15% to 50% faster -- the longer your search word, the bigger the advantage (because the algorithm can bail earlier.) For instance, on a search against 200,000 words to find all matches within distance 1 of the word "giggle," the original takes 3 min 47 sec on my laptop, versus 1:39 for the "limit" version. Of course, these are both too slow for any real-time use.

Code:

DELIMITER $$
CREATE FUNCTION levenshtein_limit_n( s1 VARCHAR(255), s2 VARCHAR(255), n INT) 
  RETURNS INT 
  DETERMINISTIC 
  BEGIN 
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost, c_min INT; 
    DECLARE s1_char CHAR; 
    -- max strlen=255 
    DECLARE cv0, cv1 VARBINARY(256); 
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0, c_min = 0; 
    IF s1 = s2 THEN 
      RETURN 0; 
    ELSEIF s1_len = 0 THEN 
      RETURN s2_len; 
    ELSEIF s2_len = 0 THEN 
      RETURN s1_len; 
    ELSE 
      WHILE j <= s2_len DO 
        SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1; 
      END WHILE; 
      WHILE i <= s1_len and c_min < n DO -- if actual levenshtein dist >= limit, don't bother computing it
        SET s1_char = SUBSTRING(s1, i, 1), c = i, c_min = i, cv0 = UNHEX(HEX(i)), j = 1; 
        WHILE j <= s2_len DO 
          SET c = c + 1; 
          IF s1_char = SUBSTRING(s2, j, 1) THEN  
            SET cost = 0; ELSE SET cost = 1; 
          END IF; 
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost; 
          IF c > c_temp THEN SET c = c_temp; END IF; 
            SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1; 
            IF c > c_temp THEN  
              SET c = c_temp;  
            END IF; 
            SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
            IF c < c_min THEN
              SET c_min = c;
            END IF; 
        END WHILE; 
        SET cv1 = cv0, i = i + 1; 
      END WHILE; 
    END IF;
    IF i <= s1_len THEN -- we didn't finish, limit exceeded    
      SET c = c_min; -- actual distance is >= c_min (i.e., the smallest value in the last computed row of the matrix) 
    END IF;
    RETURN c;
  END$$

Solution 5

Based on Chella's answer and Ryan Ginstrom's article, a fuzzy search could be implemented as so:

DELIMITER $$
CREATE FUNCTION fuzzy_substring( s1 VARCHAR(255), s2 VARCHAR(255) )
    RETURNS INT
    DETERMINISTIC
BEGIN
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
    DECLARE s1_char CHAR;
    -- max strlen=255
    DECLARE cv0, cv1 VARBINARY(256);
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
    IF s1 = s2 THEN
        RETURN 0;
    ELSEIF s1_len = 0 THEN
        RETURN s2_len;
    ELSEIF s2_len = 0 THEN
        RETURN s1_len;
    ELSE
        WHILE j <= s2_len DO
            SET cv1 = CONCAT(cv1, UNHEX(HEX(0))), j = j + 1;
        END WHILE;
        WHILE i <= s1_len DO
            SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
            WHILE j <= s2_len DO
                SET c = c + 1;
                IF s1_char = SUBSTRING(s2, j, 1) THEN
                    SET cost = 0; ELSE SET cost = 1;
                END IF;
                SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
                IF c > c_temp THEN SET c = c_temp; END IF;
                    SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
                IF c > c_temp THEN
                    SET c = c_temp;
                END IF;
                SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
            END WHILE;
            SET cv1 = cv0, i = i + 1;
        END WHILE;
    END IF;
    SET j = 1;
    WHILE j <= s2_len DO
        SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10);
        IF c > c_temp THEN
            SET c = c_temp;
        END IF;
        SET j = j + 1;
    END WHILE;
    RETURN c;
END$$
DELIMITER ;
Share:
47,490
Andrew Clark
Author by

Andrew Clark

I have been developing professionally for over 10 years. I spent a fair amount of time as a full stack developer before focusing in on product and front end development.

Updated on July 09, 2022

Comments

  • Andrew Clark
    Andrew Clark almost 2 years

    I would like to be able to search a table as follows for smith as get everything that it within 1 variance.

    Data:

    O'Brien
    Smithe
    Dolan
    Smuth
    Wong
    Smoth
    Gunther
    Smiht
    

    I have looked into using Levenshtein distance does anyone know how to implement this with it?

  • Andrew Clark
    Andrew Clark about 15 years
    unfortunately this result in it being 10% slower. I have however implemented the string length, he proposes using string at max or smaller, I have implemented a compare on only string +/- 1 length.
  • max
    max over 11 years
    Sorry for the noob question but when I copy this to a text file leven, and then run \. leven, I get multiple errors from MySQL 5: ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server... near '' at line 4.
  • srayner
    srayner almost 9 years
    Is this fast enough for real time use, when searching say 200,000 records?
  • Hongzheng
    Hongzheng almost 9 years
    I am not sure what do you mean by real-time. On a test box with two Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz CPUs and 64G memory, the following queries finished in 0.30 seconds. 'select min(levenshtein(country, 'GC')) from countries;'. The countries table has a column country of 2 chars. And the table contains 1M rows +
  • Mark Fisher
    Mark Fisher over 7 years
    Do you mean at least 1?
  • AbcAeffchen
    AbcAeffchen over 7 years
    @MarkFisher No. it returns 1 (true) if the distance is lower or equal to 1.
  • talsibony
    talsibony over 5 years
    @Hongzheng its only on 2 letters try benchmarking with higher numbers
  • Pablo Pazos
    Pablo Pazos over 3 years
    @talsibony why not trying yourself?