How to find similar results and sort by similarity?

82,222

Solution 1

I have found out that the Levenshtein distance may be good when you are searching a full string against another full string, but when you are looking for keywords within a string, this method does not return (sometimes) the wanted results. Moreover, the SOUNDEX function is not suitable for languages other than english, so it is quite limited. You could get away with LIKE, but it's really for basic searches. You may want to look into other search methods for what you want to achieve. For example:

You may use Lucene as search base for your projects. It's implemented in most major programming languages and it'd quite fast and versatile. This method is probably the best, as it not only search for substrings, but also letter transposition, prefixes and suffixes (all combined). However, you need to keep a separate index (using CRON to update it from a independent script once in a while works though).

Or, if you want a MySQL solution, the fulltext functionality is pretty good, and certainly faster than a stored procedure. If your tables are not MyISAM, you can create a temporary table, then perform your fulltext search :

CREATE TABLE IF NOT EXISTS `tests`.`data_table` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `title` varchar(2000) CHARACTER SET latin1 NOT NULL,
  `description` text CHARACTER SET latin1 NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 COLLATE=utf8_bin AUTO_INCREMENT=1 ;

Use a data generator to generate some random data if you don't want to bother creating it yourself...

** NOTE ** : the column type should be latin1_bin to perform a case sensitive search instead of case insensitive with latin1. For unicode strings, I would recommend utf8_bin for case sensitive and utf8_general_ci for case insensitive searches.

DROP TABLE IF EXISTS `tests`.`data_table_temp`;
CREATE TEMPORARY TABLE `tests`.`data_table_temp`
   SELECT * FROM `tests`.`data_table`;

ALTER TABLE `tests`.`data_table_temp`  ENGINE = MYISAM;

ALTER TABLE `tests`.`data_table_temp` ADD FULLTEXT `FTK_title_description` (
  `title` ,
  `description`
);

SELECT *,
       MATCH (`title`,`description`)
       AGAINST ('+so* +nullam lorem' IN BOOLEAN MODE) as `score`
  FROM `tests`.`data_table_temp`
 WHERE MATCH (`title`,`description`)
       AGAINST ('+so* +nullam lorem' IN BOOLEAN MODE)
 ORDER BY `score` DESC;

DROP TABLE `tests`.`data_table_temp`;

Read more about it from the MySQL API reference page

The downside to this is that it will not look for letter transposition or "similar, sounds like" words.

** UPDATE **

Using Lucene for your search, you will simply need to create a cron job (all web hosts have this "feature") where this job will simply execute a PHP script (i.g. "cd /path/to/script; php searchindexer.php") that will update the indexes. The reason being that indexing thousands of "documents" (rows, data, etc.) may take several seconds, even minutes, but this is to ensure that all searches are performed as fast as possible. Therefore, you may want to create a delay job to be run by the server. It may be overnight, or in the next hour, this is up to you. The PHP script should look something like this:

$indexer = Zend_Search_Lucene::create('/path/to/lucene/data');

Zend_Search_Lucene_Analysis_Analyzer::setDefault(
  // change this option for your need
  new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive()
);

$rowSet = getDataRowSet();  // perform your SQL query to fetch whatever you need to index
foreach ($rowSet as $row) {
   $doc = new Zend_Search_Lucene_Document();
   $doc->addField(Zend_Search_Lucene_Field::text('field1', $row->field1, 'utf-8'))
       ->addField(Zend_Search_Lucene_Field::text('field2', $row->field2, 'utf-8'))
       ->addField(Zend_Search_Lucene_Field::unIndexed('someValue', $someVariable))
       ->addField(Zend_Search_Lucene_Field::unIndexed('someObj', serialize($obj), 'utf-8'))
  ;
  $indexer->addDocument($doc);
}

// ... you can get as many $rowSet as you want and create as many documents
// as you wish... each document doesn't necessarily need the same fields...
// Lucene is pretty flexible on this

$indexer->optimize();  // do this every time you add more data to you indexer...
$indexer->commit();    // finalize the process

Then, this is basically how you search (basic search) :

$index = Zend_Search_Lucene::open('/path/to/lucene/data');

// same search options
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
   new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive()
);

Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('utf-8');

$query = 'php +field1:foo';  // search for the word 'php' in any field,
                                 // +search for 'foo' in field 'field1'

$hits = $index->find($query);

$numHits = count($hits);
foreach ($hits as $hit) {
   $score = $hit->score;  // the hit weight
   $field1 = $hit->field1;
   // etc.
}

Here are great sites about Lucene in Java, PHP, and .Net.

In conclusion each search methods have their own pros and cons :

  • You mentioned Sphinx search and it looks very good, as long as you can make the deamon run on your web host.
  • Zend Lucene requires a cron job to re-index the database. While it is quite transparent to the user, this means that any new data (or deleted data!) is not always in sync with the data in your database and therefore won't show up right away on user search.
  • MySQL FULLTEXT search is good and fast, but will not give you all the power and flexibility of the first two.

Please feel free to comment if I have forgotten/missed anything.

Solution 2

1. Similarity

For Levenshtein in MySQL I found this, from www.codejanitor.com/wp/2007/02/10/levenshtein-distance-as-a-mysql-stored-function

SELECT 
    column, 
    LEVENSHTEIN(column, 'search_string') AS distance 
FROM table 
WHERE 
    LEVENSHTEIN(column, 'search_string') < distance_limit
ORDER BY distance DESC

2. Containing, case insensitive

Use the LIKE statement of MySQL, which is case insensitive by default. The % is a wildcard, so there may be any string before and after search_string.

SELECT 
    *
FROM 
    table
WHERE 
    column_name LIKE "%search_string%"

3. Containing, case sensitive

The MySQL Manual helps:

The default character set and collation are latin1 and latin1_swedish_ci, so nonbinary string comparisons are case insensitive by default. This means that if you search with col_name LIKE 'a%', you get all column values that start with A or a. To make this search case sensitive, make sure that one of the operands has a case sensitive or binary collation. For example, if you are comparing a column and a string that both have the latin1 character set, you can use the COLLATE operator to cause either operand to have the latin1_general_cs or latin1_bin collation...

My MySQL setup does not support latin1_general_cs or latin1_bin, but it worked fine for me to use the collation utf8_bin as binary utf8 is case sensitive:

SELECT 
    *
FROM 
    table
WHERE 
    column_name LIKE "%search_string%" COLLATE utf8_bin

2. / 3. sorted by Levenshtein Distance

SELECT 
    column, 
    LEVENSHTEIN(column, 'search_string') AS distance // for sorting
FROM table 
WHERE 
    column_name LIKE "%search_string%"
    COLLATE utf8_bin // for case sensitivity, just leave out for CI
ORDER BY
    distance
    DESC

Solution 3

It seems that your definition of similarity is semantic similarity. So in order to build such a similarity function, you should use semantic similarity measures. Note that the scope of work on the issue might vary from few hours to years so it is recommended to decide on the scope before getting into work. I didn’t figure out which data do you have in order to build the similarity relation. I assume the you have access the a dataset of documents and a dataset of queries. You can start with co-occurrence of the words (e.g., conditional probability). You will discover quickly that you get the list of stop words as related the most of the words simply because they are very popular. Using the lift of conditional probability will take care of the stop words but will make the relation prone to error in small number (most of your cases). You might try Jacard but since it is symmetric there will be many relations it won't find. Then you might consider relations that appear only in short distance from the base word. You can (and should) consider relations base on general corpus's (e.g., Wikipedia) and user specific (e.g., his emails).

Very shortly you will have plenty of similarity measures, when all the measures are good and have some advantage over the others.

In order to combine such measures, I like to reduce the problem into a classification problem.

You should build a data set of paris of words and label them as "is related". In order to build a large labeled dataset you can:

  • Use sources of known related words (e.g., good old Wikipedia categories) for positives
  • Most of the word not known as related are not related.

Then use all the measures you have as features of the pairs. Now you are in the domain of supervised classification problem. Build a classifier on the data set, evaluated according to your needs and get a similarity measure that fits your needs.

Share:
82,222
Robin Rodricks
Author by

Robin Rodricks

Updated on July 05, 2022

Comments

  • Robin Rodricks
    Robin Rodricks almost 2 years

    How do I query for records ordered by similarity?

    Eg. searching for "Stock Overflow" would return

    1. Stack Overflow
    2. SharePoint Overflow
    3. Math Overflow
    4. Politic Overflow
    5. VFX Overflow

    Eg. searching for "LO" would return:

    1. pabLO picasso
    2. michelangeLO
    3. jackson polLOck

    What I need help with:

    1. Using a search engine to index & search a MySQL table, for better results

      • Using the Sphinx search engine, with PHP

      • Using the Lucene engine with PHP

    2. Using full-text indexing, to find similar/containing strings


    What does not work well

    • Levenshtein distance is very erratic. (UDF, Query)
      Searching for "dog" gives me:
      1. dog
      2. bog
      3. ago
      4. big
      5. echo
    • LIKE returns better results, but returns nothing for long queries although similar strings do exist
      1. dog
      2. dogid
      3. dogaral
      4. dogma
  • opatut
    opatut over 13 years
    How do you define similarity when checking if the searched string occurs in the column? There are 2 possibilities: TRUE and FALSE, nothing in between. You could actually get a factor by dividing the string length of your search string by the string length of the column, but you will always get the shortest string on top -- Do you want to sort by number of occurences in the actual column? Why not fulltext search?
  • Robin Rodricks
    Robin Rodricks over 13 years
    No, I meant to say can you search using #2 and #3 and sort by similarity using Levenshtein or similar? So you get results that are most similar on the top .. see the examples given in my question.
  • Yanick Rochon
    Yanick Rochon over 13 years
    I've added the case sensitive/insensitive portion of your question, however I'm afraid that an SQL only solution would not be as good as a Lucene solution. But this is only IMHO. Maybe someday, someone will implement a Lucene search function for MySQL and, frankly, I would LOVE to see that day, but meanwhile, this is as a best solution I can find right now.
  • opatut
    opatut over 13 years
    there you go, but i don't think sorting by Levenshtein makes sense when using LIKE. Why would you sort like this in your example (1. Adopt / 2. Adore / 3. Adorn)? With levenshtein, they have the same value (3, as you always have to add 3 characters)
  • Robin Rodricks
    Robin Rodricks over 13 years
    The MySQL Dam-Lev implementation was good, but the results it produces is quite erratic since the Lev philosophy is "measure edits" instead of "measure difference" .. please see my updated question above.
  • Michael Clerx
    Michael Clerx over 13 years
    I'll echo that. A mysql only solution's not going to be available any time soon.
  • Robin Rodricks
    Robin Rodricks over 13 years
    Can you help me out with Lucene? How do I get started with it to query for records that by similarity? Something like a search engine? I would give you the bounty if you can show me how to make it work.
  • Yanick Rochon
    Yanick Rochon over 13 years
    What language will you be using for Lucene? Java, PHP, ...?
  • Robin Rodricks
    Robin Rodricks over 13 years
    PHP with a simple MySQL database, locally hosted using XAMPP server. Do you know about Sphinx aswell? Would it be better suited to my requirements than Lucene? (sphinxsearch.com) Anyways I would appreciate it if you show me how to use Lucene for this project.
  • Yanick Rochon
    Yanick Rochon over 13 years
    Sphynx looks pretty good. You can find information about Lucene from Zend's website (you don't need the whole Zend Framework structure to use the Zend_Search_Lucene classes) everything is pretty much detailed. If you don't want to bother with Zend, Sphynx looks quite nice too! and it doesn't seem to require an overhead of keeping a separate index of your data.... I'll dig further myself about that one. Thanks for sharing this. :) Good luck!
  • Robin Rodricks
    Robin Rodricks over 13 years
    I got this tutorial for Zend_Search_Lucene and PHP. Any more help you can offer me with the specifics of using Lucene and stuff for my purpose would let me award you the bounty. See updated question. devzone.zend.com/article/91
  • Robin Rodricks
    Robin Rodricks over 13 years
    Thank you very much Yanick! Your answers are fantastic but I need help with a few more things: 1) Can you show me a simple MySQL query with fulltext columns to search for similar records? See my question. 2) What is the Lucene query string to search for similar records, with the most relevant 'matching' or 'containing' records on the top, and 'similar' or 'alike' records below that.
  • Yanick Rochon
    Yanick Rochon over 13 years
    Wasn't the first example about FULLTEXT search clear? FULLTEXT cannot search by similarity, the manual is pretty clear about what it can and cannot search. The BOOLEAN mode makes a second/third/etc. pass adding data to the query string from the results found in the last pass. As for Lucene it does that by default. You must read the manual to customize your query for you need. I would encourage you to play with it and learn the API yourself. From my point of view, I pretty much covered everything you asked in your question. No?
  • Robin Rodricks
    Robin Rodricks over 13 years
    Done. I'll look into Sphinx/Lucene. Thanks a million.
  • Walter Schrabmair
    Walter Schrabmair about 4 years
    @opatut Yes Levenshtein is a good choice. But how can i make a finding of the min of Levenshtein Distance when I have one set of strings which want to match against one other set of strings?