php (fuzzy) search matching

10,256

Solution 1

Unfortunately, doing this in PHP is prohibitively expensive (high CPU and memory utilization.) However, you can certainly apply the algorithm to small data sets.

To specifically expand on how you can create a server meltdown: couple of built-in PHP functions will determine "distance" between strings: levenshtein and similar_text.

Dummy data: (pretend they're news headlines)

$titles = <<< EOF
Apple
Apples
Orange
Oranges
Banana
EOF;

$titles = explode("\n", $titles );

At this point, $titles should just be an array of strings. Now, create a matrix and compare each headline against EVERY other headline for similarity. In other words, for 5 headlines, you will get a 5 x 5 matrix (25 entries.) That's where the CPU and memory sink goes in.

That's why this method (via PHP) can't be applied to thousands of entries. But if you wanted to:

$matches = array();
foreach( $titles as $title ) {
    $matches[$title] = array();
    foreach( $titles as $compare_to ) {
        $matches[$title][$compare_to] = levenshtein( $compare_to, $title );
    }
    asort( $matches[$title], SORT_NUMERIC  );
}

At this point what you basically have is a matrix with "text distances." In concept (not in real data) it looks sort of like this table below. Note how there is a set of 0 values that go diagonally - that means that in the matching loop, two identical words are -- well, identical.

       Apple Apples Orange Oranges Banana
Apple    0     1      5      6       6
Apples   1     0      6      5       6
Orange   5     6      0      1       5
Oranges  6     5      1      0       5
Banana   6     6      5      5       0

The actual $matches array looks sort of like this (truncated):

Array
(
    [Apple] => Array
        (
            [Apple] => 0
            [Apples] => 1
            [Orange] => 5
            [Banana] => 6
            [Oranges] => 6
        )

    [Apples] => Array
        (
      ...

Anyhow, it's up to you to (by experimentation) determine what a good numerical distance cutoff might mostly match - and then apply it. Otherwise, read up on sphinx-search and use it - since it does have PHP libraries.

Orange you glad you asked about this?

Solution 2

I would suggest taking the users submitted URLs and storing them in multiple parts; domain name, path and query string. Use the PHP parse_url() function to derive the parts of the submitted URL.

Index at least the domain name and path. Then, when a new user submits URL you search your database for a record matching the domain and path. Since the columns are indexed, you will be filtering out first all records that are not in the same domain, and then searching through the remaining records. Depending on your dataset, this should be faster that simply indexing the entire URL. Make sure your WHERE clause is setup in the right order.

If that does not meet your needs I would suggest trying Sphinx. Sphinx is an open source SQL full text search engine that is far faster that MySQL's built in full-text search. It supports stemming and some other nice features.

http://sphinxsearch.com/

You could also take the title or text content of the users submission, run it through a function to generate keywords, and search the database for existing records with those or similar keywords.

Solution 3

You could (depending on the size of your dataset) use mySQL's FULLTEXT search, and look for item(s) that have a high score and are within a certain timeframe, and suggest this/these to the user.

More about score here: MySQL Fulltext Search Score Explained

Share:
10,256
chris
Author by

chris

Updated on July 26, 2022

Comments

  • chris
    chris almost 2 years

    if anyone has ever submitted a story to digg, it checks whether or not the story is already submitted, I assume by a fuzzy search.

    I would like to implement something similar and want to know if they are using a php class that is open source?

    Soundex isnt doing it, sentences/strings can be up to 250chars in length

  • Pete
    Pete almost 14 years
    Unfortunately, programming is lot to do with mathematics.
  • Parapluie
    Parapluie almost 4 years
    Nice theoretical explanation. You've shed some light on this!