What algorithm gives suggestions in a spell checker?

58,956

Solution 1

There is good essay by Peter Norvig how to implement a spelling corrector. It's basicly a brute force approach trying candidate strings with a given edit distance. (Here are some tips how you can improve the spelling corrector performance using a Bloom Filter and faster candidate hashing.)

The requirements for a spell checker are weaker. You have only to find out that a word is not in the dictionary. You can use a Bloom Filter to build a spell checker which consumes less memory. An ancient versions is decribed in Programming Pearls by Jon Bentley using 64kb for an english dictionary.

A BK-Tree is an alternative approach. A nice article is here.

Levenshstein distance is not exactly the right edit distance for a spell checker. It knows only insertion, deletion and substitution. Transposition is missing and produces 2 for a transposition of 1 character (it's 1 delete and 1 insertion). Damerau–Levenshtein distance is the right edit distance.

Solution 2

An approach for generating suggestions that I've used successfully but never seen described anywhere is to pre-compute suggestions (when building the dictionary) by using "bad" hash functions.

The idea is to look at the types of spelling errors people make, and to design hash functions that would assign an incorrect spelling to the same bucket as its correct spelling.

For example, a common mistake is to use the wrong vowel, like definate instead of definite. So you design a hash function that treats all vowels as the same letter. An easy way to do that is to first "normalize" the input word and then put the normalized result through a regular hash function. In this example, the normalizing function might drop all the vowels, so definite becomes dfnt. The "normalized" word is then hashed with a typical hash function.

Insert all of your dictionary words into an auxiliary index (hash table) using this special hash function. The buckets in this table will have longish collision lists because the hash function is "bad", but those collision lists are essentially pre-computed suggestions.

Now, when you find a misspelled word, you look up the collision lists for the bucket that the misspelling maps to in the auxiliary indexes. Ta da: You have a suggestion list! All you have to do is rank the words on it.

In practice, you'll need a few auxiliary indexes with other hash functions to handle other types of errors, like transposed letters, single/double letter, and even a simplistic Soundex-like one to catch phonetic misspellings. In practice, I found simplistic pronunciation ones to go a long way and essentially obsolete some of the ones designed to find trivial typos.

So now you look up misspellings in each of the auxiliary indexes and concatenate the collision lists before ranking.

Remember the collision lists contain only words that are in the dictionary. With approaches that try to generate alternate spellings (as in the Peter Norvig article), you can get (tens of) thousands of candidates that you first have to filter against the dictionary. With the pre-computed approach, you get maybe a couple hundred candidates, and you know that they're all correctly spelled, so you can skip straight to ranking.

Update: I've since found one algorithm description that's similar to this, the FAROO Distributed Search. This is still an edit-distance limited search, but it's very fast because the pre-calculation step works like my "bad hash functions" idea. FAROO just uses a limited concept of a bad hash function.

Solution 3

Algorithm

  1. Take a wrongly spelled word as input.
  2. Store the list of English words along with their frequencies in a text file.
  3. Insert all the available English words (stored in the text file) along with their frequencies (measure of how frequently a word is used in English language) in a Ternary Search Tree.
  4. Now traverse along the Ternary Search Tree -
    • For each word encountered in the Ternary Search Tree, calculate its Levensthein Distance from the wrongly spelled word.
    • If Levensthein Distance <= 3, store the word in a Priority Queue.
    • If two words have same edit distance, the one with higher frequency is grater. Print the top 10 items from Priority Queue.

Optimization

  1. You can eleminate the words in subtree of current node if the edit distance of substring of input word from the current word is grater than the 3.
    You can find the more detailed explanation and source code on github project.

Solution 4

You don't need to know exact edit distance for each word in dictionary. You can stop the algorithm after reaching a limit value and exclude the word. This will save you a lot of computing time.

Solution 5

Spell checker is very easy to implement as in Unix spell program. The source code is available in public. The correction can be involved, one technique is to do edits and again check if this new word is in the dictionary. Such new edits can be grouped and shown to the user.

Unix system uses a program written by Mc IllRoy. An alternative way is to use a Trie which can be useful in the case of huge files.

The unix approach need very less space for a huge dictionary since it uses scatter hash algorithm.

Share:
58,956
Mithrax
Author by

Mithrax

Updated on March 16, 2020

Comments

  • Mithrax
    Mithrax about 4 years

    What algorithm is typically used when implementing a spell checker that is accompanied with word suggestions?

    At first I thought it might make sense to check each new word typed (if not found in the dictionary) against it's Levenshtein distance from every other word in the dictionary and returning the top results. However, this seems like it would be highly inefficient, having to evaluate the entire dictionary repeatedly.

    How is this typically done?

  • NoozNooz42
    NoozNooz42 almost 14 years
    +1 for the relatively unknown BK-Tree reference. That's how companies like Google, working with Real-World [TM] amount of data, are doing it.
  • mistertodd
    mistertodd about 10 years
    There is a much better explanation of BK-Trees here.
  • Wolf Garbe
    Wolf Garbe about 8 years
    Thanks for referencing Faroos's SymSpell algorithm. While both algorithms are pre-calculating possible typos and using a hash table for fast lookup, the main difference is that SymSpell guarantees to detect all possible spelling errors up to a certain edit distance (in this respect SymSpell is equivalent to Peter Norvig’s algorithm, just 3..6 orders of magnitude faster), while your algorithm is using a heuristic approach which will detect only a limited subset of all theoretically possible spelling errors (therefore your pre-calculation cost might be lower).
  • Adrian McCarthy
    Adrian McCarthy about 8 years
    The SymSpell algorithm explicitly pre-calculates and stores possible typos, my "bad hash" scheme does not. For English, it's trivial to add just one simplistic phonetic hash that covers a lot of ground (e.g., "terradacktle" -> "pterodactyl," which has an edit distance of 6). Granted, if you need multi-lingual lookups, then it might be much harder.
  • Wolf Garbe
    Wolf Garbe about 8 years
    Absolutely, by exploiting empirical knowledge about likely typos (and restricting to those) you save pre-calculation time & space. But to cover all possible spelling errors SymSpell needs to pre-calculate only a tiny fraction of them. A 5 letter word has about 3 million possible spelling errors within a maximum edit distance of 3, but with SymSpell you need to pre-calculate & store only 25 deletes. This is important for fuzzy/similarity search beyond spelling correction where no empirical knowledge exists.
  • Tony Brasunas
    Tony Brasunas almost 7 years
    Hmm, the Levenshtein distance from 'grater' to 'greater' in this case wouldn't be enough, as 'grater' is also a dictionary word. ;-)
  • amarjeetAnand
    amarjeetAnand almost 7 years
    @TonyBrasunas , Yes you are right. But the program will actually return a list of 10 words in case of 'grater' as input and it will suggest 'grater' with edit distance of 0 and also 'greater' with edit distance of 1. Which might be of some help. ;)
  • speedplane
    speedplane about 6 years
    If one candidate has a distance of 2, but is extremely frequent, and another candidate has a distance of 1 but is extremely rare, how do you rank the two? In the above approach, the rare item would always win, is this the right result?
  • amarjeetAnand
    amarjeetAnand about 6 years
    @speedplane Yes. the rare one will win. And i think it's the right result. Becasue what we expect is the closest word, based on the spelling of input word. If you are still in doubt, think this way --- suppose there is a rare word which user spelled correctly. Now its distance is 0 but frequency very low. Now in suggestions, we should list this rare word(with distance 0) at the top(because least edit distance wins) and other words with distance 1-2-3, below.