get list of anagrams from a dictionary

19,808

Solution 1

The obvious solution is to map each character to a prime number and multiply the prime numbers. So if 'a'' -> 2 and 'b' -> 3, then

  • 'ab' -> 6
  • 'ba' -> 6
  • 'bab' -> 18
  • 'abba' -> 36
  • 'baba' -> 36

To minimise the chance of overflow, the smallest primes could be assigned to the more frequent letters (e,t,i,a,n). Note: The 26th prime is 101.

UPDATE: an implementation can be found here

Solution 2

One possible hash function could be (assuming english words only) a sorted count of the number of occurrences of each letter. So for "anagram" you would generate [('a', 3), ('g', 1), ('n', 1), ('m', 1), ('r',1)].

Alternatively you could get an inexact grouping by generating a bitmask from your word where for bits 0-25 each bit represented the presence or absence of that letter (bit 0 representing 'a' through to bit 25 representining 'z'). But then you'd have to do a bit more processing to split each hashed group further to distinguish e.g. "to" from "too".

Do either of these ideas help? Any particular implementation language in mind (I could do C++, python or Scala)?

Edit: added some example Scala code and output:

OK: I'm in Scala mode at the moment, so I've knocked something up to do what you ask, but (ahem) it may not be very clear if you're not that familiar with Scala or functional programming.

Using a big list of english words from here: http://scrapmaker.com/data/wordlists/twelve-dicts/2of12.txt

I run this Scala code on them (takes about 5 seconds using Scala 2.9 in script mode, including time to compile, with a dictionary of about 40,000 words. Not the most efficient code, but the first thing that came to mind).

// Hashing function to go from a word to a sorted list of letter counts
def toHash(b:String) = b.groupBy(x=>x).map(v => (v._1, v._2.size) ).toList.sortWith(_._1 < _._1)


// Read all words from file, one word per line
val lines = scala.io.Source.fromFile("2of12.txt").getLines

// Go from list of words to list of (hashed word, word)
val hashed = lines.map( l => (toHash(l), l) ).toList

// Group all the words by hash (hence group all anagrams together)
val grouped = hashed.groupBy( x => x._1 ).map( els => (els._1, els._2.map(_._2)) )

// Sort the resultant anagram sets so the largest come first
val sorted = grouped.toList.sortWith( _._2.size > _._2.size )

for ( set <- sorted.slice(0, 10) )
{
    println( set._2 )
}

This dumps out the first 10 sets of anagrams (sets with the most members first) being:

List(caret, cater, crate, react, trace)
List(reins, resin, rinse, risen, siren)
List(luster, result, rustle, sutler, ulster)
List(astir, sitar, stair, stria, tarsi)
List(latrine, ratline, reliant, retinal)
List(caper, crape, pacer, recap)
List(merit, miter, remit, timer)
List(notes, onset, steno, stone)
List(lair, liar, lira, rail)
List(drawer, redraw, reward, warder)

Note that this uses the first suggestion (list of counts of letters) not the more complicated bitmask method.

Edit 2: You can replace the hash function with a simple sort on the chars of each word (as suggested by JAB) and get the same result with clearer/faster code:

def toHash(b:String) = b.toList.sortWith(_<_)

Solution 3

If you XOR the hash-code values of each character, and then XOR the result by the input length, you will get the same value regardless of the order of the word, meaning that all anagrams will produce the same hash. (XORing by the length prevents 'boss' and 'bo' from returning the same value, because the hash of the 's' against itself is always 0.)

Example:

int AnagramHash(string input)
{
    int output = 0;

    foreach(char c in input)
        output ^= c.GetHashCode();

    return output ^ input.Length;
}

You will still have to search for all words with the same AnagramHash. I would update the dictionary table with a field for the hash (regardless of your algorithm) to reduce overall computation.

EDIT: Also, as a side note, XOR is the simplest operation performed by the ALU so if you do end up using it, you should be able to generate your hashes fairly quickly.

Share:
19,808
vijay
Author by

vijay

Android Developer

Updated on June 26, 2022

Comments

  • vijay
    vijay about 2 years

    Basically, Anagrams are like permutation of string.E.g stack ,sackt ,stakc all are anagrams of stack (thought above words aren't meaningful). Anyways you could have understood what I basically meant.

    Now, I want a list of anagrams given million words or simply say from a dictionary.

    My basic question is Find total number of unique anagrams in a dictionary?

    Sorting and comparing won't work as it's time complexity is pretty bad.

    I thought of using hash table, string as key.

    But the problem is what should be the hash function ? It would be helpful if some pseudocode provided. Some other approaches better than mentioned approaches would also be helpful.

    Thanks.