Algorithm to generate anagrams

48,954

Solution 1

Most of these answers are horribly inefficient and/or will only give one-word solutions (no spaces). My solution will handle any number of words and is very efficient.

What you want is a trie data structure. Here's a complete Python implementation. You just need a word list saved in a file named words.txt You can try the Scrabble dictionary word list here:

http://www.isc.ro/lists/twl06.zip

MIN_WORD_SIZE = 4 # min size of a word in the output

class Node(object):
    def __init__(self, letter='', final=False, depth=0):
        self.letter = letter
        self.final = final
        self.depth = depth
        self.children = {}
    def add(self, letters):
        node = self
        for index, letter in enumerate(letters):
            if letter not in node.children:
                node.children[letter] = Node(letter, index==len(letters)-1, index+1)
            node = node.children[letter]
    def anagram(self, letters):
        tiles = {}
        for letter in letters:
            tiles[letter] = tiles.get(letter, 0) + 1
        min_length = len(letters)
        return self._anagram(tiles, [], self, min_length)
    def _anagram(self, tiles, path, root, min_length):
        if self.final and self.depth >= MIN_WORD_SIZE:
            word = ''.join(path)
            length = len(word.replace(' ', ''))
            if length >= min_length:
                yield word
            path.append(' ')
            for word in root._anagram(tiles, path, root, min_length):
                yield word
            path.pop()
        for letter, node in self.children.iteritems():
            count = tiles.get(letter, 0)
            if count == 0:
                continue
            tiles[letter] = count - 1
            path.append(letter)
            for word in node._anagram(tiles, path, root, min_length):
                yield word
            path.pop()
            tiles[letter] = count

def load_dictionary(path):
    result = Node()
    for line in open(path, 'r'):
        word = line.strip().lower()
        result.add(word)
    return result

def main():
    print 'Loading word list.'
    words = load_dictionary('words.txt')
    while True:
        letters = raw_input('Enter letters: ')
        letters = letters.lower()
        letters = letters.replace(' ', '')
        if not letters:
            break
        count = 0
        for word in words.anagram(letters):
            print word
            count += 1
        print '%d results.' % count

if __name__ == '__main__':
    main()

When you run the program, the words are loaded into a trie in memory. After that, just type in the letters you want to search with and it will print the results. It will only show results that use all of the input letters, nothing shorter.

It filters short words from the output, otherwise the number of results is huge. Feel free to tweak the MIN_WORD_SIZE setting. Keep in mind, just using "astronomers" as input gives 233,549 results if MIN_WORD_SIZE is 1. Perhaps you can find a shorter word list that only contains more common English words.

Also, the contraction "I'm" (from one of your examples) won't show up in the results unless you add "im" to the dictionary and set MIN_WORD_SIZE to 2.

The trick to getting multiple words is to jump back to the root node in the trie whenever you encounter a complete word in the search. Then you keep traversing the trie until all letters have been used.

Solution 2

For each word in the dictionary, sort the letters alphabetically. So "foobar" becomes "abfoor."

Then when the input anagram comes in, sort its letters too, then look it up. It's as fast as a hashtable lookup!

For multiple words, you could do combinations of the sorted letters, sorting as you go. Still much faster than generating all combinations.

(see comments for more optimizations and details)

Solution 3

See this assignment from the University of Washington CSE department.

Basically, you have a data structure that just has the counts of each letter in a word (an array works for ascii, upgrade to a map if you want unicode support). You can subtract two of these letter sets; if a count is negative, you know one word can't be an anagram of another.

Solution 4

Pre-process:

Build a trie with each leaf as a known word, keyed in alphabetical order.

At search time:

Consider the input string as a multiset. Find the first sub-word by traversing the index trie as in a depth-first search. At each branch you can ask, is letter x in the remainder of my input? If you have a good multiset representation, this should be a constant time query (basically).

Once you have the first sub-word, you can keep the remainder multiset and treat it as a new input to find the rest of that anagram (if any exists).

Augment this procedure with memoization for faster look-ups on common remainder multisets.

This is pretty fast - each trie traversal is guaranteed to give an actual subword, and each traversal takes linear time in the length of the subword (and subwords are usually pretty darn small, by coding standards). However, if you really want something even faster, you could include all n-grams in your pre-process, where an n-gram is any string of n words in a row. Of course, if W = #words, then you'll jump from index size O(W) to O(W^n). Maybe n = 2 is realistic, depending on the size of your dictionary.

Solution 5

One of the seminal works on programmatic anagrams was by Michael Morton (Mr. Machine Tool), using a tool called Ars Magna. Here is a light article based on his work.

Share:
48,954
prakash
Author by

prakash

Dad, Dreamer, Developer

Updated on July 09, 2022

Comments

  • prakash
    prakash almost 2 years

    What would be the best strategy to generate anagrams.

    An anagram is a type of word play, the result of rearranging the letters
    of a word or phrase to produce a new  word or phrase, using all the original
    letters exactly once; 
    ex.
    
    • Eleven plus two is anagram of Twelve plus one
    • A decimal point is anagram of I'm a dot in place
    • Astronomers is anagram of Moon starers

    At first it looks straightforwardly simple, just to jumble the letters and generate all possible combinations. But what would be the efficient approach to generate only the words in dictionary.

    I came across this page, Solving anagrams in Ruby.

    But what are your ideas?

    • yellahd
      yellahd over 15 years
      settles back in anticipation..! If you need the output to be a clue for the original phrase, I don't really see how you could 'generate' it. Surely all you can do is generate a list of phrases/anagram pairings and pick from them? How could an algorithm understand astronomers=moon starers, eg?
    • Vinko Vrsalovic
      Vinko Vrsalovic over 15 years
      Of course generating GOOD anagrams is a hard problem, but generating bad anagrams is easier :)
  • Serafina Brocious
    Serafina Brocious over 15 years
    It seems like this (along with Jimmy's answer) would only work for a single word anagram -- how can this be applied to anagrams of phrases?
  • Jason Cohen
    Jason Cohen over 15 years
    As I said in the post, for multiple words you could examine all pairs, triples, etc., in each case combining the letters and sorting (and use mergesort so that op is faster!) and test against that combo. You could be smarter still, e.g. bitfields of chars used at all and...
  • Jason Cohen
    Jason Cohen over 15 years
    ...obviously the total number of characters, so when I say "test all triples" there are massive categories you can prune. For example, store words first by length, then hash. So in your pairs/triples you're already able to skip combos with wrong numbers of characters.
  • Mats Fredriksson
    Mats Fredriksson over 15 years
    sorting the characters first doesn't help at all. It might give you one or two, but you need to test all combinations and then reject them. One way would be to generate all possible triplets and then compare them to the first three letters of all words from a dictionary.
  • Boti
    Boti over 15 years
    sorting does help -- it's the simplest way (aside from say, using .NET HashSet<T> or Python set()) to map a ordered list of letters to an unordered list.
  • Mats Fredriksson
    Mats Fredriksson over 15 years
    ok, fair enough, it speeds up things in that the anagrams of "foobar" and "barfoo" will resolve to the same result set, but if you are going to get all anagrams from just one sentence, then sorting doesn't help you since you need to consider all characters available.
  • Boti
    Boti over 15 years
    its a matter of reverse lookup, too. once you have your big list of letters ("astronomers"), you find a list of sorted substrings ("mno" + "aat" + "sors", or "mnoo"+"aerrsst" for example) so you can look it up in the lookup table you gener
  • Zach Langley
    Zach Langley over 15 years
    @Jason, bitwise operations won't work because a letter may appear more than once in the String. If you use OR, these duplicate letters won't be counted, and if you use addition, there will be collisions.
  • vinc456
    vinc456 over 15 years
    Don't know why you were modded down but Column 2 of Programming Pearls walks through an implementation of a program that finds all sets of anagrams given a dictionary of words. Definately worth a look. Compile and run the code as follows: ./sign <somedictionary| sort | ./squash
  • Osama Al-Maadeed
    Osama Al-Maadeed over 15 years
    working with the counts makes it simple combination problem. you have a map for the search phrase, and match it to combinations of word maps with the same sum of counts. This is an elegant solution.
  • FogleBird
    FogleBird over 14 years
    Inefficient for multiple words. See my answer.
  • Tony Veijalainen
    Tony Veijalainen about 13 years
    Only 16818 anagrams for astronomers with word length 1 from my anagram program, as it does not give out permutations. Running time around 2 s to produce the results with my AMD Sempron humble computer. I save the results to file, it is more usefull than flood of words to text console. I do not use tree structures but plain text with recursion matching the keys from dictionary hashed with sorted letters keys.
  • IgorGanapolsky
    IgorGanapolsky about 12 years
    The big question here is: what do the dictionary keys look like? I would make them sorted strings. And values would be all possible anagrams of the keys.
  • IgorGanapolsky
    IgorGanapolsky about 12 years
    Why such a complicated dictionary... prime numbers, pre-processing, multimap? Just make your dictionary keys to be sorted strings.
  • Tony Veijalainen
    Tony Veijalainen almost 12 years
    I have posted my previously code in DaniWeb as daniweb.com/software-development/python/code/393153/….
  • Martin J.H.
    Martin J.H. over 10 years
    Bug report: If the wordlist has two entries: "foobar" and "foob" (in that order), then the code snippet won't find an anagram for "boof". Only if you reverse the order of the wordlist, then it correctly returns "foob". I think this can be fixed by putting another if clause into the very first for loop, but I'll have to leave that to someone who knows Python.
  • MadPhysicist
    MadPhysicist almost 8 years
    Could you describe your algorithm in a couple of sentences? What I am particularly interested in is what happens after you decide that a certain dictionary word can be composed using some letters of the input. I understand that we then check to see if the remaining characters can be used to compose some other word. How do we know that we have exhausted all possibilities?
  • Brian Clapper
    Brian Clapper over 7 years
  • nicecatch
    nicecatch about 7 years
    I've deleted my previous comment because it was wrong. Anyway: ("az".sum + "by".sum) - "mmnn".sum => 0. That checksum function is not good for anagram solving
  • martinus
    martinus about 7 years
    It's not perfect, but very fast. You need to do a final check with any checksum because the possibility of collisions does not go away.
  • Adamantish
    Adamantish over 4 years
    @MadPhysicist the trie structure allows you to take particular advantage of how in english a lot of words are the same but with different endings. So if your input letters for the anagram contain "q", "u" but not "i" then with just 3 moves we can eliminate "quick", "quickly", "quicker" ,"quicken", etc... So it's a structure which groups words into subsets of each other in a practical way. I suspect there's another data structure which also allows you to eliminate all words with the letter "i" and doesn't care about letter order but not sure how to keep the size of it tractable.
  • Adamantish
    Adamantish over 4 years
    @IgorGanapolsky Because by itself that can only give you single word anagrams. The example of "Eleven plus two" wouldn't be possible as an output.
  • Adamantish
    Adamantish over 4 years
    @MadPhysicist The first thing it does is find all words that could be inside the full anagram sentence. It does that by testing all children from the top node then uses recursion to follow in-turn all the grandchildren of those children that passed, etc... That exhausts all possibilities for the first word. For each of those possible first words it then repeats the whole process to get another word with the remaining letters and round again for each of those until all letters are used up just right. As you can imagine, even this can't work well with a long input string.