How to implement a dictionary (Trie vs HashTable and important issues)?

22,995

Solution 1

I want to address just one point in your question:

A trie is not a general-purpose dictionary data structure. The reason is that a trie is a specialized search tree for (sub)string search. Generally, you will be more interested in general search trees, e.g. binary search trees or B-trees.

All these implementations rely on an ordering of the dictionary elements, and all of them have a logarithmic average-case and worst-case runtime for common operations.

A hash table, by contrast, does not require a relative ordering of the elements. Instead, it requires that elements are hashable and equality comparable. The worst-case characteristic of common hash table characteristics is much worse than for trees, namely linear in the number of elements.

However, with a bit of care the average case for hash tables operations can be made constant (i.e. independent of the container size). What’s more, it can be proven that slower operations are exceedingly rare.

In practice, this means that except for very specialized use-cases, hash tables beat tree-based dictionaries hands down.

The downside to this is that hash tables impose an arbitrary-seeming order on its elements. If you are interested in getting the items from your dictionary in sorted order, hash tables are not for you.

(There are other interesting implementations of dictionaries, e.g. skip lists which rival search trees and probabilistic implementations like the Bloom filter.)

A trie-based implementation can only be used if you are dealing with a dictionary of string values, in which case it is actually often a good choice, in particular if many strings in the dictionary share common prefixes and are rather short.

Solution 2

EDIT stop upvoting this: I misread the question. The OP is not after a dictionary to verify word spellings/suggestions/type-ahead-lookup/auto-completion/whatever (which I thought was what he was after). The OP is after a key/value mapping where for each word there's a definition.

Having worked on dictionaries, I can tell you that you're taking the wrong approach.

It's not as simple as a choice between an hashtable or a trie.

You mention Lingvo: it's much more than just a table.

Do you want close match to be offered suggestions? You may then need to things like generating permutations on what the user entered and for each permutation see if it exists in the dico: if it does, you'd then need to compute its' Levenhstein Edit Distance and suggest first the words that have the shortest LED.

Do you want most likely matches to be auto-completed/suggested (like what Google does)? You'd then need a very advanced data structure like a BK-tree (basically a tree of LED if I understand it correctly).

How many words will you have in your dictionary? You won't be able to use a dictionary made of 400 000 words using Strings and other heavyweight Java objects / data structure without a serious performance hit (once again: a dictionary is more than just one hashtable, a dictionary typically involve several datastructures). This won't easily fit in your users' computer memory. There are known, searchable, ways to store words where every single word can be packed on fewer than 15 bits per word (fewer than 15 bits per word, you read correctly).

In addition to that, you may want to do suggestion based on phonetics: like by using a double-metaphone mapping.

A dictionary, as in a "word dictionary" is so much more than just a key/value table. It is really a complicated beast due to which features the user shall except and due to the amount of data involved. Just plain english + a few specialized domains terminologies, medical, comp-sci, whatever. will give you hundreds of thousands of data: try to put that in a Java HashMap and... Kaboom!

Solution 3

Dictionary implementation in Java, definitely hash collections are best bet.

Regarding HashMap or HashTable : Mainly if your class is used in multithreaded manner than you have to use HashTable, otherwise HashMap is the best option.

HashMap vs TreeMap: If you need insertion order into collection then we have to use TreeMap.

HashMap vs LinkedHashMap: LinkedHashMap implementation differs from HashMap in that it maintains a doubly-linked list running through all of its entries. This linked list defines the iteration ordering, which is normally the order in which keys were inserted into the map (insertion-order). Note that insertion order is not affected if a key is re-inserted into the map. (A key k is reinserted into a map m if m.put(k, v) is invoked when m.containsKey(k) would return true immediately prior to the invocation.)

Share:
22,995
Denys S.
Author by

Denys S.

We keep moving forward, opening new doors, and doing new things, because we're curious and curiosity keeps leading us down new paths. (Walt Disney) Imagination is more important than knowledge. For knowledge is limited to all we now know and understand, while imagination embraces the entire world, and all there ever will be to know and understand. (Albert Einstein)

Updated on February 11, 2020

Comments

  • Denys S.
    Denys S. about 4 years

    I've ran across several questions and articles saying that dictionary implementation in java is done best using tries. But most of them didn't address important issues, as far as I saw it. So, next is a real world task:

    Let's assume that I need to implement a dictionary (let's say something like Lingvo, but simpler) using java. For my particular task it is needed to store word definitions and to perform fast dictionary lookup.

    Please, address next questions:

    • What data structure should I use then (Trie or HashTable)?
    • How should it(search, datastruct) be organised if I need the dictionary to be case insensitive?
    • What if I want it(search, dictionary) to be case sensitive?

    P.S.: Code examples are highly appreciated. :)

    Thanks for answers in advance.

    UPDATE:If we're talking about standard DS implementations in java, is it true that HashTable will be the best one for this particular task? Why not HashMap, TreeMap or LinkedHashMap?

  • Denys S.
    Denys S. over 13 years
    If we're talking about standard DS implementations in java, is it true that HashTable will be the best one for this particular task? Why not HashMap, TreeMap or LinkedHashMap?
  • KitsuneYMG
    KitsuneYMG over 13 years
    @den-javamaniac HashTable is a thread-safe HashMap (Much like Vector vs ArrayList) as such HashMap is better when you know multiple threads will not interact with it. Curiously, Collections.synchronizedMap( new HashMap() ) is faster than HashTable and seems to offer the same safety. A TreeMap requires that it's keys be Comparable and uses a Red-Black tree. LinkedHashMap uses a left/right/parent reference (IIRC) instead of an array. This is akin to the differences between ArrayList and LinkedList. Personally, in java, I rarely use the Linked... collections.
  • Gugussee
    Gugussee over 13 years
    Your answer is very misleading. Dictionaries are one of the typical application of tries. One of the huge advantage of a trie is that it may work, memory-wise, where an hashtable would chocke when there are a lot of data and where a lot of these data are sharing common prefixes. Which is exactly what a dictionary is.
  • Denys S.
    Denys S. over 13 years
    Ups, sorry, confused with all the DS (concerning HashTable vs HashMap). Well, if need a dictionary to store ordered words, than TreeMap appears to be the best, isn't it?
  • Denys S.
    Denys S. over 13 years
    If to say, that I will have about 200 000 words and Levenshtein distance is not covered, as well as auto suggestions/complitions. What would be your suggestion of how to organize the dictionary?
  • Konrad Rudolph
    Konrad Rudolph over 13 years
    @Gugussee: (EDITED!) Sorry, you are right. I thought I had made it clear that I was talking about general purpose dictionaries, and that tries only work for string searches. But looking again, that’s not clear at all. I will update my answer.
  • Gugussee
    Gugussee over 13 years
    @den-javamiac: if you really want something simple like contains(word)? type of queries, then a simple HashMap<String,String> will do. However I also misread the question: depending on how big your typical definitions are, storing 200 000 definitions may or may not be a problem. For example, if every "definition" is the size of a Wikipedia entry, you've got a problem. How many characters will every definition be, on average?
  • Denys S.
    Denys S. over 13 years
    Let's say that word definition is 50 chars. I still don't get why HashTable/HashMap is better in case when I need a fast dictionary lookup and case sensitivity/insensitivity. Can you, please, explain?
  • RnMss
    RnMss over 11 years
    I think a Trie IS a general-purpose dictionary (tree) data structure. The data structure (for substring search) you are talking about is probably an AC-automation. (Or somebody calls that a Trie-Graph, which has extra back trace edges besides the tree-edges of a Trie.)
  • Konrad Rudolph
    Konrad Rudolph over 11 years
    @RnMss What I meant is that tries have strings (or more generally, something with a lexicographical ordering) as keys, hence are not general-purpose – althought Wikipedia explicitly mentions a bitwise trie to store floating point numbers but this still can’t be generalised well to arbitrary types. And yes, the Aho-Corasick automaton is a specialised case that is optimised for substring searches.