HashMap in Java, 100 Million entries


Solution 1

For word processing like that, the answer is usually a tree rather than a HashMap, if you can live with the longer lookup times. That structure is quite memory efficient for natural languages, where many words share common prefixes.
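
To make the idea concrete, here is a minimal sketch of such a prefix tree (trie) that stores a frequency at the node where each term ends, so shared prefixes are stored only once. The class and method names are purely illustrative, not from any library, and a production trie would use more compact child storage than a per-node HashMap.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch: a trie keyed by characters, with the frequency kept at the
// node where a term ends. Shared prefixes ("can", "cannot", "candy") share nodes.
class FrequencyTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        double frequency;   // meaningful only when isTerm is true
        boolean isTerm;
    }

    private final Node root = new Node();

    void put(String term, double frequency) {
        Node node = root;
        for (int i = 0; i < term.length(); i++) {
            node = node.children.computeIfAbsent(term.charAt(i), c -> new Node());
        }
        node.isTerm = true;
        node.frequency = frequency;
    }

    Double get(String term) {
        Node node = root;
        for (int i = 0; i < term.length(); i++) {
            node = node.children.get(term.charAt(i));
            if (node == null) return null;      // no word with this prefix
        }
        return node.isTerm ? Double.valueOf(node.frequency) : null;
    }
}
```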

Depending on the input, a Patricia tree might be even better.

(Also, if this is indeed words from a natural language, are you sure you really need 100,000,000 entries? The number of commonly used words is surprisingly small; commercial solutions (word prediction, spelling correction) rarely use more than 100,000 words regardless of language.)

Solution 2

Your problem is that 1.7 GB of raw text is more than 1,500 MB even without the overhead added by the individual String objects. For huge mappings you should either use a database or a file-backed Map; these use disk space instead of heap memory.
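
As an illustration of the file-backed approach, here is a sketch using MapDB, one common file-backed Map library. The builder calls shown are from MapDB 3.x and may differ between versions, and the file and map names are made up for the example.

```java
import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.Serializer;

import java.util.concurrent.ConcurrentMap;

public class FileBackedFrequencies {
    public static void main(String[] args) {
        // Open (or create) a file-backed store; entries live on disk, not in the heap.
        DB db = DBMaker.fileDB("frequencies.db")
                .fileMmapEnableIfSupported()   // use memory-mapped files where available
                .make();

        ConcurrentMap<String, Double> frequencies = db
                .hashMap("frequencies", Serializer.STRING, Serializer.DOUBLE)
                .createOrOpen();

        frequencies.put("example", 42.0);
        System.out.println(frequencies.get("example"));

        db.close();
    }
}
```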

Update

I originally doubted that allocating 15 GB for the heap was possible for most JVMs: it won't work with any 32-bit JVM, and I doubted a 64-bit JVM would manage it either. In fact, a 15 GB heap should work on a 64-bit JVM when enough physical RAM is available.

Solution 3

A 1.7 GB file is a relatively small file to do this with and store in RAM. I do this with much larger files and store them in memory without a problem. A database could be used, but may be overkill or may be perfect depending on what you plan to do with the data.

As others have said, with natural language there will likely be a relatively small number of unique values, so the map will not actually get that large. I would not use a java.util.HashMap, as it is very inefficient in terms of memory usage, especially when storing primitive values such as ints. java.util.HashMap stores primitives as boxed objects, and it also stores each value inside a HashMap.Entry object, which wastes memory. Because of these two factors, java.util.HashMap uses much more memory than alternatives such as Trove, fastutil and others.

As mentioned, there are several map implementations that do not have these problems. Since you are storing numbers in your map, an extra benefit is a performance boost: there is no need to constantly convert between objects and primitives (i.e. boxing/unboxing) as you put new values in the map or update old values. A benchmark of various primitive hash maps that are better suited for large amounts of data can be found in a post at the Java Performance Tuning Guide.
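
As one concrete illustration, here is a sketch using fastutil's Object2DoubleOpenHashMap; fastutil is only one of the libraries mentioned above, and the snippet simply assumes it is on the classpath.

```java
import it.unimi.dsi.fastutil.objects.Object2DoubleOpenHashMap;

public class PrimitiveFrequencyMap {
    public static void main(String[] args) {
        // Keys are Strings, values are unboxed doubles: no Double wrapper and
        // no per-mapping Entry object as in java.util.HashMap.
        Object2DoubleOpenHashMap<String> frequencies = new Object2DoubleOpenHashMap<>();
        frequencies.defaultReturnValue(0.0);

        // addTo increments the stored primitive in place, with no boxing/unboxing.
        frequencies.addTo("the", 1.0);
        frequencies.addTo("the", 1.0);

        System.out.println(frequencies.getDouble("the")); // 2.0
    }
}
```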

Solution 4

With 100 million terms you are almost certainly over the limit of what should be stored in memory. Store your terms in a database of some kind: either use a commercial database, or write something that allows you to access the file to get the information you want. If the file format you have doesn't let you access it quickly, convert it to one that does; for example, make each record a fixed size, so you can instantly calculate the file offset for any record number. Sorting the records then allows you to do a binary search very quickly (as sketched below). Code like this hugely speeds up access to the file without needing to store the whole file in memory.
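
The following sketch shows the fixed-size-record idea. The record layout is an assumption made up for illustration: a zero-padded UTF-8 key of 32 bytes followed by an 8-byte double, with records written in sorted key order so a lookup is a binary search over file offsets.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class FixedRecordLookup {
    static final int KEY_BYTES = 32;                           // assumed key width
    static final int RECORD_BYTES = KEY_BYTES + Double.BYTES;  // 40 bytes per record

    // Binary search over a file of fixed-size, key-sorted records.
    static Double lookup(RandomAccessFile file, String term) throws IOException {
        // Zero-pad the search key to the fixed width used when the file was written.
        byte[] wanted = Arrays.copyOf(term.getBytes(StandardCharsets.UTF_8), KEY_BYTES);
        byte[] key = new byte[KEY_BYTES];
        long lo = 0, hi = file.length() / RECORD_BYTES - 1;
        while (lo <= hi) {
            long mid = (lo + hi) >>> 1;
            file.seek(mid * RECORD_BYTES);   // jump straight to record number `mid`
            file.readFully(key);
            int cmp = Arrays.compare(key, wanted);
            if (cmp == 0) return file.readDouble();  // frequency follows the key
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return null; // term not present
    }
}
```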

Solution 5

If you just want a lightweight key-value (Map) store, I would look into using Redis. It is very fast and can persist the data if needed. The only downside is that you need to run the Redis store on a Linux machine.
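
As a sketch, storing frequencies in Redis from Java might look like the following with the Jedis client; it assumes a Redis server is running on localhost and that Jedis is on the classpath, and the key naming is just an example.

```java
import redis.clients.jedis.Jedis;

public class RedisFrequencyStore {
    public static void main(String[] args) {
        // Connect to a locally running Redis server on the default port.
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Redis stores values as strings; INCRBYFLOAT maintains a numeric counter.
            jedis.incrByFloat("freq:the", 1.0);
            jedis.incrByFloat("freq:the", 0.5);
            System.out.println(jedis.get("freq:the")); // "1.5"
        }
    }
}
```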

If you are limited to Windows, MongoDB is a good option if you can run it in 64-bit mode.

Author by

ablimit

I'm a student in CS. I'm picking up Perl and Python recently.

Updated on September 24, 2022

Comments

  • ablimit almost 2 years

    I want to store 100 million terms and their frequencies (from a text database) in a HashMap<String, Double>. It is giving me an "Out of Memory" error. I tried to increase the heap space to -Xmx15000M, but it runs for half an hour and then throws the same exception again. The file from which I'm trying to read the words and frequencies is 1.7 GB.

    Any help would be much appreciated.

    Thanks :-)