Reading large file in Java -- Java heap space

16,848

Solution 1

Most likely, what's going on is that the file has no line terminators, so the reader just keeps growing its StringBuffer unbounded until it runs out of memory.

The solution would be to read a fixed number of characters at a time, using the reader's 'read' method, and then look for newlines (or other parsing tokens) within the smaller buffers.
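
A minimal sketch of that approach, assuming LF or CRLF line endings (the class name, the callback, and the 1 MB cap are mine; tune the cap to the longest line you actually expect):

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.function.Consumer;

public class ChunkedLineScanner {
    // Arbitrary guard: a "line" longer than this almost certainly means
    // the file has no terminators, so fail fast instead of OOME-ing.
    static final int MAX_LINE = 1_000_000;

    static void scanLines(Reader in, Consumer<String> handler) throws IOException {
        char[] buf = new char[8192];          // fixed-size chunk
        StringBuilder current = new StringBuilder();
        int n;
        while ((n = in.read(buf)) != -1) {    // read at most 8192 chars at a time
            for (int i = 0; i < n; i++) {
                char c = buf[i];
                if (c == '\n') {
                    handler.accept(current.toString());
                    current.setLength(0);     // reuse the builder
                } else if (c != '\r') {       // swallow the CR of CRLF
                    current.append(c);
                }
                if (current.length() > MAX_LINE) {
                    throw new IllegalStateException(
                        "Line exceeds " + MAX_LINE + " chars -- missing terminator?");
                }
            }
        }
        if (current.length() > 0) handler.accept(current.toString());  // last line, no trailing \n
    }

    public static void main(String[] args) throws Exception {
        try (Reader in = new InputStreamReader(new FileInputStream(args[0]), "UTF-8")) {
            scanLines(in, line -> { /* process one logical line here */ });
        }
    }
}
```

Unlike readLine, this caps memory growth explicitly: the terminator-less-file failure mode becomes a clear exception instead of an OutOfMemoryError.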

Solution 2

Are you certain the "lines" in the file are separated by newlines?

Solution 3

I have 3 theories:

  • The input file is not UTF-8 but some indeterminate binary format that results in extremely long lines when read as UTF-8.

  • The file contains some extremely long "lines" ... or no line breaks at all.

  • Something else is happening in code that you are not showing us; e.g. you are adding new elements to the set.


To help diagnose this:

  • Use some tool like od (on UNIX / LINUX) to confirm that the input file really contains valid line terminators; i.e. CR, NL, or CR NL.
  • Use some tool to check that the file is valid UTF-8.
  • Add a static line counter to your code, and when the application blows up with an OOME, print out the value of the line counter.
  • Keep track of the longest line seen so far, and print that out as well when you get an OOME.
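
A sketch of the last two suggestions combined, wrapped around a readLine loop like the one in the question (the class name and static counters are mine; the file path and per-line processing are placeholders):

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class LineDiagnostics {
    // Static so they are still readable after an OutOfMemoryError.
    static long lineCount = 0;
    static int longestLine = 0;

    static void scan(BufferedReader reader) throws IOException {
        String line;
        while ((line = reader.readLine()) != null) {
            lineCount++;
            longestLine = Math.max(longestLine, line.length());
            // ... normal per-line processing would go here ...
        }
    }

    public static void main(String[] args) throws Exception {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(args[0]), "UTF-8"), 8192)) {
            scan(reader);
        } catch (OutOfMemoryError e) {
            // Report roughly where in the file the blowup happened.
            System.err.println("OOME after " + lineCount
                    + " lines; longest line so far: " + longestLine + " chars");
            throw e;
        }
    }
}
```

If the counters show, say, a few million normal lines and then a sudden OOME, the line right after the last counted one is the culprit.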

For the record, your slightly suboptimal use of trim will have no bearing on this issue.

Solution 4

One possibility is that you are running out of heap space during a garbage collection. The Hotspot JVM uses a parallel collector by default, which means that your application can possibly allocate objects faster than the collector can reclaim them. I have been able to cause an OutOfMemoryError with supposedly only 10K live (small) objects, by rapidly allocating and discarding.

You can try instead using the old (pre-1.5) serial collector with the option -XX:+UseSerialGC. There are several other "extended" options that you can use to tune collection.

Author: user431336

Updated on June 15, 2022
Comments

  • user431336
    user431336 almost 2 years

    I'm reading a large TSV file (~40GB) and trying to prune it by reading it line by line and printing only certain lines to a new file. However, I keep getting the following exception:

    java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2894)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:117)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:532)
        at java.lang.StringBuffer.append(StringBuffer.java:323)
        at java.io.BufferedReader.readLine(BufferedReader.java:362)
        at java.io.BufferedReader.readLine(BufferedReader.java:379)
    

    Below is the main part of the code. I specified the buffer size to be 8192 just in case. Doesn't Java clear the buffer once the buffer size limit is reached? I don't see what may cause the large memory usage here. I tried to increase the heap size but it didn't make any difference (machine with 4GB RAM). I also tried flushing the output file every X lines but it didn't help either. I'm thinking maybe I need to make calls to the GC but it doesn't sound right.

    Any thoughts? Thanks a lot. BTW - I know I should call trim() only once, store it, and then use it.

    Set<String> set = new HashSet<String>();
    set.add("A-B");
    ...
    ...
    static public void main(String[] args) throws Exception
    {
        BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(inputFile), "UTF-8"), 8192);
        PrintStream output = new PrintStream(outputFile, "UTF-8");
    
        String line = reader.readLine();
        while (line != null) {
            String[] fields = line.split("\t");
            if (set.contains(fields[0].trim() + "-" + fields[1].trim()))
                output.println(fields[0].trim() + "-" + fields[1].trim());
    
            line = reader.readLine();
        }
    
        output.close();
    }
    
  • Brian Roach
    Brian Roach almost 13 years
    He's not creating anything. He's declaring a variable that holds a reference to an array of String objects (returned by split()). Since its required scope is only in the loop, it's perfectly fine to declare it there.
  • bstick12
    bstick12 almost 13 years
    The String[] is a local variable within the scope of the loop and any allocated memory for the array will be garbage collected by the JVM.
  • Steven Fines
    Steven Fines almost 13 years
    This would probably be a good place for the NIO package; he'd need all the performance he can get to process 40GB or so of text data.
  • user431336
    user431336 almost 13 years
    It makes a lot of sense now, because I noticed that no matter what max heap size I set, the final output file size is always the same. So I suspect there's one line somewhere that causes the trouble. I'm checking it now. Thanks a lot!
  • user431336
    user431336 almost 13 years
    Somewhere in the file this is probably the problem. Thanks a lot.
  • Steven Fines
    Steven Fines almost 13 years
    @user431336: Also, don't forget to close your PrintStream... your example leaves it open when you terminate the method.
  • user431336
    user431336 almost 13 years
    Thanks a lot for this great answer and excellent suggestions!
  • user431336
    user431336 almost 13 years
    @Dataknife the PrintStream? I do close it once the loop terminates.
  • Vladimir Kroz
    Vladimir Kroz over 12 years
    A corrupt file with missing line terminators would be the first thing to check -- I had exactly the same situation while reading a 4GB ASCII file. Try the command "tail <your file_name>" to see if it prints and exits correctly. On a corrupt file it will hang without exiting.
  • ban-geoengineering
    ban-geoengineering over 8 years
    @BrianRoach Correct me if I'm wrong, but surely a String[] is being created every time split() is called? I get where @Shaunak is coming from - if a String[] is being created (and GC'd) on every loop, would it not be more efficient to declare it before the loop, re-use it on each iteration, then set it to null (for GC) after the loop? (I'm sure this is the way it was taught back in the J2ME days!...)