Java read huge file ( ~100GB ) efficiently


If this is a binary file, then reading in "lines" does not make a lot of sense.

If the file is really binary, then use a BufferedInputStream and read bytes one at a time into a byte[]. When you reach the byte that marks the end of a "line", add the byte[] and the count of bytes in the line to a queue for your worker threads to process.

And repeat.

Tips:

  • Use a bounded buffer in case you can read lines faster than you can process them.
  • Recycle the byte[] objects to reduce garbage generation.
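
The approach and both tips can be sketched roughly as follows. This is only an illustration under assumptions: the class name RecordPipeline, the 64-slot limit, the single-byte delimiter and the maximum record length are all made up for the example, and the handler stands in for whatever per-line processing you need.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.ObjIntConsumer;

// Hypothetical sketch: one reader thread, several workers, recycled buffers.
class RecordPipeline {
    /** A filled buffer plus the number of valid bytes in it. */
    static final class Record {
        final byte[] data;
        final int length;
        Record(byte[] data, int length) { this.data = data; this.length = length; }
    }

    private static final Record END = new Record(new byte[0], 0); // tells a worker to stop

    static void run(InputStream raw, byte delimiter, int maxRecordLength,
                    int workerCount, ObjIntConsumer<byte[]> handler)
            throws IOException, InterruptedException {
        final int slots = 64; // assumed limit on records in flight
        BlockingQueue<Record> work = new ArrayBlockingQueue<>(slots + workerCount);
        BlockingQueue<byte[]> free = new ArrayBlockingQueue<>(slots);
        for (int i = 0; i < slots; i++) free.add(new byte[maxRecordLength]);

        Thread[] workers = new Thread[workerCount];
        for (int i = 0; i < workerCount; i++) {
            workers[i] = new Thread(() -> {
                try {
                    for (Record r = work.take(); r != END; r = work.take()) {
                        try {
                            handler.accept(r.data, r.length);
                        } finally {
                            free.put(r.data); // recycle the buffer
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }

        try (InputStream in = new BufferedInputStream(raw, 64 * 1024)) {
            byte[] buf = free.take(); // blocks when the workers fall behind
            int len = 0;
            int b;
            while ((b = in.read()) != -1) {
                if ((byte) b == delimiter) {
                    work.put(new Record(buf, len));
                    buf = free.take();
                    len = 0;
                } else {
                    if (len == buf.length) {
                        throw new IOException("record longer than " + maxRecordLength + " bytes");
                    }
                    buf[len++] = (byte) b;
                }
            }
            if (len > 0) work.put(new Record(buf, len)); // last record without a delimiter
        } finally {
            for (int i = 0; i < workerCount; i++) work.put(END);
            for (Thread t : workers) t.join();
        }
    }
}
```

Here the fixed pool of buffers is what bounds memory: the reader blocks in free.take() until a worker hands a buffer back. Error handling is deliberately thin. In particular, if every worker dies from an exception thrown by the handler, the reader will block forever, so real code needs a way to propagate worker failures.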

If the file is (really) text, then you could use BufferedReader and the readLine() method instead of calling read().
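
For the text case, a minimal sketch might look like this. The class and method names are hypothetical, and the charset is an assumption you have to supply: it must match the file's real encoding, which only you can determine.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;

// Hypothetical sketch: stream a text file line by line without loading it all.
class LineReaderSketch {
    /** Passes each line to the handler and returns the number of lines read. */
    static long forEachLine(Path file, Charset charset, Consumer<String> handler)
            throws IOException {
        long lines = 0;
        try (BufferedReader reader = Files.newBufferedReader(file, charset)) {
            String line;
            while ((line = reader.readLine()) != null) {
                handler.accept(line); // e.g. put the line on a bounded queue
                lines++;
            }
        }
        return lines;
    }
}
```

Only one line is held by the reader at a time, so memory use does not depend on the size of the file.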


The above will give you reasonable performance. Depending on how much work has to be done to process each line, it may be good enough that there is no point optimizing the file reading. You can check this by profiling.

If profiling tells you that reading is the bottleneck, then consider using NIO with ByteBuffer or CharBuffer. It is more complicated but potentially faster than read() or readLine().
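
As a rough idea of what the NIO variant looks like, here is a sketch that reads a file through a FileChannel into a direct ByteBuffer and only counts delimiter bytes. The class name, the buffer size parameter and the choice to count rather than extract records are assumptions made to keep the example short. Splitting records that straddle two chunks takes extra bookkeeping, which is part of the added complexity.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: chunked reading with NIO, scanning each chunk in place.
class NioScanSketch {
    /** Returns how many times the delimiter byte occurs in the file. */
    static long countDelimiters(Path file, byte delimiter, int bufferSize)
            throws IOException {
        long count = 0;
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocateDirect(bufferSize);
            while (channel.read(buffer) != -1) {
                buffer.flip(); // switch from filling to draining
                while (buffer.hasRemaining()) {
                    if (buffer.get() == delimiter) count++;
                }
                buffer.clear(); // reuse the same buffer for the next chunk
            }
        }
        return count;
    }
}
```

Whether this beats a BufferedInputStream for your workload is something to measure, not assume.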


Does reading in chunks work?

BufferedReader and BufferedInputStream both read in chunks under the covers.

What will be the optimum buffer size?

The buffer size is probably not that important. I'd make it a few KB or tens of KB.

Any formula for that?

No, there isn't a formula for an optimum buffer size. It will depend on variables that you can't quantify.
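
Since there is no formula, the practical option is to try a few sizes on your own hardware. A minimal sketch of such a probe, with a hypothetical class name and nothing more sophisticated than System.nanoTime() for timing:

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: time a full read of a file with a given buffer size.
class BufferSizeProbe {
    /** Returns {bytesRead, elapsedNanos} for one pass over the file. */
    static long[] measure(Path file, int bufferSize) throws IOException {
        long start = System.nanoTime();
        long bytes = 0;
        try (InputStream in =
                 new BufferedInputStream(Files.newInputStream(file), bufferSize)) {
            while (in.read() != -1) bytes++;
        }
        return new long[] { bytes, System.nanoTime() - start };
    }
}
```

Call measure() with sizes such as 4 KB, 16 KB and 64 KB and compare the timings. Treat the numbers with caution: repeated runs over the same file are usually served from the operating system's file cache, so test on a file large enough to be representative and repeat each measurement.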

Author: Vivek

Updated on June 04, 2022

Comments

  • Vivek
    Vivek almost 2 years

    I would like to read a huge binary file (~100 GB) efficiently in Java. I have to process each line of it. The line processing will be done in separate threads. I don't want to load the whole file into memory. Does reading in chunks work? What will be the optimum buffer size? Any formula for that?

  • Vivek
    Vivek over 7 years
    The file comes from a mainframe and contains IBM-encoded data. It was converted to binary format, which means the data contains certain symbols like ¥€ etc. It is stored in a Windows folder as a txt file, so one can say it is text. Sorry for the confusion.
  • AngelThread
    AngelThread about 4 years
    I don't think this is an efficient way of reading a big file, since foreach is not lazily initialized.