Fast CSV parsing


Solution 1

Apache Commons CSV

Have you seen Apache Commons CSV?

Caveat On Using split

Bear in mind that in older JVMs, split returned a view of the underlying data, meaning the original line object was not eligible for garbage collection while a reference to any of its views remained; making a defensive copy helped there (Java bug report). In current JVMs, substring (and hence split) copies the characters, so this no longer applies.

split is also not reliable at grouping quoted CSV columns that contain commas.
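To illustrate the quoting problem with plain JDK split (the sample data here is made up): a naive comma split cuts a quoted field in half.

```java
public class SplitCaveat {
    public static void main(String[] args) {
        String line = "id,\"Smith, John\",42";
        String[] parts = line.split(",");
        // The quoted field "Smith, John" is split at its embedded comma,
        // so we get 4 pieces instead of the 3 logical columns.
        System.out.println(parts.length); // 4
        System.out.println(parts[1]);     // "Smith
    }
}
```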

Solution 2

opencsv

Take a look at opencsv.

This blog post, opencsv is an easy CSV parser, has example usage.

Solution 3

The problem with your code is that it uses replaceAll and split, which are very costly operations. You should definitely consider using a CSV parser/reader that does one-pass parsing.
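As a sketch of what one-pass parsing means (illustrative only, not a replacement for a real CSV library; it ignores escaped quotes like `""`): a single scan over the line tracks whether we are inside quotes, so no regex passes and no intermediate copies of the line are created.

```java
import java.util.ArrayList;
import java.util.List;

public class OnePassCsv {
    /** Splits one CSV line in a single pass, honouring double quotes. */
    static List<String> splitLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;     // toggle quote state, drop the quote itself
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString());
                current.setLength(0);     // reuse the builder, no new allocation
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());   // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        // 3 fields; the comma inside the quoted field survives
        System.out.println(OnePassCsv.splitLine("id,\"Smith, John\",42").size()); // 3
    }
}
```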

There is a benchmark on GitHub,

https://github.com/uniVocity/csv-parsers-comparison

that unfortunately was run under Java 6. The numbers are slightly different under Java 7 and 8. I'm trying to get more detailed data for different file sizes, but it's a work in progress;

see https://github.com/arnaudroger/csv-parsers-comparison

Solution 4

Apart from the suggestions made above, I think you can try improving your code by using threading and concurrency.

Following is a brief analysis and a suggested solution:

  1. From the code it seems that you are reading the data over the network (most likely with the Apache Commons HttpClient library).
  2. You need to make sure that the bottleneck you are seeing is not in the data transfer over the network.
  3. One way to check is to dump the data to a file (without parsing) and see how long that takes. This will tell you how much time is actually spent parsing (compared to your current observation).
  4. Now have a look at how the java.util.concurrent package is used. Some of the links that you can use are (1, 2).
  5. The tasks that you are doing in the for loop can be executed in threads.
  6. Using a thread pool and concurrency will greatly improve your performance.

Though the solution involves some effort, in the end it will surely help you.
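The steps above can be sketched with java.util.concurrent. This is a minimal illustration under the assumption that each line can be parsed independently; parseRecord is a hypothetical stand-in for the real per-line work (split, validate, build the object):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrentParse {
    // Hypothetical placeholder for the real per-line work.
    static Map.Entry<String, String> parseRecord(String line) {
        String[] parts = line.split(",", 2);
        return Map.entry(parts[0], parts[1]);
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> lines = List.of("a,1", "b,2", "c,3");
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        // Thread-safe map instead of a plain HashMap, since many threads write.
        ConcurrentMap<String, String> records = new ConcurrentHashMap<>();
        for (String line : lines) {
            pool.submit(() -> {
                Map.Entry<String, String> e = parseRecord(line);
                records.put(e.getKey(), e.getValue());
            });
        }
        pool.shutdown();                            // stop accepting new tasks
        pool.awaitTermination(1, TimeUnit.MINUTES); // wait for parsing to finish
        System.out.println(records.size());         // 3
    }
}
```

Note that this only pays off if parsing a line is genuinely the expensive part; if the network transfer dominates, threads will just wait on I/O.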

Solution 5

opencsv

You should have a look at OpenCSV. I would expect that it has performance optimizations.

Author: Lukasz Madon

I like python. Software Engineer @ rolepoint.com

Updated on July 05, 2022

Comments

  • Lukasz Madon
    Lukasz Madon almost 2 years

I have a Java server app that downloads a CSV file and parses it. Parsing can take from 5 to 45 minutes, and happens each hour. This method is the bottleneck of the app, so it's not premature optimization. The code so far:

            client.executeMethod(method);
            InputStream in = method.getResponseBodyAsStream(); // this is http stream
    
            String line;
            String[] record;
    
            reader = new BufferedReader(new InputStreamReader(in), 65536);
    
            try {
                // read the header line
                line = reader.readLine();
                // some code
                while ((line = reader.readLine()) != null) {
                     // more code
    
                     line = line.replaceAll("\"\"", "\"NULL\"");
    
                     // Now remove all of the quotes
                     line = line.replaceAll("\"", "");     
    
    
                 if (!line.startsWith("ERROR")) {
                       //bla bla 
                        continue;
                     }
    
                     record = line.split(",");
                     //more error handling
                     // build the object and put it in HashMap
             }
             //exceptions handling, closing connection and reader
    

    Is there any existing library that would help me to speed up things? Can I improve existing code?

  • Xavier Combelle
    Xavier Combelle over 12 years
    if the bottleneck is the transfer over the network you should consider specifying a gzip header
  • Guy
    Guy about 12 years
    We had a very bad experience with opencsv. We found it to be both slow and buggy, ended up wasting half a day and replacing it altogether.
  • Kai
    Kai about 12 years
    ok... you may want to add more details to make this information relevant. What problems did you have? Which version did you use? Which other framework did you choose? I'm just wondering because I've seen it in more than one project where it did a good job.
  • Guy
    Guy almost 12 years
    The main problem was that it was returning the wrong number of fields (i.e. I got a 2-field String[] on a 10-field line) for certain lines. I never got to understand why it happened, but I am guessing it relates somehow to bad UTF-8 parsing. I replaced it with my own read-line-by-line, String.split-each-line approach (I realize there are memory considerations here), which ended up running between 15%-30% faster. I was using opencsv 2.3 (Java).
  • Basil Bourque
    Basil Bourque over 9 years
    See comments on similar sibling answer.
  • Malt
    Malt over 3 years
    String.split() uses String.substring(), which hasn't returned views in a long, long time (stackoverflow.com/questions/33893655/…)