Fast CSV parsing
Solution 1
Apache Commons CSV
Have you seen Apache Commons CSV?
Caveat On Using split
Bear in mind that split
only returns a view of the data, meaning that the original line
object is not eligible for garbage collection while there is a reference to any of its views. Perhaps making a defensive copy will help? (Java bug report)
It is also unreliable at grouping escaped CSV columns that contain commas.
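The comma problem is easy to demonstrate; a minimal sketch with made-up input values:

```java
public class SplitCaveat {
    // Naively splitting on commas tears quoted fields apart:
    // the quoted "b,c" is one logical CSV column, but split sees two.
    public static String[] naiveSplit(String line) {
        return line.split(",");
    }

    public static void main(String[] args) {
        String line = "a,\"b,c\",d";      // 3 logical CSV columns
        String[] parts = naiveSplit(line);
        System.out.println(parts.length); // prints 4, not 3
    }
}
```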
Solution 2
opencsv
Take a look at opencsv.
This blog post, opencsv is an easy CSV parser, has example usage.
Solution 3
The problem with your code is that it uses replaceAll and split, which are very costly operations. You should definitely consider using a CSV parser/reader that does a one-pass parse.
There is a benchmark on GitHub
https://github.com/uniVocity/csv-parsers-comparison
that unfortunately was run under Java 6. The numbers are slightly different under Java 7 and 8. I'm trying to get more detailed data for different file sizes, but it's a work in progress;
see https://github.com/arnaudroger/csv-parsers-comparison
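To illustrate what "one-pass parsing" means, here is a minimal hand-rolled tokenizer sketch (not the code of any of the benchmarked libraries): it walks each line once, tracking quote state, instead of rescanning the string with replaceAll and split.

```java
import java.util.ArrayList;
import java.util.List;

public class OnePassCsv {
    // Single pass over the line: a flag tracks whether we are inside a
    // quoted field, so commas inside quotes are kept and "" becomes ".
    public static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    field.append('"'); // escaped quote ""
                    i++;
                } else {
                    inQuotes = !inQuotes; // opening or closing quote
                }
            } else if (c == ',' && !inQuotes) {
                fields.add(field.toString()); // field separator
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString()); // last field
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parseLine("a,\"b,c\",d")); // prints [a, b,c, d]
    }
}
```

A real parser library also handles embedded newlines, malformed input, and charset issues, which is why the benchmarked libraries are still the better choice.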
Solution 4
Apart from the suggestions made above, I think you can try improving your code by using threading and concurrency.
Following is a brief analysis and a suggested solution:
- From the code it seems that you are reading the data over the network (most likely with the apache-common-httpclient library).
- You need to make sure that the bottleneck you describe is not in the data transfer over the network.
- One way to check is to dump the data to a file (without parsing) and see how long that takes. This will tell you how much time is actually spent parsing (compared to your current observation).
- Now have a look at how the java.util.concurrent package is used. Some of the links that you can use are (1,2).
- The tasks that you are doing in the for loop can be executed in threads.
- Using a thread pool and concurrency will greatly improve your performance.
The solution involves some effort, but in the end it will surely help you.
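A minimal sketch of the thread-pool idea: one caller feeds lines to a fixed pool that does the per-line work, so parsing can overlap with reading. Here parseRecord is a hypothetical stand-in for the body of the question's loop, and keying the map by the first column is an assumption.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelParse {
    // Hypothetical stand-in for the per-line work in the question's loop.
    static String[] parseRecord(String line) {
        return line.split(",");
    }

    public static Map<String, String[]> parseAll(List<String> lines) {
        // ConcurrentHashMap replaces the plain HashMap, since multiple
        // worker threads write results at the same time.
        Map<String, String[]> result = new ConcurrentHashMap<>();
        ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        for (String line : lines) {
            pool.submit(() -> {
                String[] record = parseRecord(line);
                result.put(record[0], record); // keyed by first column (assumption)
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES); // wait for workers to finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return result;
    }
}
```

Note this only pays off if parsing a line is genuinely CPU-bound; if the network is the bottleneck, threads will just wait on I/O.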
Solution 5
opencsv
You should have a look at OpenCSV. I would expect that they have performance optimizations.
Updated on July 05, 2022
Comments
-
Lukasz Madon almost 2 years
I have a Java server app that downloads a CSV file and parses it. The parsing can take from 5 to 45 minutes, and happens every hour. This method is the bottleneck of the app, so it's not premature optimization. The code so far:
```java
client.executeMethod(method);
InputStream in = method.getResponseBodyAsStream(); // this is http stream
String line;
String[] record;
reader = new BufferedReader(new InputStreamReader(in), 65536);
try {
    // read the header line
    line = reader.readLine();
    // some code
    while ((line = reader.readLine()) != null) {
        // more code
        line = line.replaceAll("\"\"", "\"NULL\"");
        // Now remove all of the quotes
        line = line.replaceAll("\"", "");
        if (!line.startsWith("ERROR")) {
            //bla bla
            continue;
        }
        record = line.split(",");
        //more error handling
        // build the object and put it in HashMap
    }
} // exceptions handling, closing connection and reader
```
Is there any existing library that would help me to speed up things? Can I improve existing code?
-
Xavier Combelle over 12 years: if the bottleneck is the transfer over the network, you should consider specifying a gzip header
-
Guy about 12 years: We have very bad experience with opencsv. We found it to be both slow and buggy, ended up wasting half a day, and replaced it altogether.
-
Kai about 12 years: OK... you may want to add more details to make this information relevant. What problems did you have? Which version did you use? Which other framework did you choose? I'm just wondering, because I've seen it do a good job in more than one project.
-
Guy almost 12 years: The main problem was that it was returning the wrong number of fields (i.e. I got a 2-field String[] on a 10-field line) for certain lines. I never got to understand why it happened, but I am guessing it relates somehow to bad UTF-8 parsing. I replaced it with my own read-line-by-line, String.split-each-line code (I realize there are memory considerations here), which ended up running 15%-30% faster. I was using opencsv v2.3 (Java).
-
Basil Bourque over 9 years: See comments on the similar sibling answer.
-
Malt over 3 years: String.split() uses String.substring(), which hasn't returned views in a long, long time (stackoverflow.com/questions/33893655/…)