Best way to work with large amounts of CSV data quickly

Solution 1

How about using a database?

Jam the records into tables, then query them out using joins.

The import might take a while, but the DB engine will be optimized for the join and retrieval part...
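
For illustration, here's a rough sketch of that approach in Ruby with the sqlite3 gem; the file, table, and column names are invented and would need to match your real data.

require 'csv'
require 'sqlite3'

db = SQLite3::Database.new('merge.db')
db.execute 'CREATE TABLE IF NOT EXISTS transactions (id TEXT, customer_id TEXT, amount TEXT)'
db.execute 'CREATE TABLE IF NOT EXISTS customers (customer_id TEXT, name TEXT)'

# Wrap the inserts in one transaction so SQLite doesn't sync to disk per row.
db.transaction do
  CSV.foreach('transactions.csv', headers: true) do |row|
    db.execute 'INSERT INTO transactions VALUES (?, ?, ?)',
               [row['id'], row['customer_id'], row['amount']]
  end
end
# (load customers.csv into the customers table the same way)

# The merge is then a single join instead of repeated passes over the CSVs.
db.execute(<<~SQL) { |joined_row| p joined_row }
  SELECT t.id, t.amount, c.name
  FROM transactions t
  JOIN customers c ON c.customer_id = t.customer_id
SQL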

Solution 2

Here is some Ruby code I wrote to process large CSV files (~180 MB in my case).

https://gist.github.com/1323865

A standard FasterCSV.parse pulling it all into memory was taking over an hour. This got it down to about 10 minutes.

The relevant part is this:

require 'fastercsv'   # gem on Ruby 1.8; on 1.9+ the stdlib CSV class is FasterCSV

lines = []
IO.foreach('/tmp/zendesk_tickets.csv') do |line|
  lines << line
  if lines.size >= 1000
    # Parse the batch; if it isn't a complete CSV chunk yet (e.g. a quoted
    # field spans multiple lines), "rescue next" keeps accumulating lines
    # and retries on the next pass.
    lines = FasterCSV.parse(lines.join) rescue next
    store lines   # bulk-insert the parsed rows (see the gist for `store`)
    lines = []
  end
end
store FasterCSV.parse(lines.join)   # flush the final batch of < 1000 lines

IO.foreach doesn't load the entire file into memory; it just steps through it with a buffer. When it gets to 1000 lines, it tries parsing those lines as CSV and inserting just those rows. One tricky part is the "rescue next". If your CSV has some fields that span multiple lines, you may need to grab a few more lines to get a valid parseable CSV string. Otherwise the line you're on could be in the middle of a field.

In the gist you can see one other nice optimization, which uses MySQL's INSERT ... ON DUPLICATE KEY UPDATE. This allows you to insert in bulk, and if a duplicate key is detected it simply overwrites the values in that row instead of inserting a new row. You can think of it as a create/update in one query. You'll need to set a unique index on at least one column for this to work.
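
As a hedged sketch of what that can look like with the mysql2 gem (the connection details, table, and column names below are placeholders, not the gist's actual code), the unique index on ticket_id is what makes the upsert work:

require 'mysql2'

client = Mysql2::Client.new(host: 'localhost', username: 'user', database: 'zendesk')

# Each inner array is one row; ticket_id is assumed to carry a unique index.
rows = [[1, 'open'], [2, 'closed']]
values = rows.map { |id, status| "(#{id.to_i}, '#{client.escape(status)}')" }.join(', ')

# One statement inserts new rows and overwrites existing ones in place.
client.query(<<~SQL)
  INSERT INTO tickets (ticket_id, status)
  VALUES #{values}
  ON DUPLICATE KEY UPDATE status = VALUES(status)
SQL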

Solution 3

10M+ rows doesn't really sound like that much. If you can preload the contents of the files and match up the data in memory with decent data structures (you'll want maps at some point), you won't have to keep running through the CSV files over and over. File access is SLOW.
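
For example (with hypothetical file and column names), the lookup side can be as simple as one Hash per reference file, keyed by ID:

require 'csv'

# Preload the two small reference files once; every lookup afterwards is O(1).
customers = {}
CSV.foreach('customers.csv', headers: true) do |row|
  customers[row['customer_id']] = row
end

billing = {}
CSV.foreach('billing.csv', headers: true) do |row|
  billing[row['account_id']] = row
end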

Solution 4

Two reasonably fast options:

  1. Put your data into an SQLite DB. Then it's a simple query with a pair of joins that will perform far faster than anything you could write yourself -- SQL is very good for this kind of task.

  2. Assuming your additional CSV files are small enough to fit into RAM, you can read everything into a hash, using customer ID as a key, then look up that hash when processing the main file with 10M+ records. Note that it's only necessary to put the lookup data into RAM; the main list can be processed in small batches.

Solution 5

In my experience with Ruby, prepare for about 10x the memory usage of the actual payload. Of course, with current amounts of RAM, if the process loads only one file at a time, 10 MB is almost negligible even when multiplied by ten :)

If you can read one line at a time (which is easy with File instances), you could use FasterCSV and write one line at a time as well. That would make memory consumption O(1) instead of O(n). But with 10-megabyte files you can probably slurp the whole file into memory and write it out as CSV in one pass, given only a few processes running at any given time.
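
A minimal sketch of that streaming pattern, assuming the stdlib CSV class (which is FasterCSV on Ruby 1.9+) and made-up file names and headers; the customers hash is the kind of preloaded lookup described in the other answers:

require 'csv'

# Small lookup file goes into memory once.
customers = {}
CSV.foreach('customers.csv', headers: true) { |row| customers[row['customer_id']] = row }

# The big file streams through: one row read, one row written, O(1) memory.
CSV.open('merged.csv', 'w') do |out|
  out << %w[transaction_id customer_id customer_name]
  CSV.foreach('transactions.csv', headers: true) do |row|
    customer = customers[row['customer_id']]
    out << [row['id'], row['customer_id'], customer && customer['name']]
  end
end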

Comments

  • NJ.
    NJ. over 3 years

    I have large CSV datasets (10M+ lines) that need to be processed. I have two other files that need to be referenced for the output—they contain data that amplifies what we know about the millions of lines in the CSV file. The goal is to output a new CSV file that has each record merged with the additional information from the other files.

    Imagine that the large CSV file has transactions but the customer information and billing information is recorded in two other files and we want to output a new CSV that has each transaction linked to the customer ID and account ID, etc.

    A colleague has a functional program written in Java to do this but it is very slow. The reason is that the CSV file with the millions of lines has to be walked through many, many, many times apparently.

    My question is—yes, I am getting to it—how should I approach this in Ruby? The goal is for it to be faster (it's taking 18+ hours right now with very little CPU activity).

    Can I load this many records into memory? If so, how should I do it?

    I know this is a little vague. Just looking for ideas as this is a little new to me.

    • PJP
      PJP about 13 years
      As two (so far) of the answers have said, a database is a good solution. Postgres or MySQL can easily handle this, including doing bulk loads of the files. They'd import the data very fast, well within the 18 hours you're currently seeing, and, once the data was in the database, they'd be able to create your output records on the fly. 10+ million records is nothing to a database. :-)
  • NJ.
    NJ. about 13 years
    That's what I am thinking. What sort of structures would you use?
  • Lightness Races in Orbit
    Lightness Races in Orbit about 13 years
    @NJ: Without knowing what your algorithms are, I couldn't possibly even begin to guess.
  • PJP
    PJP about 13 years
    The problem with slurping 10M+ lines in one shot is not so much using the available RAM, though that is a scalability issue; it's memory allocation as the data is read. Way back when, I did some benchmarks here showing how to read huge files fast, then process them line by line. FINDING the benchmarks will be the fun part now.
  • PJP
    PJP about 13 years
    Ah, found it. It's part of Ruby: start reading at arbitrary point in large file. My explanation is that Ruby reads in set sizes, then finds it needs more and more RAM, so it has to reacquire memory and move the variables around. By predefining how much to read, it grabs everything it needs up front and then streams the incoming data directly into the buffer. Oh, "10 megabyte files": the OP says they're 10M+ lines, not bytes.
  • EdvardM
    EdvardM about 13 years
    Yes, and remember you can preallocate size by saying a = Array.new(n) # allocate n slots immediately
  • PJP
    PJP about 13 years
    That would help if the data were being read as an Array. Read doesn't do that; it only pulls the data in as a string. Subsequent processing could split it, but usually iterating over String#each_line will be sufficient and avoids additional copying of data.
  • Jordan Rumpelstiltskin Nemrow
    Jordan Rumpelstiltskin Nemrow almost 10 years
    Using 'IO.foreach' on a 1G file was 5X faster than using 'CSV.foreach'.
  • OpenCoderX
    OpenCoderX over 9 years
    What about the last group of lines that is < 1000?
  • whizcreed
    whizcreed over 8 years
    @holaSenor store lines at the end takes care of the remaining group too
  • psychoslave
    psychoslave over 2 years
    Excellent solution. I coupled it with Rails 6 insert_all and a prefetch of entity ids I had to look up for constructing records with foreign keys, and the result is now amazingly fast. :)
