What's the fastest way to merge multiple csv files by column?


Solution 1

[...] transposing each csv file and re-saving to disk, and then using the command line to concatenate them [...]

Sounds like Transpose-Cat-Transpose. Use paste for joining files horizontally.

paste -d ',' a.csv b.csv c.csv ... > result.csv

Solution 2

The Python csv module can be set up so that each record is a dictionary with the column names as keys. That way you should be able to read in all the files as dictionaries, and write them to an out-file that has all the columns.

Python is easy to use, so this should be fairly trivial for a programmer of any language.

If your csv files don't have column headings, this will be quite a lot of manual work, though, so then it's perhaps not the best solution.

Since these files are fairly big, it's best not to read all of them into memory at once. I'd recommend that you first open them only to collect all the column names into a list, and use that list to create the output file. Then you can concatenate each input file to the output file without having to hold all of the files in memory.
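As a rough sketch of that two-pass approach (Python 3, with made-up file names), you could collect the union of column names first, then stream each file's rows into the output, leaving blanks for columns a file doesn't have:

import csv

# hypothetical input files -- replace with your own list
csv_names = ['a.csv', 'b.csv', 'c.csv']

# first pass: collect every column name that appears in any file
fieldnames = []
for fn in csv_names:
    with open(fn, newline='') as f:
        for name in csv.DictReader(f).fieldnames:
            if name not in fieldnames:
                fieldnames.append(name)

# second pass: stream each file's rows into the output, one row at a time
with open('result.csv', 'w', newline='') as out:
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval='')
    writer.writeheader()
    for fn in csv_names:
        with open(fn, newline='') as f:
            for row in csv.DictReader(f):
                writer.writerow(row)

DictWriter's restval fills in the columns that are missing from a given file, so only one input file is open at a time.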

Solution 3

Use Go: https://github.com/chrislusf/gleam

Assume file "a.csv" has fields "a1, a2, a3, a4, a5".

And assume file "b.csv" has fields "b1, b2, b3".

We want to join the rows where a1 = b2. And the output format should be "a1, a4, b3".

package main

import (
    "os"

    "github.com/chrislusf/gleam"
    "github.com/chrislusf/gleam/source/csv"
)

func main() {

    f := gleam.New()
    a := f.Input(csv.New("a.csv")).Select(1,4) // a1, a4
    b := f.Input(csv.New("b.csv")).Select(2,3) // b2, b3

    a.Join(b).Fprintf(os.Stdout, "%s,%s,%s\n").Run()  // a1, a4, b3

}

Solution 4

Horizontal concatenation really is trivial. Considering you know C++, I'm surprised you used MATLAB. Processing a GB or so of data in the way you're doing should take on the order of seconds, not days.

By your description, no CSV processing is actually required. The easiest approach is to just do it in RAM.

vector< vector<string> > data( num_files );

for( int i = 0; i < num_files; i++ ) {
    ifstream input( filename[i] );
    string line;
    while( getline(input, line) ) data[i].push_back(line);
}

(Do obvious sanity checks, such as making sure all vectors are the same length...)

Now you have everything, dump it:

ofstream output("concatenated.csv");

for( int row = 0; row < num_rows; row++ ) {
    for( int f = 0; f < num_files; f++ ) {
        if( f > 0 ) output << ",";  // comma between fields, not before the first
        output << data[f][row];
    }
    output << "\n";
}

If you don't want to use all that RAM, you can do it one line at a time. You should be able to keep all files open at once, and just store the ifstream objects in a vector/array/list. In that case, you just read one line at a time from each file and write it to the output.

Solution 5

import csv
import itertools

# put files in the order you want concatenated
csv_names = [...whatever...] 

readers = [csv.reader(open(fn, 'rb')) for fn in csv_names]
writer = csv.writer(open('result.csv', 'wb'))

for row_chunks in itertools.izip(*readers):
    writer.writerow(list(itertools.chain.from_iterable(row_chunks)))

Concatenates horizontally. Assumes all files have the same length. Has low memory overhead and is speedy.

Answer applies to Python 2. In Python 3, itertools.izip is gone (the built-in zip does the same thing), and opening csv files is slightly different:

readers = [csv.reader(open(fn, 'r', newline='')) for fn in csv_names]
writer = csv.writer(open('result.csv', 'w', newline=''))
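Put together, a full Python 3 version of the same approach might look like this (file names are placeholders); contextlib.ExitStack just makes sure every file gets closed:

import contextlib
import csv

# put files in the order you want concatenated (placeholder names)
csv_names = ['a.csv', 'b.csv', 'c.csv']

with contextlib.ExitStack() as stack:
    readers = [csv.reader(stack.enter_context(open(fn, newline='')))
               for fn in csv_names]
    out = stack.enter_context(open('result.csv', 'w', newline=''))
    writer = csv.writer(out)

    # zip stops at the shortest file, so this still assumes equal row counts
    for row_chunks in zip(*readers):
        writer.writerow([field for chunk in row_chunks for field in chunk])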
Comments

  • ankit
    ankit about 2 years

I have about 50 CSV files with 60,000 rows in each, and a varying number of columns. I want to merge all the CSV files by column. I've tried doing this in MATLAB by transposing each csv file and re-saving to disk, and then using the command line to concatenate them. This took my computer over a week and the final result needs to be transposed once again! I have to do this again, and I'm looking for a solution that won't take another week. Any help would be appreciated.

  • ankit
    ankit almost 11 years
    My files do have headings. I just want all the files concatenated horizontally. Is this do-able quickly for the sizes of the files involved? (check comment to the question).
  • Lennart Regebro
    Lennart Regebro almost 11 years
    @ankit: Yeah, no problem. Updated the answer. (This is a fun problem, if I had time I'd do it for you, but I don't, I have to cater to customers. :-))
  • ankit
    ankit almost 11 years
    Hey Lennart, before you go, could you just list the names of the functions/methods I'll be needing? I can google their usage and figure out the rest by myself. Thanks :)
  • Lennart Regebro
    Lennart Regebro almost 11 years
    @ankit: Well, it's all here: docs.python.org/2/library/csv.html Make sure you use the docs for your Python version though. There are subtle differences between how you use it in Python 2 and Python 3.
  • Lennart Regebro
    Lennart Regebro almost 11 years
    As I understand it, there is some CSV manipulation involved, meaning you have to parse the CSV as well. So this isn't enough.
  • Tickon
    Tickon almost 10 years
    Great solution - it can even merge very large csv files that wouldn't even load into memory (in python for example). The operation is done line by line, so no memory issues.
  • Thomas Oster
    Thomas Oster about 4 years
    If you need the first column to match, there is also join en.wikipedia.org/wiki/Join_(Unix)