Merging contents of multiple .csv files into single .csv file


Solution 1

Here's a Perl script that reads each line of every file named on the command line and appends it to the corresponding element of the array @csv. When there's no more input, it prints out each element of @csv.

The .csv files will be appended in the order that they are listed on the command line.

WARNING: This script assumes that all input files have the same number of lines. Output will likely be unusable if any file has a different number of lines from any of the others.

#!/usr/bin/perl

use strict;
use warnings;

my @csv;

foreach my $file (@ARGV) {
  my $linenum = 0;

  open(my $fh, '<', $file) or die "couldn't open $file for read: $!\n";

  while (my $line = <$fh>) {
    chomp $line;
    $csv[$linenum++] .= ',' . $line;
  }

  close($fh);
}

foreach my $line (@csv) {
  $line =~ s/^,//;   # strip leading comma from line
  print $line, "\n";
}

Given the following input files:

==> 1.csv <==
1,2,3,4
1,2,3,4
1,2,3,4
1,2,3,4

==> 2.csv <==
5,6,7,8
5,6,7,8
5,6,7,8
5,6,7,8

==> 3.csv <==
9,10,11,12
9,10,11,12
9,10,11,12
9,10,11,12

it will produce the following output:

$ ./mergecsv.pl *.csv
1,2,3,4,5,6,7,8,9,10,11,12
1,2,3,4,5,6,7,8,9,10,11,12
1,2,3,4,5,6,7,8,9,10,11,12
1,2,3,4,5,6,7,8,9,10,11,12

OK, now that you've read this far it's time to admit that this doesn't do anything that paste -d, *.csv doesn't also do. So why bother with perl? paste is quite inflexible. If your data is exactly right for what paste does, you're good - it's perfect for the job and very fast. If not, it's completely useless to you.

There are any number of ways a Perl script like this could be improved. For example, it could handle files of different lengths by counting the number of fields in each file and padding @csv with the correct number of empty fields for any file that is missing lines, or it could at least detect differing lengths and exit with an error. Even so, this is a reasonable starting point if more sophisticated merging is required.
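The simpler of those two improvements, detecting different lengths and refusing to merge, can also be done in the shell before running any merge at all. A minimal sketch, with throwaway sample files created just for the demonstration (one deliberately short):

```shell
# Sketch: refuse to merge if any input file has a different line count.
# The sample files a.csv, b.csv, c.csv are made up for this demo.
printf '1,2\n1,2\n' > a.csv
printf '3,4\n3,4\n' > b.csv
printf '5,6\n' > c.csv               # deliberately one line short

expected=$(wc -l < a.csv)
mismatch=0
for f in a.csv b.csv c.csv; do
  lines=$(wc -l < "$f")
  if [ "$lines" -ne "$expected" ]; then
    echo "line count mismatch: $f has $lines lines, expected $expected" >&2
    mismatch=1
  fi
done

if [ "$mismatch" -eq 0 ]; then
  paste -d, a.csv b.csv c.csv       # safe to merge
else
  echo "refusing to merge" >&2
fi
```

With these inputs the check trips on c.csv and nothing is merged; with equal-length files it falls through to the merge.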

BTW, the script above uses a really simple algorithm and stores the entire contents of all input files in memory (in @csv) at once. For files up to a few MB each on a modern system, that's not unreasonable. If, however, you are processing HUGE .csv files, a better algorithm would be to:

  • open all the input files and, while there's still input to read:
    • read a line from each file
    • append the lines (in @ARGV order)
    • print the appended line
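The steps above can be sketched in the shell for exactly two files (the file names are made up for the demo; generalizing to N files is where Perl becomes more comfortable). Only one line per file is held in memory at a time:

```shell
# Streaming merge of two files: read one line from each, join, print,
# repeat until either file runs out. left.csv/right.csv are examples.
printf '1,2\n1,2\n' > left.csv
printf '3,4\n3,4\n' > right.csv

exec 3< left.csv 4< right.csv        # open both input files
while IFS= read -r a <&3 && IFS= read -r b <&4; do
  printf '%s,%s\n' "$a" "$b"         # append the lines and print at once
done > merged.csv
exec 3<&- 4<&-                       # close the descriptors
```

Each pass through the loop emits one merged line, so memory use stays constant no matter how large the inputs are.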

Solution 2

The simplest approach is to run the following command:

cat *csv > combined.csv

combined.csv will then contain the contents of all your .csv files, one after another.
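A quick demonstration with two throwaway files. Note that cat stacks the files vertically, so any header rows after the first file are repeated in the output (and it does not merge columns side by side):

```shell
# x.csv and y.csv are made-up sample files for this demo
printf 'a,b\n1,2\n' > x.csv
printf 'a,b\n3,4\n' > y.csv
cat x.csv y.csv > combined.csv    # y.csv's header row is repeated
```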

Solution 3

awk '(NR == 1) || (FNR > 1)' *.csv > 1000Plus5years_companies_data.csv
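A short explanation, since the one-liner is dense: NR is the line number across all input, while FNR is the line number within the current file. So the pattern keeps line 1 of the first file (the shared header) and every non-first line of each file, dropping the repeated headers. A demo with hypothetical inputs:

```shell
# keep one header row, drop the header of every subsequent file
printf 'id,name\n1,foo\n' > p.csv
printf 'id,name\n2,bar\n' > q.csv
awk '(NR == 1) || (FNR > 1)' p.csv q.csv > stacked.csv
cat stacked.csv
# id,name
# 1,foo
# 2,bar
```

This assumes every input file has the same header on its first line.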

Solution 4

Use csvstack from csvkit:

csvstack *.csv > out.csv

Solution 5

Use paste:

paste -d ',' file1.csv file2.csv ... fileN.csv
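For data like the samples in Solution 1, this produces the same column-wise merge as the Perl script. A quick demo (file names are placeholders):

```shell
# two made-up input files with the same number of lines
printf '1,2\n1,2\n' > file1.csv
printf '3,4\n3,4\n' > file2.csv
paste -d ',' file1.csv file2.csv > wide.csv
cat wide.csv
# 1,2,3,4
# 1,2,3,4
```

Remember paste's limitation noted in Solution 1: if the files have different line counts, the shorter file's rows are simply padded out, which may not be what you want.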


Author: rmb

Updated on September 18, 2022

Comments

  • rmb
    rmb over 1 year

I want to write a script that merges the contents of several .csv files into one .csv file, i.e. appends the columns of all the other files to the columns of the first file. I tried doing so with a "for" loop but was not able to proceed with it.

    Does anyone know how to do this in Linux?

    • Ulrich Schwarz
      Ulrich Schwarz almost 8 years
      By merge, you mean that you want to combine lines from file A and file B if, for example, their first entry is the same? In that case, join might be for you, otherwise, errr, cat?
    • Stephen Harris
      Stephen Harris almost 8 years
You need to explain the type of merge you want; adding file2's columns as new columns to file1? Or adding file2's rows after file1? Or matching on an index key? Or...?
    • rmb
      rmb almost 8 years
      Appending columns of all other files to the columns of first file.
    • Stephen Harris
      Stephen Harris almost 8 years
      paste may be what you're looking for
    • terdon
      terdon almost 8 years
      Please edit your question and show us an example of your input file(s) and desired output file(s). There are many ways to "merge" csv files.
  • Uri Cohen
    Uri Cohen almost 8 years
Wouldn't this copy the rows into a common file instead of the columns?
  • Kusalananda
    Kusalananda almost 8 years
    @fschmitt Copying the rows is appending to the columns, right?
  • Alessio
    Alessio almost 8 years
    @Kusalananda no, it's just concatenating the files.
  • Kusalananda
    Kusalananda almost 8 years
@cas Matter of interpretation. If I have a row and append values to it, they go on the same row. If I have a column and append values to it, they go in the same column. If I want to append the columns to columns, do they go as new columns, or are they appended to the end of the existing ones?
  • Alessio
    Alessio almost 8 years
In this particular case, what the OP wants is: for each individual line of all input files, append each line's fields to make one long line with all the columns, then repeat the process for all subsequent lines. This is hard to describe unambiguously in plain English - the algorithm in my perl script (or the more memory-efficient algorithm described after it) explains it better and is easier to follow.
  • hLk
    hLk over 4 years
    This answer will duplicate the headers. Use head -n 1 file1.csv > combined.out && tail -n+2 -q *.csv >> combined.out where file1.csv is any of the files you want merged. This will merge all the CSVs into one like this answer, but with only one set of headers at the top. Assumes that all CSVs share headers. It is called combined.out to prevent the statements from conflicting.
  • Kusalananda
    Kusalananda over 4 years
    This would be a useful answer if you described what the code is doing and what expectations you have on the input data.
  • Nick
    Nick over 4 years
I found this useful as well so I'll expand... Both NR and FNR represent the number of the line being processed (1-based index). FNR is the current line within each File while NR is the current total Number of Records across all files. So (NR == 1) includes the first line of the first file (header), while (FNR > 1) skips the first line of each subsequent file.
  • Rohit Salecha
    Rohit Salecha almost 4 years
Yes, I've mentioned as a prerequisite that it must be unique data only
  • Paulo Sergio Schlogl
    Paulo Sergio Schlogl over 3 years
That doesn't work. It just concatenates one csv below the other, not columns side by side.
  • Tomáš M.
    Tomáš M. over 2 years
This is the only solution that deals with the non-obvious cases, even though it requires a dedicated tool. Thanks! Exactly what I was looking for.
  • M.Viking
    M.Viking over 2 years
    csvstack works great combined with {00..99} bash globbing to merge multi part SPARK CSV files csvstack part-{00..99}-abcdef.csv > out.csv
  • nealmcb
    nealmcb about 2 years
    This does what I was looking for: combine all the rows of multiple csv files which have the same header (which only is output once). That is vs the clarified original post which wants columns merged.