Merging contents of multiple .csv files into single .csv file
Solution 1
Here's a Perl script that reads each line of every file specified on the command line and appends it to the corresponding element of the array @csv. When there's no more input, it prints out each element of @csv.

The .csv files are appended in the order they are listed on the command line.

WARNING: this script assumes that all input files have the same number of lines. The output will likely be unusable if any file has a different number of lines from the others.
#!/usr/bin/perl
use strict;

my @csv = ();

# Read every file named on the command line, appending each of its
# lines to the element of @csv with the same line number.
foreach (@ARGV) {
    my $linenum = 0;
    open( F, "<", $_ ) or die "couldn't open $_ for read: $!\n";
    while (<F>) {
        chomp;
        $csv[ $linenum++ ] .= "," . $_;
    }
    close(F);
}

# Print the merged lines, stripping the leading comma from each.
foreach (@csv) {
    s/^,//;
    print $_, "\n";
}
Given the following input files:
==> 1.csv <==
1,2,3,4
1,2,3,4
1,2,3,4
1,2,3,4
==> 2.csv <==
5,6,7,8
5,6,7,8
5,6,7,8
5,6,7,8
==> 3.csv <==
9,10,11,12
9,10,11,12
9,10,11,12
9,10,11,12
it will produce the following output:
$ ./mergecsv.pl *.csv
1,2,3,4,5,6,7,8,9,10,11,12
1,2,3,4,5,6,7,8,9,10,11,12
1,2,3,4,5,6,7,8,9,10,11,12
1,2,3,4,5,6,7,8,9,10,11,12
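Before running the script on real data, a quick way to check the equal-line-count assumption is a wc -l pass over the inputs. A minimal sketch (it recreates the three sample files above, so the names 1.csv, 2.csv, 3.csv match the example):

```shell
# Recreate the three sample inputs, then verify that every file has
# the same number of lines before merging.
for i in 1 2 3 4; do echo '1,2,3,4'; done > 1.csv
for i in 1 2 3 4; do echo '5,6,7,8'; done > 2.csv
for i in 1 2 3 4; do echo '9,10,11,12'; done > 3.csv

expected=$(wc -l < 1.csv)
status=0
for f in 2.csv 3.csv; do
    n=$(wc -l < "$f")
    if [ "$n" -ne "$expected" ]; then
        echo "mismatch: $f has $n lines, expected $expected" >&2
        status=1
    fi
done
echo "lines per file: $expected (status $status)"
```

If any count differs, bail out before merging rather than producing misaligned rows.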
OK, now that you've read this far, it's time to admit that this doesn't do anything that paste -d, *.csv doesn't also do. So why bother with Perl? paste is quite inflexible. If your data is exactly right for what paste does, you're good: it's perfect for the job and very fast. If not, it's completely useless to you.

There are any number of ways a Perl script like this could be improved (e.g. handling files of different lengths by counting the number of fields in each file and appending the correct number of empty fields to @csv for each file that is missing lines, or at least detecting mismatched lengths and exiting with an error), but it's a reasonable starting point if more sophisticated merging is required.
BTW, this uses a really simple algorithm and stores the entire contents of all input files in memory (in @csv) at once. For files up to a few MB each on a modern system, that's not unreasonable. If, however, you are processing HUGE .csv files, a better algorithm would be to:
- open all the input files and, while there's still input to read:
- read a line from each file
- append the lines (in @ARGV order)
- print the appended line
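That streaming algorithm can be sketched in plain shell for a fixed number of inputs, using one file descriptor per file. Here-documents stand in for the three sample files; for the general case of N files, paste -d, already does exactly this:

```shell
# Read one line from each input per iteration, join them with commas,
# and print immediately -- no file is ever held in memory in full.
merged=$(
  while IFS= read -r a <&3 && IFS= read -r b <&4 && IFS= read -r c <&5; do
      printf '%s,%s,%s\n' "$a" "$b" "$c"
  done 3<<'ONE' 4<<'TWO' 5<<'THREE'
1,2,3,4
1,2,3,4
ONE
5,6,7,8
5,6,7,8
TWO
9,10,11,12
9,10,11,12
THREE
)
printf '%s\n' "$merged"
```

The loop stops as soon as any input runs out, which is one simple way of handling files of different lengths (truncate to the shortest).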
Solution 2
The simplest approach for achieving that would be to type the following command:
cat *.csv > combined.csv
This file will contain the contents of all your csv files, just in the way you mentioned.
Solution 3
awk '(NR == 1) || (FNR > 1)' *.csv > 1000Plus5years_companies_data.csv
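This keeps the header of the first file only: NR == 1 matches the very first line read across all files, and FNR > 1 matches every non-header line of each file. A small self-contained sketch, with hypothetical inputs a.csv and b.csv sharing a header:

```shell
# Two sample CSVs with a common "id,name" header; the awk condition
# prints the header once, then every data row from both files.
printf 'id,name\n1,alice\n2,bob\n' > a.csv
printf 'id,name\n3,carol\n'        > b.csv

awk '(NR == 1) || (FNR > 1)' a.csv b.csv > combined.csv
cat combined.csv
# id,name
# 1,alice
# 2,bob
# 3,carol
```

Note this assumes all inputs share the same header row.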
Solution 4
Use csvstack from csvkit:
csvstack *.csv > out.csv
Solution 5
Use paste:
paste -d ',' file1.csv file2.csv ... fileN.csv
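Unlike cat, paste joins the files column-wise, which is what the question asks for. A quick sketch with two throwaway files (names are illustrative):

```shell
# paste glues corresponding lines together; cat would stack the files.
printf '1,2\n1,2\n' > left.csv
printf '3,4\n3,4\n' > right.csv

paste -d ',' left.csv right.csv
# 1,2,3,4
# 1,2,3,4
```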
rmb
Updated on September 18, 2022

Comments
-
rmb over 1 year
I want to write a script that merges the contents of several .csv files into one .csv file, i.e. appends the columns of all the other files to the columns of the first file. I had tried doing so using a "for" loop but was not able to proceed with it.
Does anyone know how to do this in Linux?
-
Ulrich Schwarz almost 8 years
By merge, you mean that you want to combine lines from file A and file B if, for example, their first entry is the same? In that case, join might be for you; otherwise, errr, cat?
-
Stephen Harris almost 8 years
You need to explain the type of merge you want: adding file2's columns as new columns to file1? Or adding file2's rows after file1? Or matching on an index key? Or...?
-
rmb almost 8 years
Appending the columns of all the other files to the columns of the first file.
-
Stephen Harris almost 8 years
paste may be what you're looking for
-
terdon almost 8 years
Please edit your question and show us an example of your input file(s) and desired output file(s). There are many ways to "merge" csv files.
-
-
Uri Cohen almost 8 years
Wouldn't this copy the rows in a common file instead of the columns?
-
Kusalananda almost 8 years
@fschmitt Copying the rows is appending to the columns, right?
-
Alessio almost 8 years
@Kusalananda no, it's just concatenating the files.
-
Kusalananda almost 8 years
@cas Matter of interpretation. If I have a row and append values to it, they go on the same row. If I have a column and append values to it, they go in the same column. If I want to append columns to columns, do they go in as new columns, or are they appended to the end of the existing ones?
-
Alessio almost 8 years
In this particular case, what the OP wants is: for each individual line of all the input files, append the fields of that line to make one long line with all the columns, then repeat for all subsequent lines. This is hard to describe unambiguously in plain English; the algorithm in my perl script (or the more memory-efficient algorithm described after it) explains it better and is easier to follow.
-
hLk over 4 years
This answer will duplicate the headers. Use head -n 1 file1.csv > combined.out && tail -n+2 -q *.csv >> combined.out, where file1.csv is any of the files you want merged. This will merge all the CSVs into one like this answer, but with only one set of headers at the top. It assumes that all the CSVs share headers. The output is called combined.out to prevent the two commands from conflicting.
-
Kusalananda over 4 years
This would be a useful answer if you described what the code is doing and what expectations you have of the input data.
-
Nick over 4 years
I found this useful as well, so I'll expand... Both NR and FNR represent the number of the line being processed (1-based index). FNR is the current line within each File, while NR is the current total Number of Records across all files. So (NR == 1) includes the first line of the first file (the header), while (FNR > 1) skips the first line of each subsequent file.
-
Rohit Salecha almost 4 years
Yes, I've mentioned as a pre-req that it must be unique data only.
-
Paulo Sergio Schlogl over 3 years
That doesn't work. It just concatenates the csv files one below the other, not side by side column-wise.
-
Tomáš M. over 2 years
This is the only solution that deals with the non-obvious cases, even though it requires a dedicated tool. Thanks! Exactly what I was looking for.
-
M.Viking over 2 years
csvstack works great combined with {00..99} bash brace expansion to merge multi-part Spark CSV files: csvstack part-{00..99}-abcdef.csv > out.csv
-
nealmcb about 2 years
This does what I was looking for: combine all the rows of multiple csv files which share the same header (which is only output once). That is, as opposed to the clarified original post, which wants columns merged.