Merging contents of multiple .csv files into single .csv file


Solution 1

Here's a Perl script that reads each line of every file named on the command line and appends it to the corresponding element of the array @csv. When there's no more input, it prints out each element of @csv.

The .csv files will be appended in the order that they are listed on the command line.

WARNING: This script assumes that all input files have the same number of lines. Output will likely be unusable if any file has a different number of lines from any of the others.

#!/usr/bin/perl

use strict;
use warnings;

my @csv;

foreach my $file (@ARGV) {
  my $linenum = 0;

  open(my $fh, '<', $file) or die "couldn't open $file for read: $!\n";

  while (my $line = <$fh>) {
    chomp $line;
    $csv[$linenum++] .= ',' . $line;
  }

  close($fh);
}

foreach my $line (@csv) {
  $line =~ s/^,//;   # strip leading comma from line
  print $line, "\n";
}

Given the following input files:

==> 1.csv <==
1,2,3,4
1,2,3,4
1,2,3,4
1,2,3,4

==> 2.csv <==
5,6,7,8
5,6,7,8
5,6,7,8
5,6,7,8

==> 3.csv <==
9,10,11,12
9,10,11,12
9,10,11,12
9,10,11,12

it will produce the following output:

$ ./mergecsv.pl *.csv
1,2,3,4,5,6,7,8,9,10,11,12
1,2,3,4,5,6,7,8,9,10,11,12
1,2,3,4,5,6,7,8,9,10,11,12
1,2,3,4,5,6,7,8,9,10,11,12

OK, now that you've read this far it's time to admit that this doesn't do anything that paste -d, *.csv doesn't also do. So why bother with perl? paste is quite inflexible. If your data is exactly right for what paste does, you're good - it's perfect for the job and very fast. If not, it's completely useless to you.

There are any number of ways a Perl script like this could be improved. For example, it could handle files of different lengths by counting the number of fields in each file and padding @csv with the correct number of empty fields for any file that is missing lines, or it could at least detect differing lengths and exit with an error. Even so, this is a reasonable starting point if more sophisticated merging is required.
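The simpler of those two improvements, detecting different lengths and refusing to merge, can also be done in the shell before running any merge at all. A minimal sketch, with throwaway sample files created just for the demonstration (one deliberately short):

```shell
# Sketch: refuse to merge if any input file has a different line count.
# The sample files a.csv, b.csv, c.csv are made up for this demo.
printf '1,2\n1,2\n' > a.csv
printf '3,4\n3,4\n' > b.csv
printf '5,6\n' > c.csv               # deliberately one line short

expected=$(wc -l < a.csv)
mismatch=0
for f in a.csv b.csv c.csv; do
  lines=$(wc -l < "$f")
  if [ "$lines" -ne "$expected" ]; then
    echo "line count mismatch: $f has $lines lines, expected $expected" >&2
    mismatch=1
  fi
done

if [ "$mismatch" -eq 0 ]; then
  paste -d, a.csv b.csv c.csv       # safe to merge
else
  echo "refusing to merge" >&2
fi
```

With these inputs the check trips on c.csv and nothing is merged; with equal-length files it falls through to the merge.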

BTW, the script above uses a really simple algorithm and stores the entire contents of all input files in memory (in @csv) at once. For files up to a few MB each on a modern system, that's not unreasonable. If, however, you are processing HUGE .csv files, a better algorithm would be to:

  • open all the input files and, while there's still input to read:
    • read a line from each file
    • append the lines (in @ARGV order)
    • print the appended line
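The steps above can be sketched in the shell for exactly two files (the file names are made up for the demo; generalizing to N files is where Perl becomes more comfortable). Only one line per file is held in memory at a time:

```shell
# Streaming merge of two files: read one line from each, join, print,
# repeat until either file runs out. left.csv/right.csv are examples.
printf '1,2\n1,2\n' > left.csv
printf '3,4\n3,4\n' > right.csv

exec 3< left.csv 4< right.csv        # open both input files
while IFS= read -r a <&3 && IFS= read -r b <&4; do
  printf '%s,%s\n' "$a" "$b"         # append the lines and print at once
done > merged.csv
exec 3<&- 4<&-                       # close the descriptors
```

Each pass through the loop emits one merged line, so memory use stays constant no matter how large the inputs are.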

Solution 2

The simplest approach is to run the following command:

cat *csv > combined.csv

combined.csv will then contain the contents of all your .csv files, one after another.
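A quick demonstration with two throwaway files. Note that cat stacks the files vertically, so any header rows after the first file are repeated in the output (and it does not merge columns side by side):

```shell
# x.csv and y.csv are made-up sample files for this demo
printf 'a,b\n1,2\n' > x.csv
printf 'a,b\n3,4\n' > y.csv
cat x.csv y.csv > combined.csv    # y.csv's header row is repeated
```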

Solution 3

awk '(NR == 1) || (FNR > 1)' *.csv > 1000Plus5years_companies_data.csv
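A short explanation, since the one-liner is dense: NR is the line number across all input, while FNR is the line number within the current file. So the pattern keeps line 1 of the first file (the shared header) and every non-first line of each file, dropping the repeated headers. A demo with hypothetical inputs:

```shell
# keep one header row, drop the header of every subsequent file
printf 'id,name\n1,foo\n' > p.csv
printf 'id,name\n2,bar\n' > q.csv
awk '(NR == 1) || (FNR > 1)' p.csv q.csv > stacked.csv
cat stacked.csv
# id,name
# 1,foo
# 2,bar
```

This assumes every input file has the same header on its first line.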

Solution 4

Use csvstack from csvkit:

csvstack *.csv > out.csv

Solution 5

Use paste:

paste -d ',' file1.csv file2.csv ... fileN.csv
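For data like the samples in Solution 1, this produces the same column-wise merge as the Perl script. A quick demo (file names are placeholders):

```shell
# two made-up input files with the same number of lines
printf '1,2\n1,2\n' > file1.csv
printf '3,4\n3,4\n' > file2.csv
paste -d ',' file1.csv file2.csv > wide.csv
cat wide.csv
# 1,2,3,4
# 1,2,3,4
```

Remember paste's limitation noted in Solution 1: if the files have different line counts, the shorter file's rows are simply padded out, which may not be what you want.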


Author: rmb

Updated on September 18, 2022

Comments

  • rmb
    rmb over 1 year

I want to write a script that merges the contents of several .csv files into one .csv file, i.e. appends the columns of all the other files to the columns of the first file. I tried doing so with a "for" loop but was not able to proceed with it.

    Does anyone know how to do this in Linux?

    • Ulrich Schwarz
      Ulrich Schwarz almost 8 years
      By merge, you mean that you want to combine lines from file A and file B if, for example, their first entry is the same? In that case, join might be for you, otherwise, errr, cat?
    • Stephen Harris
      Stephen Harris almost 8 years
You need to explain the type of merge you want; adding file2's columns as new columns to file1? Or adding file2's rows after file1? Or matching on an index key? Or...?
    • rmb
      rmb almost 8 years
      Appending columns of all other files to the columns of first file.
    • Stephen Harris
      Stephen Harris almost 8 years
      paste may be what you're looking for
    • terdon
      terdon almost 8 years
      Please edit your question and show us an example of your input file(s) and desired output file(s). There are many ways to "merge" csv files.
  • Uri Cohen
    Uri Cohen almost 8 years
Wouldn't this copy the rows into a common file instead of the columns?
  • Kusalananda
    Kusalananda almost 8 years
    @fschmitt Copying the rows is appending to the columns, right?
  • Alessio
    Alessio almost 8 years
    @Kusalananda no, it's just concatenating the files.
  • Kusalananda
    Kusalananda almost 8 years
@cas Matter of interpretation. If I have a row and append values to it, they go on the same row. If I have a column and append values to it, they go in the same column. If I want to append the columns to columns, do they go as new columns, or are they appended to the end of the existing ones?
  • Alessio
    Alessio almost 8 years
In this particular case, what the OP wants is: for each individual line of all input files, append each line's fields to make one long line with all the columns, then repeat the process for all subsequent lines. This is hard to describe unambiguously in plain English - the algorithm in my perl script (or the more memory-efficient algorithm described after it) explains it better and is easier to follow.
  • hLk
    hLk over 4 years
    This answer will duplicate the headers. Use head -n 1 file1.csv > combined.out && tail -n+2 -q *.csv >> combined.out where file1.csv is any of the files you want merged. This will merge all the CSVs into one like this answer, but with only one set of headers at the top. Assumes that all CSVs share headers. It is called combined.out to prevent the statements from conflicting.
  • Kusalananda
    Kusalananda over 4 years
    This would be a useful answer if you described what the code is doing and what expectations you have on the input data.
  • Nick
    Nick over 4 years
I found this useful as well so I'll expand... Both NR and FNR represent the number of the line being processed (1-based index). FNR is the current line within each File while NR is the current total Number of Records across all files. So (NR == 1) includes the first line of the first file (header), while (FNR > 1) skips the first line of each subsequent file.
  • Rohit Salecha
    Rohit Salecha almost 4 years
Yes, I've mentioned as a prerequisite that it must be unique data only
  • Paulo Sergio Schlogl
    Paulo Sergio Schlogl over 3 years
That doesn't work. It just concatenates one csv below the other, not columns side by side.
  • Tomáš M.
    Tomáš M. over 2 years
This is the only solution that deals with the non-obvious cases, even though it requires a dedicated tool. Thanks! Exactly what I was looking for.
  • M.Viking
    M.Viking over 2 years
    csvstack works great combined with {00..99} bash globbing to merge multi part SPARK CSV files csvstack part-{00..99}-abcdef.csv > out.csv
  • nealmcb
    nealmcb about 2 years
    This does what I was looking for: combine all the rows of multiple csv files which have the same header (which only is output once). That is vs the clarified original post which wants columns merged.