Concatenating multiple csv files into a single csv with the same header - Python

Solution 1

If you don't need the CSV parsed in memory and are just copying input files to a single output, it's a lot cheaper to avoid parsing entirely and block-copy each file without ever building it up in memory:

import shutil
import glob


# Collect the input CSV files from the folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
allFiles.sort()  # glob lacks reliable ordering, so impose your own if output order matters
with open('someoutputfile.csv', 'wb') as outfile:  # binary mode: bytes are copied verbatim, never decoded
    for i, fname in enumerate(allFiles):
        with open(fname, 'rb') as infile:
            if i != 0:
                infile.readline()  # Throw away header on all but first file
            # Block copy rest of file from input to output without parsing
            shutil.copyfileobj(infile, outfile)
            print(fname + " has been imported.")

That's it; shutil.copyfileobj handles the copying efficiently, eliminating the Python-level work of parsing and reserializing every row.

This assumes all the CSV files have the same format, encoding, line endings, etc., and that the header doesn't contain embedded newlines. When that holds, it's a lot faster than the alternatives.
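If the inputs might not all share one encoding or line-ending style, a text-mode variant of the same trick can normalize them while it copies. This is a minimal sketch, not part of the original answer, and the UTF-8 encoding is an assumption; adjust it to match your data:

import glob
import shutil

path = r'data/US/market/merged_data'
all_files = sorted(glob.glob(path + "/*.csv"))

# Universal-newline reading on the inputs plus newline='\n' on the
# output normalizes any mix of \r\n and \n line endings to \n.
with open('someoutputfile.csv', 'w', encoding='utf-8', newline='\n') as outfile:
    for i, fname in enumerate(all_files):
        with open(fname, 'r', encoding='utf-8') as infile:  # assumed input encoding
            if i != 0:
                infile.readline()  # Throw away header on all but first file
            shutil.copyfileobj(infile, outfile)

Because every byte is decoded and re-encoded, this is slower than the binary copy above, but it still avoids the cost of parsing CSV rows.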

Solution 2

Are you required to do this in Python? If you're open to doing it entirely in the shell, all you need to do is first write the header row from any one of the input .csv files (they all share the same header) into merged.csv before running the one-liner:

head -n 1 any-one-input-file.csv > merged.csv
for f in *.csv; do tail -n +2 "$f" >> merged.csv; done

One caveat: merged.csv itself matches the *.csv glob here, so write it to a different directory (or give it a name that doesn't end in .csv) to keep the loop from reading its own output.

Solution 3

You don't need pandas for this; the plain csv module works fine.

import csv
import glob

path = r'data/US/market/merged_data'
allFiles = sorted(glob.glob(path + "/*.csv"))

df_out_filename = 'df_out.csv'
write_headers = True
with open(df_out_filename, 'w', newline='') as fout:  # newline='' as the csv docs require
    writer = csv.writer(fout)
    for filename in allFiles:
        with open(filename, newline='') as fin:
            reader = csv.reader(fin)
            headers = next(reader)  # Read this file's header row.
            if write_headers:
                write_headers = False  # Only write headers once.
                writer.writerow(headers)
            writer.writerows(reader)  # Write all remaining rows.
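One thing this version buys you over the raw-copy approaches above: because every row is actually parsed and re-serialized by the csv module, quoted fields containing embedded newlines or commas survive the merge intact.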

Comments

  • mattblack (almost 2 years ago)

    I am currently using the below code to import 6,000 csv files (with headers) and export them into a single csv file (with a single header row).

    import glob
    import pandas as pd

    # import csv files from folder
    path = r'data/US/market/merged_data'
    allFiles = glob.glob(path + "/*.csv")
    stockstats_data = pd.DataFrame()
    list_ = []

    for file_ in allFiles:
        df = pd.read_csv(file_, index_col=None)
        list_.append(df)
        stockstats_data = pd.concat(list_)
        print(file_ + " has been imported.")

    This code works fine, but it is slow. It can take up to 2 days to process.

    I was given a one-line shell script for the Terminal that does the same (but with no headers). It takes 20 seconds.

     for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> merged.csv; done 
    

    Does anyone know how I can speed up the first Python script? To cut the time down, I have thought about skipping the DataFrame step and just concatenating the CSVs directly, but I cannot figure it out.

    Thanks.