Import multiple csv files into pandas and concatenate into one DataFrame


Solution 1

If you have the same columns in all your csv files, you can try the code below. I have added header=0 so that, after reading each csv, the first row can be assigned as the column names.

import pandas as pd
import glob

path = r'C:\DRO\DCL_rawdata_files' # use your path
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
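
Note that glob.glob() makes no guarantee about file order, so if the order of concatenation matters (a point also raised in the comments below), sort the matches first:

all_files = sorted(glob.glob(path + "/*.csv"))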

Solution 2

An alternative to darindaCoder's answer:

import glob
import os
import pandas as pd

path = r'C:\DRO\DCL_rawdata_files'                     # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))     # os.path.join makes the pattern OS independent

df_from_each_file = (pd.read_csv(f) for f in all_files)
concatenated_df   = pd.concat(df_from_each_file, ignore_index=True)
# doesn't create a list, nor does it append to one

Solution 3

import glob
import os
import pandas as pd

# with '' as the directory part, the pattern is matched in the current working directory
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "my_files*.csv"))))

Solution 4

Almost all of the answers here are either unnecessarily complex (glob pattern matching) or rely on additional third-party libraries. You can do this in two lines using only what pandas and Python (all versions) already have built in.

For a few files - one-liner

df = pd.concat(map(pd.read_csv, ['d1.csv', 'd2.csv','d3.csv']))

For many files

import os

filepaths = [f for f in os.listdir(".") if f.endswith('.csv')]
df = pd.concat(map(pd.read_csv, filepaths))

For No Headers

If you have specific options you want to pass to pd.read_csv (e.g. no headers), you can make a separate function and call that with your map:

def read_headerless(path):
    return pd.read_csv(path, header=None)

df = pd.concat(map(read_headerless, filepaths))

The pandas line that builds df relies on three things:

  1. Python's map(function, iterable) applies the function (here pd.read_csv()) to every element of the iterable (our list of csv paths in filepaths).
  2. pandas' read_csv() reads in each CSV file as normal.
  3. pandas' concat() combines the results into the single df variable.
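
The same pattern also works without a named wrapper: functools.partial from the standard library (mentioned in the comments below) can pre-fill the keyword argument. A minimal sketch, assuming the filepaths list from above:

from functools import partial
import pandas as pd

# partial(pd.read_csv, header=None) behaves like pd.read_csv with header=None already filled in
df = pd.concat(map(partial(pd.read_csv, header=None), filepaths))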

Solution 5

Easy and Fast

Import two or more CSV files without having to make a list of names.

import glob
import pandas as pd

df = pd.concat(map(pd.read_csv, glob.glob('data/*.csv')))
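
If you need to pass extra arguments to read_csv with this pattern (a question raised in the comments below), a lambda works in place of the bare function. A sketch, assuming whitespace-delimited files:

df = pd.concat(map(lambda f: pd.read_csv(f, delim_whitespace=True), glob.glob('data/*.csv')))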
Author: jonas

Updated on July 08, 2022

Comments

  • jonas almost 2 years

    I would like to read several csv files from a directory into pandas and concatenate them into one big DataFrame. I have not been able to figure it out though. Here is what I have so far:

    import glob
    import pandas as pd
    
    # get data file names
    path =r'C:\DRO\DCL_rawdata_files'
    filenames = glob.glob(path + "/*.csv")
    
    dfs = []
    for filename in filenames:
        dfs.append(pd.read_csv(filename))
    
    # Concatenate all data into one DataFrame
    big_frame = pd.concat(dfs, ignore_index=True)
    

    I guess I need some help within the for loop???

  • Hexatonic over 8 years
    This seems like an old-fashioned, aka manual, way of doing things, esp. as the Hadoop ecosystem has a growing list of tools where you can perform SQL queries directly on many different directories containing different file types (csv, json, txt, databases) as if they were one data source. There must be something similar in Python, since it has had a 20 year jump start on doing "big data".
  • Sid over 8 years
    The same thing more concise, and perhaps faster as it doesn't use a list: df = pd.concat((pd.read_csv(f) for f in all_files)) Also, one should perhaps use os.path.join(path, "*.csv") instead of path + "/*.csv", which makes it OS independent.
  • ivan_pozdeev over 8 years
    Any numbers to back the "speed up"? Specifically, is it faster than stackoverflow.com/questions/20906474/… ?
  • pydsigner over 8 years
    I don't see the OP asking for a way to speed up his concatenation, this just looks like a rework of a pre-existing accepted answer.
  • curtisp over 7 years
    Using this answer allowed me to add a new column with the file name, e.g. with df['filename'] = os.path.basename(file_) in the for file_ loop... not sure if Sid's answer allows this?
  • Dr Fabio Gori over 7 years
    @Mike @Sid the final two lines can be replaced by: pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True). The inner brackets are required by Pandas version 0.18.1
  • C8H10N4O2 about 7 years
    @curtisp you can still do that with Sid's answer, just use pandas.read_csv(f).assign(filename = foo) inside the generator. assign will return the entire dataframe including the new column filename
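    A minimal sketch of that suggestion, assuming all_files holds the csv paths (the 'data/*.csv' pattern and the filename column name are illustrative):

    import glob
    import os
    import pandas as pd

    all_files = glob.glob('data/*.csv')  # hypothetical directory
    # .assign returns each frame with a new 'filename' column added before concatenation
    df = pd.concat(pd.read_csv(f).assign(filename=os.path.basename(f)) for f in all_files)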
  • toto_tico almost 7 years
    I recommend using glob.iglob instead of glob.glob; the first one returns an iterator (instead of a list).
  • rafaelvalle over 6 years
    Excellent one-liner, especially useful if no read_csv arguments are needed!
  • Pimin Konstantin Kefaloukos over 6 years
    That won't work if the data has mixed columns types.
  • fiedl about 6 years
    If, on the other hand, arguments are needed, this can be done with lambdas: df = pd.concat(map(lambda file: pd.read_csv(file, delim_whitespace=True), data_files))
  • FrankC almost 6 years
    @SKG perfect... this is the only working solution for me. 500 files, 400k rows total, in 2 secs. Thanks for posting it.
  • muon over 5 years
    or just df = pd.concat(map(pd.read_csv, glob.glob('data/*.csv')))
  • cs95 about 5 years
    ^ or with functools.partial, to avoid lambdas
  • gustafbstrom over 4 years
    If you have many files, I'd use a generator instead of importing + appending to a list before concatenating them all.
  • cadip92 over 4 years
    I tried the method prescribed by @muon. But I have multiple files with headers (headers are common). I don't want them to be concatenated in the dataframe. Do you know how I can do that? I tried df = pd.concat(map(pd.read_csv(header=0), glob.glob('data/*.csv'))) but it gave an error "parser_f() missing 1 required positional argument: 'filepath_or_buffer'"
  • sigma1510 about 4 years
    Note that glob.glob() won't preserve the order of your files, so you need to throw a quick sorted(all_files) in there for that.
  • curtisp almost 4 years
    This was the first clear answer I was able to find that described combining multiple csv into a list, then converting the combined list to a dataframe without having to define dataframe columns first. I modified this answer for my use case, combining multiple requests.get(url) csv responses, by replacing filename with io.StringIO(response.content.decode('utf-8'))
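    A sketch of that adaptation, assuming a list of CSV URLs (urls and the endpoints are hypothetical; requests is the third-party HTTP library named in the comment):

    import io
    import requests
    import pandas as pd

    urls = ['https://example.com/a.csv', 'https://example.com/b.csv']  # hypothetical endpoints
    # decode each response body so read_csv can treat it like an in-memory file
    df = pd.concat(pd.read_csv(io.StringIO(requests.get(u).content.decode('utf-8'))) for u in urls)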
  • J. Velazquez-Muriel almost 4 years
    1500 files and 750k rows in 5 secs. Excellent @SKG
  • Shiv Krishna Jaiswal almost 3 years
    Similar to this, there should be a function in the pandas API for reading multiple files in a dir. Apparently it does not have one, as of now.
  • delimiter almost 3 years
    How do we pass arguments to this syntax?
  • robmsmt over 2 years
    It's a little while since you asked... but I updated my answer to include answers without headers (or if you want to pass any change to read_csv).
  • Milan over 2 years
    My answer: stackoverflow.com/a/69994928/10358768, inspired from this particular answer!
  • Gobrel over 2 years
    I use this solution to combine multiple excel files. The files have latitude and longitude, and when I use pd.read_excel to check each file both of the values are read correctly as float. When I use your solution to convert the files into one dataframe, latitude is always an object and only longitude is correct as float. Any ideas why this is so?
  • kristianp over 2 years
    I bet this is a lot faster than using pandas concat!
  • BGG16 over 2 years
    @delimiter, to insert the file path to your docs, replace the word 'data' with your file path, and keep the / at the end.
  • Krishnaap about 2 years
    np_array_list.append(df.as_matrix()) is showing an error: AttributeError: 'DataFrame' object has no attribute 'as_matrix'
  • xmar about 2 years
    Does this work with AWS S3 paths too?