Reading a huge .csv file


Solution 1

You are reading all rows into a list, then processing that list. Don't do that.

Process your rows as you produce them. If you need to filter the data first, use a generator function:

import csv

def getstuff(filename, criterion):
    with open(filename, "r", newline="") as csvfile:  # Python 2: open(filename, "rb")
        datareader = csv.reader(csvfile)
        yield next(datareader)  # yield the header row
        count = 0
        for row in datareader:
            if row[3] == criterion:
                yield row
                count += 1
            elif count:
                # the matching rows are consecutive, so once we have seen
                # matches and then hit a non-match, we are done
                return

I also simplified your filter test; the logic is the same but more concise.

Because the rows that match the criterion form a single consecutive block, you could also use:

import csv
from itertools import dropwhile, takewhile

def getstuff(filename, criterion):
    with open(filename, "r", newline="") as csvfile:  # Python 2: open(filename, "rb")
        datareader = csv.reader(csvfile)
        yield next(datareader)  # yield the header row
        # skip rows until the first match, yield that row and every
        # subsequent matching row, then stop reading altogether
        # Python 2: use `for row in takewhile(...): yield row`
        # instead of `yield from takewhile(...)`.
        yield from takewhile(
            lambda r: r[3] == criterion,
            dropwhile(lambda r: r[3] != criterion, datareader))
        return

You can now loop over getstuff() directly. Do the same in getdata():

def getdata(filename, criteria):
    for criterion in criteria:
        for row in getstuff(filename, criterion):
            yield row
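
On Python 3.3 and newer the inner loop can also be written with yield from, which is equivalent (a minor variant, not required):

def getdata(filename, criteria):
    for criterion in criteria:
        yield from getstuff(filename, criterion)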

Now loop directly over getdata() in your code:

for row in getdata(somefilename, sequence_of_criteria):
    # process row

You now hold only one row in memory, instead of thousands of lines per criterion.

yield makes a function a generator function, which means it won't do any work until you start looping over it.
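
For illustration, this is how the laziness plays out (hypothetical file name and criterion):

rows = getstuff("huge.csv", "some value")  # nothing is read yet; this only creates the generator
header = next(rows)                        # the file is opened and the header row read here
for row in rows:                           # each iteration reads just far enough to find the next match
    print(row[0])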

Solution 2

Although Martijn's answer is probably the best, here is a more intuitive way to process large csv files for beginners. It lets you process groups of rows, or chunks, at a time.

import pandas as pd
chunksize = 10 ** 6  # number of rows per chunk; tune this to fit your memory
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
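
If you need the same kind of filtering as in the question, you can apply it per chunk and keep only the matching rows (a sketch; the column position and criterion are assumptions):

import pandas as pd

matches = []
for chunk in pd.read_csv(filename, chunksize=chunksize):
    # keep only the rows whose fourth column equals the criterion
    matches.append(chunk[chunk.iloc[:, 3] == criterion])
result = pd.concat(matches)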

Solution 3

I do a fair amount of vibration analysis and look at large data sets (tens and hundreds of millions of points). In my testing, the pandas.read_csv() function was 20 times faster than numpy.genfromtxt(), and genfromtxt() was in turn 3 times faster than numpy.loadtxt(). It seems that you need pandas for large data sets.
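
A rough way to reproduce this kind of comparison yourself (a sketch, not the code from the blog post; the file name is an assumption and the file must be purely numeric with no header for the numpy loaders):

import time

import numpy as np
import pandas as pd

filename = "big_numeric_data.csv"  # hypothetical all-numeric csv

for label, loader in [
    ("pandas.read_csv", lambda f: pd.read_csv(f, header=None)),
    ("numpy.genfromtxt", lambda f: np.genfromtxt(f, delimiter=",")),
    ("numpy.loadtxt", lambda f: np.loadtxt(f, delimiter=",")),
]:
    start = time.perf_counter()
    loader(filename)
    print(label, "took", time.perf_counter() - start, "seconds")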

I posted the code and data sets I used in this testing on a blog discussing MATLAB vs Python for vibration analysis.

Solution 4

For anyone who lands on this question: using pandas with chunksize and usecols helped me read a huge compressed file faster than the other proposed options.

import pandas as pd

sample_cols_to_keep =['col_1', 'col_2', 'col_3', 'col_4','col_5']

# First set up the dataframe iterator: 'usecols' filters the columns and
# 'chunksize' sets the number of rows per chunk (tune both as needed)
df_iter = pd.read_csv('../data/huge_csv_file.csv.gz', compression='gzip', chunksize=20000, usecols=sample_cols_to_keep) 

# this list will store the filtered dataframes for later concatenation 
df_lst = [] 

# Iterate over the chunks, filter each one, and append it to the list
for df_ in df_iter:
    tmp_df = (df_.rename(columns={col: col.lower() for col in df_.columns})  # lowercase the column names
                 .pipe(lambda x: x[x.col_1 > 0]))  # keep e.g. only rows where 'col_1' is greater than zero
    df_lst.append(tmp_df.copy())

# Finally, combine the filtered chunks in df_lst into one larger output dataframe, 'df_final'
df_final = pd.concat(df_lst)
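
If even the filtered chunks are too large to keep in memory at once, you can write each one out as you go instead of building a list and concatenating (a sketch, as an alternative to the loop above; the output path is an assumption):

# append each filtered chunk to an output csv instead of keeping it in memory
for i, df_ in enumerate(df_iter):
    filtered = df_[df_['col_1'] > 0]
    filtered.to_csv('filtered_output.csv', mode='a', header=(i == 0), index=False)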

Solution 5

What worked for me, and is super fast, is reading the csv with Dask:

import time

import dask.dataframe as dd

t = time.perf_counter()  # time.clock() was removed in Python 3.8
df_train = dd.read_csv('../data/train.csv', usecols=['col1', 'col2'])
df_train = df_train.compute()  # materializes the result as a pandas dataframe
print("load train:", time.perf_counter() - t)

Another working solution is:

import pandas as pd
from tqdm import tqdm

PATH = '../data/train.csv'
chunksize = 500000
traintypes = {
    'col1': 'category',
    'col2': 'str',
}

cols = list(traintypes.keys())

df_list = []  # list to hold the batch dataframes

for df_chunk in tqdm(pd.read_csv(PATH, usecols=cols, dtype=traintypes, chunksize=chunksize)):
    # Can process each chunk of dataframe here
    # clean_data(), feature_engineer(),fit()

    # Alternatively, append the chunk to list and merge all
    df_list.append(df_chunk) 

# Merge all dataframes into one dataframe
X = pd.concat(df_list)

# Delete the dataframe list to release memory
del df_list
del df_chunk

Comments

  • Charles Dillon
    Charles Dillon about 4 years

    I'm currently trying to read data from .csv files in Python 2.7 with up to 1 million rows, and 200 columns (files range from 100mb to 1.6gb). I can do this (very slowly) for the files with under 300,000 rows, but once I go above that I get memory errors. My code looks like this:

    def getdata(filename, criteria):
        data=[]
        for criterion in criteria:
            data.append(getstuff(filename, criterion))
        return data
    
    def getstuff(filename, criterion):
        import csv
        data=[]
        with open(filename, "rb") as csvfile:
            datareader=csv.reader(csvfile)
            for row in datareader: 
                if row[3]=="column header":
                    data.append(row)
                elif len(data)<2 and row[3]!=criterion:
                    pass
                elif row[3]==criterion:
                    data.append(row)
                else:
                    return data
    

    The reason for the else clause in the getstuff function is that all the elements which fit the criterion will be listed together in the csv file, so I leave the loop when I get past them to save time.

    My questions are:

    1. How can I manage to get this to work with the bigger files?

    2. Is there any way I can make it faster?

    My computer has 8gb RAM, running 64bit Windows 7, and the processor is 3.40 GHz (not certain what information you need).

  • wwii
    wwii over 6 years
    Why does using pandas make it more intuitive?
  • mmann1123
    mmann1123 over 6 years
    4 lines of code is always better for newbies like myself.
  • Martijn Pieters
    Martijn Pieters over 6 years
    The regular Python code is just as short, and lets you process per line. The generator function is only there to filter stuff; how would you go about doing the same filtering in Pandas?
  • Elsa Li
    Elsa Li about 6 years
    This is awesome! Solved my problem of loading and processing large csv files using pandas. Thanks!
  • Martijn Pieters
    Martijn Pieters almost 6 years
    "So, this works as efficiently as the solution which uses the yield operator": sorry, it does not. The callback function call adds more overhead, especially since you then have to handle state explicitly and separately.
  • Rishabh Agrahari
    Rishabh Agrahari almost 6 years
    @MartijnPieters Thanks. Updated the answer.
  • pydsigner
    pydsigner over 5 years
    The OP's primary issue was not one of speed, it was one of memory exhaustion. Using a different function for processing the file itself doesn't remove the downsides of reading it into a list rather than using a stream processor.
  • Dielson Sales
    Dielson Sales over 4 years
    It works very well even when the content of some rows spans multiple lines!
  • Sam Dillard
    Sam Dillard over 4 years
    doesn't the df_train=df_train.compute() line in your first solution load the whole dataset into memory...which is what he's trying not to do?
  • user5359531
    user5359531 almost 4 years
    telling beginners to install pandas just to do simple line-by-line processing on a .csv file is a mistake. We all know that Python library management is a mess; there is no reason to do this when the builtin csv module is more than capable and comes pre-installed.
  • user5359531
    user5359531 almost 4 years
    do you get the same memory efficiency when using this technique with csv.DictReader? Because my tests on a 2.5GB .csv file show that trying to iterate row by row like this when using that instead of csv.reader causes the Python process to grow to the full 2.5GB memory usage.
  • Martijn Pieters
    Martijn Pieters almost 4 years
    @user5359531 that would indicate you keep references to the dictionary objects somewhere. DictReader by itself doesn’t retain references so the problem lies elsewhere.
  • Archon
    Archon over 2 years
    time.clock() has been deprecated since Python 3.3 and was removed in Python 3.8: use time.perf_counter() or time.process_time() instead.
  • Paul Rougieux
    Paul Rougieux over 2 years
    @user5359531 data preparation and analysis is often done with pandas data frames. It is very common to load input data in the form of csv files. See also the other answer mentioning the arguments 'chunksize' and 'usecols'.