pandas data frame - select rows and clear memory?


Solution 1

You are much better off doing something like this:

Specify usecols to read_csv to sub-select which columns you want in the first place.

Then read the file in chunks, select the rows you want from each chunk and shunt them off to the side, finally concatenating the result.

Pseudo-code ish:

import pandas as pd

# Read only the columns you need, 10,000 rows at a time
reader = pd.read_csv('big_table.txt', sep='\t', header=0,
                     index_col=0, usecols=the_columns_i_want_to_use,
                     chunksize=10000)

# Select the rows you want from each chunk (positions are relative
# to the chunk, not the whole file), then concatenate the results
df = pd.concat([chunk.iloc[rows_that_i_want] for chunk in reader])

This will have roughly constant memory usage (the size of one chunk), plus about twice the size of the selected rows while the concat runs; once the concat finishes, usage drops back down to just the selected rows.
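For a concrete (if hypothetical) version of the pseudo-code above, assume the file has columns 'id', 'score' and 'name', and that the rows you want can be expressed as a condition on 'score'. Row positions computed against the whole file don't carry over to individual chunks, so each chunk is filtered on its own:

import pandas as pd

# Hypothetical column names; only 'id', 'score' and 'name' are read
reader = pd.read_csv('big_table.txt', sep='\t', header=0,
                     index_col='id', usecols=['id', 'score', 'name'],
                     chunksize=10000)

# Filter each chunk with a per-chunk boolean mask, then concatenate
df = pd.concat(chunk[chunk['score'] > 0.5] for chunk in reader)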

Solution 2

I've had a similar problem; I solved it by filtering the data before loading. When you read the file with read_table you load the whole thing into a DataFrame, and possibly also hold the whole file in memory or duplicate parts of it because of the different dtypes used, which is where the 6 GB comes from.

You could write a generator that loads the contents of the file line by line. I assume the data is row based, one record per row and per line in big_table.txt, so:

import pandas

def big_table_generator(filename):
    with open(filename, 'rt') as f:
        for line in f:
            if is_needed_row(line):   # check if you want this row
                # cut_columns() returns a list with only the selected columns
                record = cut_columns(line)
                yield record


gen = big_table_generator('big_table.txt')
df = pandas.DataFrame.from_records(list(gen))

Note the list(gen): pandas 0.12 and earlier don't accept generators, so you have to convert to a list, which means all the data produced by the generator is put in memory (0.13 does the same thing internally). You will also need roughly twice the memory of the data you keep: once for loading it and once for putting it into pandas' NDFrame structure.

You could also make the generator read from a compressed file; as of Python 3.3 the gzip library only decompresses the chunks it needs.
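For example, a minimal sketch of the same generator reading from a gzip-compressed copy of the file (is_needed_row() and cut_columns() are the same hypothetical helpers as above):

import gzip
import pandas

def big_table_gz_generator(filename):
    # gzip.open in text mode ('rt') decompresses lazily as lines are read
    with gzip.open(filename, 'rt') as f:
        for line in f:
            if is_needed_row(line):
                yield cut_columns(line)


gen = big_table_gz_generator('big_table.txt.gz')
df = pandas.DataFrame.from_records(list(gen))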

Author by

a b

Updated on June 14, 2022

Comments

  • a b
    a b almost 2 years

    I have a large pandas dataframe (size = 3 GB):

    import pandas as pd
    x = pd.read_table('big_table.txt', sep='\t', header=0, index_col=0)
    

    Because I'm working under memory constraints, I subset the dataframe:

    rows = calculate_rows() # a function that calculates what rows I need
    cols = calculate_cols() # a function that calculates what cols I need
    x = x.iloc[rows, cols]
    

    The functions that calculate the rows and columns are not important, but they are DEFINITELY a smaller subset of the original rows and columns. However, when I do this operation, memory usage increases by a lot! The original goal was to shrink the memory footprint to less than 3GB, but instead, memory usage goes well over 6GB.

    I'm guessing this is because Python creates a local copy of the dataframe in memory, but doesn't clean it up. There may also be other things that are happening... So my question is how do I subset a large dataframe and clean up the space? I can't find a function that selects rows/cols in place.

    I have read a lot of Stack Overflow, but can't find much on this topic. It could be I'm not using the right keywords, so if you have suggestions, that could also help. Thanks!

    • Roman Pekar
      Roman Pekar over 10 years
      I think it's important to see when exactly memory usage goes up. Could you try commenting out the last line and check whether memory usage still goes over 6 GB?
    • EdChum
      EdChum over 10 years
      You can manually clear unreferenced memory using the gc module: import gc, then del df, then gc.collect() (see the sketch below). However, in your case you should consider using h5py for larger-than-memory data.
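A minimal sketch of the gc suggestion from that comment, reusing the question's hypothetical calculate_rows()/calculate_cols() helpers:

import gc
import pandas as pd

x = pd.read_table('big_table.txt', sep='\t', header=0, index_col=0)

rows = calculate_rows()
cols = calculate_cols()
subset = x.iloc[rows, cols].copy()   # copy so the subset no longer references x

del x          # drop the only reference to the full DataFrame
gc.collect()   # ask Python to reclaim the unreferenced memory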