How do I release memory used by a pandas dataframe?

Solution 1

Reducing memory usage in Python is difficult, because Python does not actually release memory back to the operating system. If you delete objects, then the memory is available to new Python objects, but not free()'d back to the system (see this question).

If you stick to numeric numpy arrays, those are freed, but boxed objects are not.

>>> import os, psutil, numpy as np # psutil may need to be installed
>>> def usage():
...     process = psutil.Process(os.getpid())
...     return process.memory_info()[0] / float(2 ** 20)
... 
>>> usage() # initial memory usage
27.5 

>>> arr = np.arange(10 ** 8) # create a large array without boxing
>>> usage()
790.46875
>>> del arr
>>> usage()
27.52734375 # numpy just free()'d the array

>>> arr = np.arange(10 ** 8, dtype='O') # create lots of objects
>>> usage()
3135.109375
>>> del arr
>>> usage()
2372.16796875  # numpy frees the array, but python keeps the heap big

Reducing the Number of Dataframes

Python keeps our memory at its high watermark, but we can reduce the total number of dataframes we create. When modifying your dataframe, prefer inplace=True, so you don't create copies.
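
For example (a small sketch; whether inplace=True really avoids an intermediate copy depends on the operation and the pandas version, as the comments below note, but it at least avoids binding a second dataframe to a name):

>>> df = pd.DataFrame({'foo': [1, 2, 2, 3]})
>>> df.drop_duplicates(inplace=True)   # modifies df; no second dataframe left bound to a name
>>> df2 = df.drop_duplicates()         # binds another dataframe that you must del later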

Another common gotcha is holding on to copies of previously created dataframes in ipython:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'foo': [1,2,3,4]})

In [3]: df + 1
Out[3]: 
   foo
0    2
1    3
2    4
3    5

In [4]: df + 2
Out[4]: 
   foo
0    3
1    4
2    5
3    6

In [5]: Out # Still has all our temporary DataFrame objects!
Out[5]: 
{3:    foo
 0    2
 1    3
 2    4
 3    5, 4:    foo
 0    3
 1    4
 2    5
 3    6}

You can fix this by typing %reset Out to clear your history. Alternatively, you can adjust how much history ipython keeps with ipython --cache-size=5 (default is 1000).

Reducing Dataframe Size

Wherever possible, avoid using object dtypes.

>>> df.dtypes
foo    float64 # 8 bytes per value
bar      int64 # 8 bytes per value
baz     object # at least 48 bytes per value, often more

Values with an object dtype are boxed, which means the numpy array just contains a pointer and you have a full Python object on the heap for every value in your dataframe. This includes strings.

Whilst numpy supports fixed-size strings in arrays, pandas does not (it has caused user confusion in the past). This can make a significant difference:

>>> import numpy as np
>>> arr = np.array(['foo', 'bar', 'baz'], dtype='S') # fixed-width byte strings
>>> arr.dtype
dtype('S3')
>>> arr.nbytes
9

>>> import sys; import pandas as pd
>>> s = pd.Series(['foo', 'bar', 'baz'])
>>> s.dtype
dtype('O')
>>> sum(sys.getsizeof(x) for x in s)
120

You may want to avoid using string columns, or find a way of representing string data as numbers.
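
One option for columns with repeated strings (a sketch; the savings depend on how many distinct values the column has) is the category dtype, which stores each distinct string once and keeps only a small integer code per row:

>>> s = pd.Series(['foo', 'bar', 'baz'] * 1000000)
>>> s.memory_usage(deep=True)                      # object dtype: counts a boxed Python string per row
>>> s.astype('category').memory_usage(deep=True)   # integer codes plus the 3 stored strings, far smaller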

If you have a dataframe that contains many repeated values (NaN is very common), then you can use a sparse data structure to reduce memory usage:

>>> df1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 39681584 entries, 0 to 39681583
Data columns (total 1 columns):
foo    float64
dtypes: float64(1)
memory usage: 605.5 MB

>>> df1.shape
(39681584, 1)

>>> df1.foo.isnull().sum() * 100. / len(df1)
20.628483479893344 # so 20% of values are NaN

>>> df1.to_sparse().info()
<class 'pandas.sparse.frame.SparseDataFrame'>
Int64Index: 39681584 entries, 0 to 39681583
Data columns (total 1 columns):
foo    float64
dtypes: float64(1)
memory usage: 543.0 MB
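
Note that to_sparse() was deprecated in pandas 0.25 and removed in 1.0; on newer versions the rough equivalent (a sketch, assuming a float64 column where NaN is the fill value and numpy is imported as np) is a SparseDtype:

>>> sparse_df = df1.astype(pd.SparseDtype('float64', np.nan))
>>> sparse_df.sparse.density   # fraction of values actually stored
>>> sparse_df.info()           # memory usage shrinks roughly in line with the NaN share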

Viewing Memory Usage

You can view the memory usage (docs):

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 39681584 entries, 0 to 39681583
Data columns (total 14 columns):
...
dtypes: datetime64[ns](1), float64(8), int64(1), object(4)
memory usage: 4.4+ GB

As of pandas 0.17.1, you can also do df.info(memory_usage='deep') to see memory usage including objects.
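
If you want a per-column breakdown rather than a single total, there is also df.memory_usage() (a sketch; the deep=True figures include the boxed Python objects behind object columns):

>>> df.memory_usage()           # bytes per column; object columns only count the pointers
>>> df.memory_usage(deep=True)  # also counts the Python objects those pointers refer to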

Solution 2

As noted in the comments, there are some things to try: gc.collect() (@EdChum) may clear things, for example. At least in my experience, these things sometimes work and often don't.

There is one thing that always works, however, because it is done at the OS, not language, level.

Suppose you have a function that creates an intermediate huge DataFrame, and returns a smaller result (which might also be a DataFrame):

def huge_intermediate_calc(something):
    ...
    huge_df = pd.DataFrame(...)
    ...
    return some_aggregate

Then if you do something like

import multiprocessing

result = multiprocessing.Pool(1).map(huge_intermediate_calc, [something_])[0]

The function is then executed in a different process. When that process completes, the OS reclaims all the resources it used. There's really nothing Python, pandas, or the garbage collector can do to stop that.
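
As Zertrin points out in the comments, in an interactive session (e.g. a jupyter notebook) you should also make sure the pool is closed so the worker process itself disappears; a sketch using the context manager form available since Python 3.3:

import multiprocessing

with multiprocessing.Pool(1) as pool:
    result = pool.map(huge_intermediate_calc, [something_])[0]
# leaving the with block terminates the worker, so the OS reclaims
# everything the intermediate dataframe used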

Solution 3

This solves the problem of releasing the memory for me!

import gc
import pandas as pd

del [[df_1, df_2]]  # the nested list is just a delete target; equivalent to: del df_1, df_2
gc.collect()
df_1=pd.DataFrame()
df_2=pd.DataFrame()

In the statements above, the dataframes are explicitly emptied out.

First, the names df_1 and df_2 are deleted, so the dataframes are no longer reachable from Python; the garbage collector (gc.collect()) then reclaims them; finally, each name is rebound to an empty dataframe.

How the garbage collector works is explained in more detail at https://stackify.com/python-garbage-collection/

Solution 4

del df will not free the memory if there are other references to the dataframe at the time of deletion, so you need to delete all references to it to release the memory.

In other words, every name bound to the dataframe must be deleted to trigger garbage collection.

Use objgraph to check what is holding onto the objects.
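
A minimal sketch of that check (assumes the objgraph package is installed, plus graphviz if you want the rendered image; gc.get_referrers from the standard library gives a quick textual alternative):

import gc
import objgraph

print(gc.get_referrers(df))       # every object that still holds a reference to df
objgraph.show_backrefs([df], max_depth=3, filename='df_refs.png')  # draw the back-reference graph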

Solution 5

It seems there is an issue with glibc that affects the memory allocation in Pandas: https://github.com/pandas-dev/pandas/issues/2659

The monkey patch detailed on this issue has resolved the problem for me:

# monkeypatches.py

# Solving memory leak problem in pandas
# https://github.com/pandas-dev/pandas/issues/2659#issuecomment-12021083
import sys

import pandas as pd
from ctypes import cdll, CDLL
try:
    cdll.LoadLibrary("libc.so.6")
    libc = CDLL("libc.so.6")
    libc.malloc_trim(0)  # check that malloc_trim is available (and trim once now)
except (OSError, AttributeError):
    libc = None

__old_del = getattr(pd.DataFrame, '__del__', None)

def __new_del(self):
    if __old_del:
        __old_del(self)
    libc.malloc_trim(0)  # return freed heap memory to the OS

if libc:
    print('Applying monkeypatch for pd.DataFrame.__del__', file=sys.stderr)
    pd.DataFrame.__del__ = __new_del
else:
    print('Skipping monkeypatch for pd.DataFrame.__del__: libc or malloc_trim() not found', file=sys.stderr)
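
If you would rather not patch __del__, the same trick can be applied manually after dropping a dataframe (a sketch; this relies on glibc's malloc_trim, so it only does anything on Linux):

import ctypes
import gc

del df                                    # drop the last reference to the dataframe
gc.collect()                              # let Python's allocator mark the memory as free
ctypes.CDLL('libc.so.6').malloc_trim(0)   # ask glibc to return free heap pages to the OS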

Author: b10hazard (updated on July 26, 2021)

Comments

  • b10hazard
    b10hazard almost 3 years

    I have a really large csv file that I opened in pandas as follows....

    import pandas
    df = pandas.read_csv('large_txt_file.txt')
    

    Once I do this my memory usage increases by 2GB, which is expected because this file contains millions of rows. My problem comes when I need to release this memory. I ran....

    del df
    

    However, my memory usage did not drop. Is this the wrong approach to release memory used by a pandas data frame? If it is, what is the proper way?

    • EdChum
      EdChum over 7 years
      That is correct; the garbage collector may not release the memory straight away. You can also import the gc module and call gc.collect(), but it may not recover the memory.
    • Marlon Abeykoon
      Marlon Abeykoon over 7 years
      del df is not called directly after the creation of df, right? I think there are still references to the df at the point where you delete it, so it won't be deleted; instead only the name is deleted.
    • chepner
      chepner over 7 years
      Whether or not memory reclaimed by the garbage collector is actually given back to the OS is implementation-dependent; the only guarantee the garbage collector makes is that reclaimed memory can be used by the current Python process for other things instead of asking for even more memory from the OS.
    • b10hazard
      b10hazard over 7 years
      I am calling del df right after creation. I did not add any other references to df. All I did was open ipython and run those three lines of code. If I run the same code on some other object that takes a lot of memory, say a numpy array, del nparray works perfectly.
    • pitchounet
      pitchounet over 6 years
      @b10hazard: What about something like df = '' at the end of your code? It seems to clear the RAM used by the dataframe.
    • Jaya Kommuru
      Jaya Kommuru over 3 years
      df = '' is working for me
  • b10hazard
    b10hazard over 7 years
    I tried running gc.collect() but that did not release the memory. Opening it in a separate process does work, but this seems like such an unnecessary approach for something as simple as deleting a dataframe when it is no longer needed. Is pandas supposed to behave like this?
  • Ami Tavory
    Ami Tavory over 7 years
    @b10hazard Even without pandas, I have never fully understood how Python memory works in practice. This crude technique is the only thing on which I rely.
  • Zertrin
    Zertrin over 6 years
    Works really well. However, in an ipython environment (like a jupyter notebook) I found that you need to .close() and .join() or .terminate() the pool to get rid of the spawned process. The easiest way of doing that since Python 3.3 is to use the context management protocol: with multiprocessing.Pool(1) as pool: result = pool.map(huge_intermediate_calc, [something]), which takes care of closing the pool once done.
  • goks
    goks about 6 years
    Why are the dataframes put in a nested list [[df_1,df_2]]? Any specific reason? Please explain.
  • Andrey Nikishaev
    Andrey Nikishaev about 6 years
    This works well; just don't forget to terminate and join the pool after the task is done.
  • muammar
    muammar over 5 years
    After reading several times about how to claim back memory from a Python object, this seems to be the best way to do it: create a process, and when that process is killed, the OS releases the memory.
  • giwiro
    giwiro about 5 years
    Maybe it helps someone: when creating the Pool, try using maxtasksperchild=1 in order to release the process and spawn a new one after the job is done.
  • spacedustpi
    spacedustpi over 4 years
    Why don't you just use the last two statements? I don't think you need the first two statements.
  • pedram bashiri
    pedram bashiri about 4 years
    This has to be marked 'Accepted Answer'. It briefly but clearly explains how python holds on to memory even when it doesn't really need it. The tips for saving memory are all sensible and useful. As another tip I would just add using 'multiprocessing' (as explained in @Ami's answer).
  • tdelaney
    tdelaney about 4 years
    Another option is for the subprocess to write the dataframe to disk using something like parquet. It may be faster than moving a big pickled dataframe back to the parent. It will be in the disk cache, so it's fast. And since you are now building intermediate dataframes, it could check timestamps to see if the conversion is needed.
  • dom free
    dom free almost 4 years
    A clever way to free the memory
  • ajayramesh
    ajayramesh over 3 years
    Most likely, if I use the latest version of pandas then I might not face this issue, right?
  • ajayramesh
    ajayramesh over 3 years
    I am also facing the same issue, but in my case I am using the drop API from pandas; I also added the above fix. Fingers crossed.
  • MarkNS
    MarkNS over 3 years
    @ajayramesh the github issue linked was closed "won't fix", so I assume the issue is still present with Pandas 1.0
  • hansrajswapnil
    hansrajswapnil over 3 years
    I am trying to run something similar, but with more than one argument. Check out the Process class at docs.python.org/3.7/library/multiprocessing.html
  • Josh Friedlander
    Josh Friedlander about 3 years
    "When possible use inplace=True". No, this is a myth! See this answer for why. (Otherwise, great answer overall.)
  • Kerem T
    Kerem T about 3 years
    To be specific, for file in files: with multiprocessing.Pool(1) as pool: result = pool.map(process_file, [file]) is what worked for me. You can use more than 1 worker and process multiple files at the same time.
  • Brian Yang
    Brian Yang about 3 years
    @spacedustpi because only using the last two statements won't work.
  • cicero
    cicero about 2 years
    Why is it needed to assign the df to an empty dataframe? Is calling gc after del not enough to clear the RAM?
  • hardi
    hardi about 2 years
    The last two statements make it explicit that anything not collected by gc is set to an empty dataframe afterwards!