Python Pandas Merge Causing Memory Overflow

python memory pandas merge out-of-memory

10,605

Although this is an old question, I recently came across the same problem.

In my instance, duplicate keys are required in both dataframes, and I needed a method which could tell if a merge will fit into memory ahead of computation, and if not, change the computation method.

The method I came up with is as follows:

Calculate merge size:

def merge_size(left_frame, right_frame, group_by, how='inner'):
    left_groups = left_frame.groupby(group_by).size()
    right_groups = right_frame.groupby(group_by).size()
    left_keys = set(left_groups.index)
    right_keys = set(right_groups.index)
    intersection = right_keys & left_keys
    left_diff = left_keys - intersection
    right_diff = right_keys - intersection

    left_nan = len(left_frame[left_frame[group_by] != left_frame[group_by]])
    right_nan = len(right_frame[right_frame[group_by] != right_frame[group_by]])
    left_nan = 1 if left_nan == 0 and right_nan != 0 else left_nan
    right_nan = 1 if right_nan == 0 and left_nan != 0 else right_nan

    sizes = [(left_groups[group_name] * right_groups[group_name]) for group_name in intersection]
    sizes += [left_nan * right_nan]

    left_size = [left_groups[group_name] for group_name in left_diff]
    right_size = [right_groups[group_name] for group_name in right_diff]
    if how == 'inner':
        return sum(sizes)
    elif how == 'left':
        return sum(sizes + left_size)
    elif how == 'right':
        return sum(sizes + right_size)
    return sum(sizes + left_size + right_size)

Note:

At present with this method, the key can only be a label, not a list. Using a list for group_by currently returns a sum of merge sizes for each label in the list. This will result in a merge size far larger than the actual merge size.

If you are using a list of labels for the group_by, the final row size is:

min([merge_size(df1, df2, label, how) for label in group_by])

Check if this fits in memory

The merge_size function defined here returns the number of rows which will be created by merging two dataframes together.

By multiplying this with the count of columns from both dataframes, then multiplying by the size of np.float[32/64], you can get a rough idea of how large the resulting dataframe will be in memory. This can then be compared against psutil.virtual_memory().available to see if your system can calculate the full merge.

def mem_fit(df1, df2, key, how='inner'):
    rows = merge_size(df1, df2, key, how)
    cols = len(df1.columns) + (len(df2.columns) - 1)
    required_memory = (rows * cols) * np.dtype(np.float64).itemsize

    return required_memory <= psutil.virtual_memory().available

The merge_size method has been proposed as an extension of pandas in this issue. https://github.com/pandas-dev/pandas/issues/15068.

10,605

pefmath

Updated on June 17, 2022

Comments

pefmath about 2 years
I'm new to Pandas and am trying to merge a few subsets of data. I'm giving a specific case where this happens, but the question is general: How/why is it happening and how can I work around it?

The data I load is around 85 Megs or so but I often watch my python session run up close to 10 gigs of memory usage then give a memory error.

I have no idea why this happens, but it's killing me as I can't even get started looking at the data the way I want to.

Here's what I've done:

Importing the Main data
```
import requests, zipfile, StringIO
import numpy as np
import pandas as pd 


STAR2013url="http://www3.cde.ca.gov/starresearchfiles/2013/p3/ca2013_all_csv_v3.zip"
STAR2013fileName = 'ca2013_all_csv_v3.txt'

r = requests.get(STAR2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))

STAR2013=pd.read_csv(z.open(STAR2013fileName))
```
Importing some Cross Cross Referencing Tables
```
STARentityList2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/ca2013entities_csv.zip"
STARentityList2013fileName = "ca2013entities_csv.txt"
r = requests.get(STARentityList2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STARentityList2013=pd.read_csv(z.open(STARentityList2013fileName))

STARlookUpTestID2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/tests.zip"
STARlookUpTestID2013fileName = "Tests.txt"
r = requests.get(STARlookUpTestID2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STARlookUpTestID2013=pd.read_csv(z.open(STARlookUpTestID2013fileName))

STARlookUpSubgroupID2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/subgroups.zip"
STARlookUpSubgroupID2013fileName = "Subgroups.txt"
r = requests.get(STARlookUpSubgroupID2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STARlookUpSubgroupID2013=pd.read_csv(z.open(STARlookUpSubgroupID2013fileName))
```
Renaming a Column ID to Allow for Merge
```
STARlookUpSubgroupID2013 = STARlookUpSubgroupID2013.rename(columns={'001':'Subgroup ID'})
STARlookUpSubgroupID2013
```
Successful Merge
```
merged = pd.merge(STAR2013,STARlookUpSubgroupID2013, on='Subgroup ID')
```
Try a second merge. This is where the Memory Overflow Happens
```
merged=pd.merge(merged, STARentityList2013, on='School Code')
```
I did all of this in ipython notebook, but don't think that changes anything.
- Alexander almost 9 years
  
  check the shape of each data frame before and after the merge
- hilberts_drinking_problem almost 9 years
  
  I encountered this behavior when duplicates are present in the on key column. When I removed the duplicates everything worked fine. You can look at the DataFrame shape to check if you are getting any unexpected results.
- Andy Hayden almost 9 years
  
  One thing that may help is to put the creation of the DataFrames in functions, that way the strings and zipfiles can be closed and garbage collected (rather than sitting in memory). One question is how big do you expect the final merge to be? If you have a n row merging with an m row it can be n*m rows, which could be VERY big here.
- Andy Hayden almost 9 years
  
  Also the DataFrame is 200 million rows!!!! My machine suggests I'm already around 2.5Gb just reading that.
- pefmath almost 9 years
  
  @ Alexander: using .size() ? @ yakym: can you give an example of that? @ Andy: 200 million rows is big, but isn't that the point of panda's? Better handling data and 'big data.' I like the merging within a function idea. Gotta collect more memory tricks like this.