Python Pandas Merge Causing Memory Overflow
Although this is an old question, I recently came across the same problem.
In my instance, duplicate keys are required in both dataframes, and I needed a method which could tell if a merge will fit into memory ahead of computation, and if not, change the computation method.
The method I came up with is as follows:
Calculate merge size:
def merge_size(left_frame, right_frame, group_by, how='inner'):
left_groups = left_frame.groupby(group_by).size()
right_groups = right_frame.groupby(group_by).size()
left_keys = set(left_groups.index)
right_keys = set(right_groups.index)
intersection = right_keys & left_keys
left_diff = left_keys - intersection
right_diff = right_keys - intersection
left_nan = len(left_frame[left_frame[group_by] != left_frame[group_by]])
right_nan = len(right_frame[right_frame[group_by] != right_frame[group_by]])
left_nan = 1 if left_nan == 0 and right_nan != 0 else left_nan
right_nan = 1 if right_nan == 0 and left_nan != 0 else right_nan
sizes = [(left_groups[group_name] * right_groups[group_name]) for group_name in intersection]
sizes += [left_nan * right_nan]
left_size = [left_groups[group_name] for group_name in left_diff]
right_size = [right_groups[group_name] for group_name in right_diff]
if how == 'inner':
return sum(sizes)
elif how == 'left':
return sum(sizes + left_size)
elif how == 'right':
return sum(sizes + right_size)
return sum(sizes + left_size + right_size)
Note:
At present with this method, the key can only be a label, not a list. Using a list for group_by
currently returns a sum of merge sizes for each label in the list. This will result in a merge size far larger than the actual merge size.
If you are using a list of labels for the group_by, the final row size is:
min([merge_size(df1, df2, label, how) for label in group_by])
Check if this fits in memory
The merge_size
function defined here returns the number of rows which will be created by merging two dataframes together.
By multiplying this with the count of columns from both dataframes, then multiplying by the size of np.float[32/64], you can get a rough idea of how large the resulting dataframe will be in memory. This can then be compared against psutil.virtual_memory().available
to see if your system can calculate the full merge.
def mem_fit(df1, df2, key, how='inner'):
rows = merge_size(df1, df2, key, how)
cols = len(df1.columns) + (len(df2.columns) - 1)
required_memory = (rows * cols) * np.dtype(np.float64).itemsize
return required_memory <= psutil.virtual_memory().available
The merge_size
method has been proposed as an extension of pandas
in this issue. https://github.com/pandas-dev/pandas/issues/15068.
Related videos on Youtube
pefmath
Updated on June 17, 2022Comments
-
pefmath about 2 years
I'm new to Pandas and am trying to merge a few subsets of data. I'm giving a specific case where this happens, but the question is general: How/why is it happening and how can I work around it?
The data I load is around 85 Megs or so but I often watch my python session run up close to 10 gigs of memory usage then give a memory error.
I have no idea why this happens, but it's killing me as I can't even get started looking at the data the way I want to.
Here's what I've done:
Importing the Main data
import requests, zipfile, StringIO import numpy as np import pandas as pd STAR2013url="http://www3.cde.ca.gov/starresearchfiles/2013/p3/ca2013_all_csv_v3.zip" STAR2013fileName = 'ca2013_all_csv_v3.txt' r = requests.get(STAR2013url) z = zipfile.ZipFile(StringIO.StringIO(r.content)) STAR2013=pd.read_csv(z.open(STAR2013fileName))
Importing some Cross Cross Referencing Tables
STARentityList2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/ca2013entities_csv.zip" STARentityList2013fileName = "ca2013entities_csv.txt" r = requests.get(STARentityList2013url) z = zipfile.ZipFile(StringIO.StringIO(r.content)) STARentityList2013=pd.read_csv(z.open(STARentityList2013fileName)) STARlookUpTestID2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/tests.zip" STARlookUpTestID2013fileName = "Tests.txt" r = requests.get(STARlookUpTestID2013url) z = zipfile.ZipFile(StringIO.StringIO(r.content)) STARlookUpTestID2013=pd.read_csv(z.open(STARlookUpTestID2013fileName)) STARlookUpSubgroupID2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/subgroups.zip" STARlookUpSubgroupID2013fileName = "Subgroups.txt" r = requests.get(STARlookUpSubgroupID2013url) z = zipfile.ZipFile(StringIO.StringIO(r.content)) STARlookUpSubgroupID2013=pd.read_csv(z.open(STARlookUpSubgroupID2013fileName))
Renaming a Column ID to Allow for Merge
STARlookUpSubgroupID2013 = STARlookUpSubgroupID2013.rename(columns={'001':'Subgroup ID'}) STARlookUpSubgroupID2013
Successful Merge
merged = pd.merge(STAR2013,STARlookUpSubgroupID2013, on='Subgroup ID')
Try a second merge. This is where the Memory Overflow Happens
merged=pd.merge(merged, STARentityList2013, on='School Code')
I did all of this in ipython notebook, but don't think that changes anything.
-
Alexander almost 9 yearscheck the shape of each data frame before and after the merge
-
hilberts_drinking_problem almost 9 yearsI encountered this behavior when duplicates are present in the
on
key column. When I removed the duplicates everything worked fine. You can look at the DataFrame shape to check if you are getting any unexpected results. -
Andy Hayden almost 9 yearsOne thing that may help is to put the creation of the DataFrames in functions, that way the strings and zipfiles can be closed and garbage collected (rather than sitting in memory). One question is how big do you expect the final merge to be? If you have a n row merging with an m row it can be n*m rows, which could be VERY big here.
-
Andy Hayden almost 9 yearsAlso the DataFrame is 200 million rows!!!! My machine suggests I'm already around 2.5Gb just reading that.
-
pefmath almost 9 years@ Alexander: using .size() ? @ yakym: can you give an example of that? @ Andy: 200 million rows is big, but isn't that the point of panda's? Better handling data and 'big data.' I like the merging within a function idea. Gotta collect more memory tricks like this.
-