Pandas DataFrame.merge MemoryError
Solution 1
Consider two workarounds:
CSV By CHUNKS
Apparently, read_csv can suffer performance issues and therefore large files must load in iterated chunks.
cellsfilepath = 'C:\\Path\To\Cells\CSVFile.csv'
tp = pd.io.parsers.read_csv(cellsfilepath, sep=',', iterator=True, chunksize=1000)
cell_s = pd.concat(tp, ignore_index=True)
patientsfilepath = 'C:\\Path\To\Patients\CSVFile.csv'
tp = pd.io.parsers.read_csv(patientsfilepath, sep=',', iterator=True, chunksize=1000)
patient_s = pd.concat(tp, ignore_index=True)
CSV VIA SQL
As a database guy, I always recommend handling large data loads and merging/joining with a SQL relational engine that scales well for such processes. I have written many a comment on dataframe merge Q/As to this effect -even in R. You can use any SQL database including file server dbs (Access, SQLite) or client server dbs (MySQL, MSSQL, or other), even where your dfs derive. Python maintains a built-in library for SQLite (otherwise you use ODBC); and dataframes can be pushed into databases as tables using pandas to_sql:
import sqlite3
dbfile = 'C:\\Path\To\SQlitedb.sqlite'
cxn = sqlite3.connect(dbfile)
c = cxn.cursor()
cells_s.to_sql(name='cell_s', con = cxn, if_exists='replace')
patient_s.to_sql(name='patient_s', con = cxn, if_exists='replace')
strSQL = 'SELECT * FROM cell_s c INNER JOIN patient_s p ON c.Description = p.id;'
# MIGHT HAVE TO ADJUST ABOVE FOR CELL AND PATIENT PARAMS IN DEFINED FUNCTION
merged = pd.read_sql(strSQL, cxn)
Solution 2
You may have to do it in pieces, or look into blaze. http://blaze.pydata.org
Thomas Matthew
On a quest to learn python and associated libraries (like pandas, matplotlib, sklearn, scipy, numpy, etc.) to answer questions in cancer biology. I'm building my quantitative toolkit as I go!
Updated on June 22, 2022Comments
-
Thomas Matthew almost 2 years
Goal
My goal is to merge two DataFrames by their common column (gene names) so I can take a product of each gene score across each gene row. I'd then perform a
groupby
on patients and cells and sum all scores from each. The ultimate data frame should look like this:patient cell Pat_1 22RV1 12 DU145 15 LN18 9 Pat_2 22RV1 12 DU145 15 LN18 9 Pat_3 22RV1 12 DU145 15 LN18 9
That last part should work fine, but I have not been able to perform the first merge on gene names due to a
MemoryError
. Below are snippets of each DataFrame.Data
cell_s =
Description Name level_2 0 0 LOC100009676 100009676_at LN18_CENTRAL_NERVOUS_SYSTEM 1 1 LOC100009676 100009676_at 22RV1_PROSTATE 2 2 LOC100009676 100009676_at DU145_PROSTATE 3 3 AKT3 10000_at LN18_CENTRAL_NERVOUS_SYSTEM 4 4 AKT3 10000_at 22RV1_PROSTATE 5 5 AKT3 10000_at DU145_PROSTATE 6 6 MED6 10001_at LN18_CENTRAL_NERVOUS_SYSTEM 7 7 MED6 10001_at 22RV1_PROSTATE 8 8 MED6 10001_at DU145_PROSTATE 9
cell_s is about 10,000,000 rows
patient_s =
id level_1 0 0 MED6 Pat_1 1 1 MED6 Pat_2 1 2 MED6 Pat_3 1 3 LOC100009676 Pat_1 2 4 LOC100009676 Pat_2 2 5 LOC100009676 Pat_3 2 6 ABCD Pat_1 3 7 ABCD Pat_2 3 8 ABCD Pat_3 3 ....
patient_s is about 1,200,000 rows
Code
def get_score(cell, patient): cell_s = cell.set_index(['Description', 'Name']).stack().reset_index() cell_s.columns = ['Description', 'Name', 'cell', 's1'] patient_s = patient.set_index('id').stack().reset_index() patient_s.columns = ['id', 'patient', 's2'] # fails here: merged = cell_s.merge(patient_s, left_on='Description', right_on='id') merged['score'] = merged.s1 * merged.s2 scores = merged.groupby(['patient','cell'])['score'].sum() return scores
I was getting a MemoryError when initially
read_csv
ing these files, but then specifying the dtypes resolved the issue. Confirming that my python is 64 bit did not fix my issue either. I haven't reached the limitations on pandas, have I?Python 3.4.3 |Anaconda 2.3.0 (64-bit)| Pandas 0.16.2
-
JohnE almost 9 yearsYou might save quite a bit of memory by converting strings to categoricals if you haven't already. I don't believe 64-bit pandas will have any memory limitations other than what your computer has.
-
Thomas Matthew almost 9 yearsWhen importing with 'read_csv', I set the 'dtype='category' for all columns containing strings, I get
TypeError: data type "category" not understood
. I guess because Categorical is not a numpy array. Can you think of a work around? Can I simply ignore the MemoryError? -
JohnE almost 9 yearsI think you have to convert after reading in. I usu do something like
df['x'] = df['x'].astype('category')
-
-
Thomas Matthew almost 9 yearsThanks for the two work arounds. I started running
read_csv
in chunks, and it appears to reduce the memory usage (though I haven't measured exactly). The SQL solution looks like "the way", but when I run the above code, it throws an attribute error of'sqlite3.Cursor' object has no attribute 'cursor'
atcells_s.to_sql
. I located the issue to line 1467 ofpandas/io/sql.py
, in executecur = self.con.cursor()
. Checked the docs and SO, and couldn't determine the cause of the error. Any ideas? -
Thomas Matthew almost 9 yearsEDIT: I eliminated the
c = conn.cursor()
following an example from Sebastian Raschka. Seems to work for my merge, but I wouldn't be able to query my new database, correct? -
Parfait almost 9 yearsFirst issue is due to me using the cursor object in
con
argument ofto_sql()
and not the connection object. I made the edit with a new variable,cxn
. Yes, you should be able to query db using cursor's execute() with fetchall() that outputs in list format. But do note in above, query results are being passed into a pandas df using read_sql(). -
Nickpick almost 8 yearsIt should not be necessary to do a workaround. Pandas should be able to handle this. Has anybody reported that issue to pandas?