Pandas DataFrame.merge MemoryError


Solution 1

Consider two workarounds:

CSV BY CHUNKS

Apparently, read_csv can hit memory and performance problems with very large files, so load them in iterated chunks and concatenate the pieces:

import pandas as pd

cellsfilepath = 'C:\\Path\\To\\Cells\\CSVFile.csv'
# read in manageable chunks, then concatenate into one frame
tp = pd.read_csv(cellsfilepath, sep=',', iterator=True, chunksize=1000)
cell_s = pd.concat(tp, ignore_index=True)

patientsfilepath = 'C:\\Path\\To\\Patients\\CSVFile.csv'
tp = pd.read_csv(patientsfilepath, sep=',', iterator=True, chunksize=1000)
patient_s = pd.concat(tp, ignore_index=True)
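
If the reads themselves still strain memory, read_csv also accepts usecols and dtype so each chunk is stored more compactly. This is only a minimal sketch; the column names and the narrower dtype below are assumptions, so adjust them to your actual headers:

import numpy as np
import pandas as pd

# hypothetical column names and dtypes, not the question's actual CSV headers:
# load only the columns you need and keep numeric columns in narrower types
tp = pd.read_csv(cellsfilepath, sep=',', iterator=True, chunksize=1000,
                 usecols=['Description', 'Name', 'score'],
                 dtype={'score': np.float32})
cell_s = pd.concat(tp, ignore_index=True)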

CSV VIA SQL

As a database guy, I always recommend handling large data loads and merges/joins with a SQL relational engine, which scales well for such processes. I have written many a comment on dataframe merge Q/As to this effect, even in R. You can use any SQL database, including file-server DBs (Access, SQLite) or client-server DBs (MySQL, MSSQL, or others), even the one your dataframes derive from. Python maintains a built-in library for SQLite (otherwise use ODBC), and dataframes can be pushed into a database as tables using pandas' to_sql:

import sqlite3
import pandas as pd

dbfile = 'C:\\Path\\To\\SQlitedb.sqlite'
cxn = sqlite3.connect(dbfile)

# push both dataframes into the database as tables
cell_s.to_sql(name='cell_s', con=cxn, if_exists='replace')
patient_s.to_sql(name='patient_s', con=cxn, if_exists='replace')

# join inside SQLite and pull the result back into a dataframe
strSQL = 'SELECT * FROM cell_s c INNER JOIN patient_s p ON c.Description = p.id;'
# might have to adjust the above for the cell and patient params in the defined function

merged = pd.read_sql(strSQL, cxn)
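
If even the joined result is too large to pull back comfortably, the scoring and aggregation can be pushed into SQLite as well, so only the small per-(patient, cell) summary returns to pandas. This is a sketch under the assumption that the stacked frames, with the columns renamed as in the question's get_score function (s1/cell and s2/patient), are what was written to the database above:

# assumes the stacked frames (question's column names: Description, s1, cell, id, s2, patient)
# are the tables written above; SQLite does the heavy join and aggregation work
strSQLAgg = '''SELECT p.patient, c.cell, SUM(c.s1 * p.s2) AS score
               FROM cell_s c
               INNER JOIN patient_s p ON c.Description = p.id
               GROUP BY p.patient, c.cell;'''

scores = pd.read_sql(strSQLAgg, cxn)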

Solution 2

You may have to do the merge in pieces, or look into blaze (http://blaze.pydata.org). A rough sketch of the piecewise approach is below.
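
This is only a sketch of the piecewise idea, not a drop-in fix: it merges the large stacked cell frame against the patient frame one slice at a time and accumulates the per-(patient, cell) sums, so the full merged frame never has to fit in memory at once. The column names follow the question's get_score function, and the chunk size is arbitrary.

import pandas as pd

def get_score_chunked(cell_s, patient_s, chunksize=100000):
    # cell_s and patient_s are assumed to be the stacked frames from the question,
    # with columns ['Description', 'Name', 'cell', 's1'] and ['id', 'patient', 's2']
    partials = []
    for start in range(0, len(cell_s), chunksize):
        chunk = cell_s.iloc[start:start + chunksize]
        # merge only this slice, so the full merged frame never exists at once
        merged = chunk.merge(patient_s, left_on='Description', right_on='id')
        merged['score'] = merged.s1 * merged.s2
        partials.append(merged.groupby(['patient', 'cell'])['score'].sum())
    # combine the partial sums into the final per-(patient, cell) totals
    return pd.concat(partials).groupby(level=['patient', 'cell']).sum()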

Author: Thomas Matthew

On a quest to learn python and associated libraries (like pandas, matplotlib, sklearn, scipy, numpy, etc.) to answer questions in cancer biology. I'm building my quantitative toolkit as I go!

Updated on June 22, 2022

Comments

  • Thomas Matthew almost 2 years

    Goal

    My goal is to merge two DataFrames by their common column (gene names) so I can take a product of each gene score across each gene row. I'd then perform a groupby on patients and cells and sum all scores from each. The ultimate data frame should look like this:

        patient  cell 
        Pat_1    22RV1    12
                 DU145    15
                 LN18      9
        Pat_2    22RV1    12
                 DU145    15
                 LN18      9
        Pat_3    22RV1    12
                 DU145    15
                 LN18      9
    

    That last part should work fine, but I have not been able to perform the first merge on gene names due to a MemoryError. Below are snippets of each DataFrame.

    Data

    cell_s =

        Description          Name                      level_2  0
    0  LOC100009676  100009676_at  LN18_CENTRAL_NERVOUS_SYSTEM  1
    1  LOC100009676  100009676_at               22RV1_PROSTATE  2
    2  LOC100009676  100009676_at               DU145_PROSTATE  3
    3          AKT3      10000_at  LN18_CENTRAL_NERVOUS_SYSTEM  4
    4          AKT3      10000_at               22RV1_PROSTATE  5
    5          AKT3      10000_at               DU145_PROSTATE  6
    6          MED6      10001_at  LN18_CENTRAL_NERVOUS_SYSTEM  7
    7          MED6      10001_at               22RV1_PROSTATE  8
    8          MED6      10001_at               DU145_PROSTATE  9
    

    cell_s is about 10,000,000 rows

    patient_s =

                 id level_1  0
    0          MED6   Pat_1  1
    1          MED6   Pat_2  1
    2          MED6   Pat_3  1
    3  LOC100009676   Pat_1  2
    4  LOC100009676   Pat_2  2
    5  LOC100009676   Pat_3  2
    6          ABCD   Pat_1  3
    7          ABCD   Pat_2  3
    8          ABCD   Pat_3  3
        ....
    

    patient_s is about 1,200,000 rows

    Code

    def get_score(cell, patient):
        cell_s = cell.set_index(['Description', 'Name']).stack().reset_index()
        cell_s.columns = ['Description', 'Name', 'cell', 's1']
    
        patient_s = patient.set_index('id').stack().reset_index()
        patient_s.columns = ['id', 'patient', 's2']
    
        # fails here:
        merged = cell_s.merge(patient_s, left_on='Description', right_on='id')
        merged['score'] = merged.s1 * merged.s2
    
        scores = merged.groupby(['patient','cell'])['score'].sum()
        return scores
    

    I was getting a MemoryError when initially reading these files with read_csv, but specifying the dtypes resolved that issue. Confirming that my Python is 64-bit did not fix the merge issue either. I haven't reached the limits of pandas, have I?

    Python 3.4.3 |Anaconda 2.3.0 (64-bit)| Pandas 0.16.2

    • JohnE almost 9 years
      You might save quite a bit of memory by converting strings to categoricals if you haven't already. I don't believe 64-bit pandas will have any memory limitations other than what your computer has.
    • Thomas Matthew almost 9 years
      When importing with read_csv, I set dtype='category' for all columns containing strings, and I get TypeError: data type "category" not understood. I guess that's because Categorical is not a numpy array. Can you think of a workaround? Can I simply ignore the MemoryError?
    • JohnE almost 9 years
      I think you have to convert after reading in. I usually do something like df['x'] = df['x'].astype('category') (see the sketch at the end of this thread).
  • Thomas Matthew almost 9 years
    Thanks for the two workarounds. I started running read_csv in chunks, and it appears to reduce memory usage (though I haven't measured exactly). The SQL solution looks like "the way", but when I run the above code it throws an AttributeError, 'sqlite3.Cursor' object has no attribute 'cursor', at cells_s.to_sql. I traced the issue to line 1467 of pandas/io/sql.py, in execute: cur = self.con.cursor(). I checked the docs and SO, and couldn't determine the cause of the error. Any ideas?
  • Thomas Matthew almost 9 years
    EDIT: I eliminated the c = conn.cursor() following an example from Sebastian Raschka. Seems to work for my merge, but I wouldn't be able to query my new database, correct?
  • Parfait almost 9 years
    The first issue was due to my using the cursor object in the con argument of to_sql() rather than the connection object. I made the edit with a new variable, cxn. Yes, you should be able to query the db using the cursor's execute() with fetchall(), which outputs in list format. But note that above, the query results are passed into a pandas df using read_sql().
  • Nickpick almost 8 years
    It should not be necessary to do a workaround. Pandas should be able to handle this. Has anybody reported that issue to pandas?
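
A minimal sketch of the categorical conversion JohnE suggests above, assuming the repeated string columns from the posted frames; on this pandas version (0.16.2) the conversion has to happen after read_csv, since dtype='category' is not understood there:

import pandas as pd

# hedged sketch: convert the heavily repeated string columns to category dtype
# after reading, to shrink memory before the merge; column names are taken from
# the frames posted in the question
for col in ['Description', 'Name']:
    cell_s[col] = cell_s[col].astype('category')
patient_s['id'] = patient_s['id'].astype('category')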