Pandas read csv out of memory


Solution 1

Based on your snippet in "out of memory error when reading csv file in chunk", here is an approach that reads the file line by line.

I assume that kb_2 is the error indicator:

groups = {}
with open("data/petaJoined.csv", "r") as large_file:
    for line in large_file:
        arr = line.split('\t')
        # assuming this structure: ka, kb_1, kb_2, timeofEvent, timeInterval
        k = arr[0] + ',' + arr[1]
        if k not in groups:
            groups[k] = {'record_count': 0, 'error_sum': 0}
        groups[k]['record_count'] += 1
        groups[k]['error_sum'] += float(arr[2])

for k, v in groups.items():
    print('{group}: {error_rate}'.format(group=k, error_rate=v['error_sum'] / v['record_count']))

This code snippet stores only the per-group counters in a dictionary and calculates the error rates after reading the entire file.

It will run out of memory only if there are too many distinct combinations of groups.

Solution 2

You haven't stated what your intended aggregation would be, but if it's just sum and count, then you could aggregate in chunks:

import pandas as pd

dfs = []
reader = pd.read_table(strFileName, chunksize=16 * 1024)  # chunk size: choose as appropriate
for chunk in reader:
    temp = chunk.agg(...)          # your logic here
    dfs.append(temp)               # collect the per-chunk partial results
df = pd.concat(dfs).agg(...)       # redo your logic on the combined partials
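
As a concrete sketch of that pattern for the data in the question (assumptions: the file is tab-separated with a header row containing ka, kb_1, kb_2, the error flag is derived from kb_1 != kb_2 as the question defines it, and strFileName is the question's variable):

import pandas as pd

partials = []
reader = pd.read_table(strFileName, chunksize=16 * 1024)
for chunk in reader:
    # a record is an error when kb_1 != kb_2 (the question's definition)
    chunk['isError'] = (chunk['kb_1'] != chunk['kb_2']).astype(int)
    # partial sums and counts per group; both combine exactly across chunks
    partials.append(chunk.groupby(['ka', 'kb_1'])['isError'].agg(['sum', 'count']))

# summing the partial sums and partial counts gives the exact totals
result = pd.concat(partials).groupby(level=['ka', 'kb_1']).sum()
result = result.rename(columns={'sum': 'errorNum', 'count': 'recordNum'})
result['errorRate'] = result['errorNum'] / result['recordNum']
print(result)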

Solution 3

What @chrisaycock suggested is the preferred method if you need to sum or count.

If you need to average, it won't work, because avg(a,b,c,d) does not, in general, equal avg(avg(a,b), avg(c,d)); the two only coincide when the chunks are all the same size.
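
If you do want to average in chunks, one workaround is to carry (sum, count) pairs per chunk and divide only at the end; a minimal sketch with toy numbers (not part of the original answer):

chunks = [[1, 2], [3]]                           # toy chunks of unequal size
partials = [(sum(c), len(c)) for c in chunks]    # [(3, 2), (3, 1)]

total_sum = sum(s for s, _ in partials)
total_count = sum(n for _, n in partials)
print(total_sum / total_count)                   # 2.0, the true mean of 1, 2, 3
# averaging the chunk means instead gives (1.5 + 3.0) / 2 = 2.25, which is wrong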

I suggest using a map-reduce-like approach with streaming.

Create a file called map-col.py:

import sys

col = 2  # index of the column to extract (e.g. 2 for kb_2; assumed here)
for line in sys.stdin:
    print(line.split('\t')[col])

And a file named reduce-avg.py:

import sys

s = 0
n = 0
for line in sys.stdin:
    s += float(line)
    n += 1
print(s / n)

And in order to run the whole thing:

cat strFileName | python map-col.py | python reduce-avg.py > output.txt

This method will work regardless of the size of the file and will not run out of memory, since only running totals are kept in memory.
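
The two scripts above compute a single overall average. For the per-group error rates in the question, a keyed reducer could be used instead; a sketch (the script name reduce-group-errors.py is hypothetical, and the kb_1 != kb_2 error rule comes from the question, not from the original answer):

# reduce-group-errors.py
import sys

groups = {}  # (ka, kb_1) -> [record_count, error_count]
for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')
    ka, kb_1, kb_2 = fields[0], fields[1], fields[2]
    counts = groups.setdefault((ka, kb_1), [0, 0])
    counts[0] += 1
    counts[1] += int(kb_1 != kb_2)  # error when kb_1 != kb_2

for (ka, kb_1), (total, errors) in groups.items():
    print('{},{},{},{:.0%},{}'.format(ka, kb_1, errors, errors / total, total))

Run it as cat strFileName | python reduce-group-errors.py > output.txt; memory then grows only with the number of distinct (ka, kb_1) groups, as in Solution 1.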

Author by sunxd

Updated on July 26, 2022
Comments

  • sunxd, almost 2 years ago

    I am trying to manipulate a large CSV file using Pandas. When I wrote this

    df = pd.read_csv(strFileName,sep='\t',delimiter='\t')
    

    it raises "pandas.parser.CParserError: Error tokenizing data. C error: out of memory". wc -l indicates there are 13822117 lines. I need to aggregate over this CSV as a data frame; is there a way to handle this other than splitting the CSV into several files and writing code to merge the results? Any suggestions on how to do that? Thanks

    The input is like this:

    columns=[ka,kb_1,kb_2,timeofEvent,timeInterval]
    0:'3M' '2345' '2345' '2014-10-5',3000
    1:'3M' '2958' '2152' '2015-3-22',5000
    2:'GE' '2183' '2183' '2012-12-31',515
    3:'3M' '2958' '2958' '2015-3-10',395
    4:'GE' '2183' '2285' '2015-4-19',1925
    5:'GE' '2598' '2598' '2015-3-17',1915
    

    And the desired output is like this:

    columns=[ka,kb,errorNum,errorRate,totalNum of records]
    '3M','2345',0,0%,1
    '3M','2958',1,50%,2
    'GE','2183',1,50%,2
    'GE','2598',0,0%,1
    

    If the data set is small, the code below (provided in another answer) could be used:

    df2 = df.groupby(['ka','kb_1'])['isError'].agg({ 'errorNum':  'sum',
                                                 'recordNum': 'count' })
    
    df2['errorRate'] = df2['errorNum'] / df2['recordNum']
    
    ka kb_1  recordNum  errorNum  errorRate
    
    3M 2345          1         0        0.0
       2958          2         1        0.5
    GE 2183          2         1        0.5
       2598          1         0        0.0
    

    (Definition of an error record: when kb_1 != kb_2, the corresponding record is treated as an abnormal record.)
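
    For reference, one way the isError column used above could be derived from that definition on a small data set (a sketch; it assumes a header row with the column names listed above):

    import pandas as pd

    df = pd.read_csv(strFileName, sep='\t')
    # a record is an error when kb_1 != kb_2
    df['isError'] = (df['kb_1'] != df['kb_2']).astype(int)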