Reading a large csv from a S3 bucket using python pandas in AWS Sagemaker


I know this is quite late but here is an answer:

import boto3
import pandas as pd

bucket = 'sagemaker-dileepa'  # Or whatever you called your bucket
data_key = 'data/stores.csv'  # Where the file is within your bucket
data_location = 's3://{}/{}'.format(bucket, data_key)

df = pd.read_csv(data_location)
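
Note that passing an s3:// URL to pd.read_csv requires the s3fs package to be installed. For a file as large as the 5 GB one in the question, it may also help to read in chunks so the whole file never sits in memory at once. Here is a minimal sketch reusing the bucket and key above; the chunk size of 100,000 rows is an arbitrary illustration:

import pandas as pd

data_location = 's3://sagemaker-dileepa/data/stores.csv'

# Process the file in 100k-row chunks so only one chunk is
# decoded and parsed at a time; filter or aggregate each chunk
# before accumulating if the full frame won't fit in memory.
parts = []
for chunk in pd.read_csv(data_location, chunksize=100_000):
    parts.append(chunk)

df = pd.concat(parts, ignore_index=True)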

Comments

  • Dileepa Jayakody, almost 2 years

    I'm trying to load a large CSV (~5 GB) into pandas from an S3 bucket.

    The following is the code I tried for a small CSV of 1.4 KB:

    import boto3
    import pandas as pd
    from io import StringIO

    client = boto3.client('s3')
    obj = client.get_object(Bucket='grocery', Key='stores.csv')
    body = obj['Body']
    csv_string = body.read().decode('utf-8')
    df = pd.read_csv(StringIO(csv_string))
    

    This works well for a small CSV, but it cannot handle my 5 GB file, most likely because StringIO requires the entire file to be read and decoded into memory first.

    I also tried the code below:

    import boto3
    import pandas as pd

    s3 = boto3.client('s3')
    obj = s3.get_object(Bucket='bucket', Key='key')
    df = pd.read_csv(obj['Body'])
    

    but this gives the error below:

    ValueError: Invalid file path or buffer object type: <class 'botocore.response.StreamingBody'>
    

    Any help to resolve this error is much appreciated.
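
A note on the ValueError above: older pandas versions do not recognise botocore's StreamingBody as a valid file-like input, so one workaround is to wrap the downloaded bytes in a standard in-memory buffer. A minimal sketch (this still pulls the whole object into memory, so it suits the small-file case rather than the 5 GB one):

import io
import boto3
import pandas as pd

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')

# io.BytesIO presents the downloaded bytes as a standard buffer
# that any pandas version accepts as input to read_csv.
df = pd.read_csv(io.BytesIO(obj['Body'].read()))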