How to import a text file on AWS S3 into pandas without writing to disk


Solution 1

pandas uses boto for read_csv, so you should be able to:

import boto
import pandas as pd

data = pd.read_csv('s3://bucket....csv')

If you need boto3 because you are on Python 3.4+, you can do:

import boto3
import io
import pandas as pd

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
df = pd.read_csv(io.BytesIO(obj['Body'].read()))
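
If the bytes fail to decode cleanly, a commenter below reports that passing an explicit encoding fixes it; the same pattern with that tweak:

import boto3
import io
import pandas as pd

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
# encoding='utf8' avoids the decode errors some users hit here
df = pd.read_csv(io.BytesIO(obj['Body'].read()), encoding='utf8')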

Since version 0.20.1, pandas uses s3fs; see the answer below.

Solution 2

Now pandas can handle S3 URLs. You could simply do:

import pandas as pd
import s3fs

df = pd.read_csv('s3://bucket-name/file.csv')

You need to install s3fs if you don't have it: pip install s3fs
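
Because read_csv with an s3:// URL behaves like a local read, all the usual keyword arguments apply. A sketch for the tab-delimited file from the question (bucket and key are placeholders):

import pandas as pd  # s3fs must be installed for s3:// URLs

df = pd.read_csv('s3://my_bucket/filename.txt', sep='\t',
                 header=14, low_memory=False)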

Authentication

If your S3 bucket is private and requires authentication, you have two options:

1- Add access credentials to your ~/.aws/credentials config file

[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

Or

2- Set the following environment variables with their proper values (a minimal sketch follows the list):

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • AWS_SESSION_TOKEN
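
A minimal sketch of the environment-variable route, mirroring what the question itself does (the key values are the placeholders from above):

import os

# Equivalent to exporting the variables in your shell before starting Python;
# set them before pandas/s3fs creates its first S3 session
os.environ['AWS_ACCESS_KEY_ID'] = 'AKIAIOSFODNN7EXAMPLE'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
# AWS_SESSION_TOKEN is only needed for temporary credentials

import pandas as pd

df = pd.read_csv('s3://bucket-name/file.csv')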

Solution 3

This is now supported in the latest pandas. See

http://pandas.pydata.org/pandas-docs/stable/io.html#reading-remote-files

e.g.,

df = pd.read_csv('s3://pandas-test/tips.csv')
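
As a comment below points out, pandas 1.2.0+ can also take credentials inline through storage_options, which it forwards to s3fs. A sketch, assuming s3fs's key/secret parameter names (the values are placeholders):

import pandas as pd  # requires pandas >= 1.2.0 plus s3fs

df = pd.read_csv(
    's3://pandas-test/tips.csv',
    storage_options={'key': 'AKIAIOSFODNN7EXAMPLE',
                     'secret': 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'},
)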

Solution 4

With s3fs it can be done as follows:

import s3fs
import pandas as pd
fs = s3fs.S3FileSystem(anon=False)

# CSV
with fs.open('mybucket/path/to/object/foo.csv') as f:
    df = pd.read_csv(f)

# Pickle
with fs.open('mybucket/path/to/object/foo.pkl') as f:
    df = pd.read_pickle(f)
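
A comment below asks how to pick one of several AWS profiles; s3fs can select it when the filesystem is constructed. A sketch, assuming a recent s3fs where the keyword is profile (the comment thread mentions profile_name, which older releases used):

import s3fs
import pandas as pd

# 'my-dev-profile' is a placeholder for a profile name in ~/.aws/credentials
fs = s3fs.S3FileSystem(profile='my-dev-profile')

with fs.open('mybucket/path/to/object/foo.csv') as f:
    df = pd.read_csv(f)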

Solution 5

For Python 3.6+, Amazon now has a really nice library for using pandas with its services, called awswrangler.

import awswrangler as wr
import boto3


# Boto3 session
session = boto3.session.Session(aws_access_key_id='XXXX', 
                                aws_secret_access_key='XXXX')

# awswrangler passes all pd.read_csv() function args forward
df = wr.s3.read_csv(path='s3://bucket/path/',
                    boto3_session=session,
                    skiprows=2,
                    sep=';',
                    decimal=',',
                    na_values=['--'])

To install awswrangler: pip install awswrangler
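
Since awswrangler forwards the pd.read_csv() arguments, the tab-delimited file from the question could be read the same way; a sketch with credentials resolved the usual boto3 way (bucket and key are placeholders):

import awswrangler as wr

# No explicit session: boto3 falls back to env vars / ~/.aws/credentials
df = wr.s3.read_csv(path='s3://my_bucket/filename.txt', sep='\t', header=14)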


Comments

  • alpalalpal over 2 years

    I have a text file saved on S3 which is a tab-delimited table. I want to load it into pandas but cannot save it to disk first because I am running on a Heroku server. Here is what I have so far:

    import io
    import boto3
    import os
    import pandas as pd
    
    os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxx"
    os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxx"
    
    s3_client = boto3.client('s3')
    response = s3_client.get_object(Bucket="my_bucket",Key="filename.txt")
    file = response["Body"]
    
    
    pd.read_csv(file, header=14, delimiter="\t", low_memory=False)
    

    The error is:

    OSError: Expected file path name or file-like object, got <class 'bytes'> type
    

    How do I convert the response body into a format pandas will accept?

    pd.read_csv(io.StringIO(file), header=14, delimiter="\t", low_memory=False)
    
    returns
    
    TypeError: initial_value must be str or None, not StreamingBody
    
    pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)
    
    returns
    
    TypeError: 'StreamingBody' does not support the buffer interface
    

    UPDATE - Using the following worked

    file = response["Body"].read()
    

    and

    pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)
    
    • MaxU - stop genocide of UA almost 8 years
      Try it this way: io.BytesIO(file) or io.StringIO(file) instead of file in the read_csv() call.
    • IanS almost 8 years
      You could use io.StringIO as in this answer.
    • alpalalpal almost 8 years
      Neither of these suggestions worked. You can see the errors in my post edit.
    • Wim Berchmans over 7 years
      The UPDATE part worked for me. Thanks.
  • alpalalpal almost 8 years
    Is there a way to use a URL without making it public to everyone? The file needs to stay private.
  • Stefan almost 8 years
    The boto3 docs show how to configure authentication so that you can access private files as well: boto3.readthedocs.io/en/latest/guide/quickstart.html
  • Sunil Rao about 7 years
    It is throwing NoCredentialsError. How do I set S3 credentials to it? I am new to Python and boto.
  • user394430 almost 7 years
    I found that I had to do the following on the last example with boto3: df = pd.read_csv(io.BytesIO(obj['Body'].read()), encoding='utf8')
  • louis_guitton over 6 years
    I think with s3fs you can even write df = pd.read_csv('s3://mybucket/path/to/object/foo.pkl')
  • Sip over 5 years
    @louis_guitton this seems to work with pd.read_csv but not with read_pickle
  • Kyler Brown about 5 years
    Beautiful. Works in Python 3.
  • Julio Villane about 5 years
    Remember 'S3 URLs are handled as well but require installing the S3Fs library'
  • James Wierzba almost 5 years
    What about authentication?
  • Sam almost 5 years
    @JamesWierzba , I added more details on authentication to my answer above.
  • Ivo Merchiers almost 5 years
    When dealing with multiple AWS profiles, how can you select which profile should be used? s3fs has the profile_name option, but I'm not sure how that works with pandas.
  • Raveen Beemsingh almost 5 years
    A URL with auth may be difficult unless the URL is exposed as public; not sure if simple/basic HTTP auth will work.
  • Tom N Tech over 4 years
    This is the way to go. Not sure why pandas doesn't require and then import s3fs, though.
  • IanS over 4 years
    @IvoMerchiers have you found an answer to your question?
  • Ivo Merchiers over 4 years
    @IanS Not really; currently I first open the file object with s3fs (using the specified profile) and then read it with pandas, as they do here: github.com/pandas-dev/pandas/issues/16692
  • gerrit over 4 years
    This answer is out of date. Please see Wesam's answer.
  • gecko655 over 3 years
    > "When dealing with multiple aws profiles": this will be supported as of 1.2.0. github.com/pandas-dev/pandas/blob/… github.com/pandas-dev/pandas/pull/35381 pd.read_csv('s3://hoge/piyo', storage_options={'aws_access_key_id': 'xxxxxxxxxx', 'aws_secret_access_key': 'yyyyyyyyyy'})
  • Chris over 3 years
    @Stefan Thank you so much, I've been looking for days on how to simply read a csv from S3 into pandas dataframe and this worked for me! (using Python 3.6)
  • andrey.shedko over 3 years
    Please add some explanation to your code.
  • Brian Wylie about 3 years
    If you're going to be using AWS/Python/Pandas/Glue/etc... AWS Wrangler is your new best friend.