How to import a text file on AWS S3 into pandas without writing to disk
Solution 1
pandas uses boto for read_csv, so you should be able to:
import boto
import pandas as pd
data = pd.read_csv('s3://bucket....csv')
If you need boto3 because you are on Python 3.4+, you can:
import boto3
import io
import pandas as pd

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
df = pd.read_csv(io.BytesIO(obj['Body'].read()))
Since version 0.20.1, pandas uses s3fs; see the answer below.
Solution 2
Now pandas can handle S3 URLs. You could simply do:
import pandas as pd
import s3fs
df = pd.read_csv('s3://bucket-name/file.csv')
You need to install s3fs if you don't have it:
pip install s3fs
Authentication
If your S3 bucket is private and requires authentication, you have two options:
1- Add access credentials to your ~/.aws/credentials config file:
[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Or
2- Set the following environment variables with their proper values:
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_SESSION_TOKEN
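A minimal shell sketch of option 2, assuming a Linux/macOS shell; the values are placeholders, and boto3/s3fs pick these up from the uppercase environment variable names:

```shell
# Placeholder credentials; replace with your own values
export AWS_ACCESS_KEY_ID="AKIAIOSFODNN7EXAMPLE"
export AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
# Only needed when using temporary credentials (e.g. an assumed role)
export AWS_SESSION_TOKEN="your-session-token"
```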
Solution 3
This is now supported in latest pandas. See
http://pandas.pydata.org/pandas-docs/stable/io.html#reading-remote-files
e.g.:
df = pd.read_csv('s3://pandas-test/tips.csv')
Solution 4
With s3fs it can be done as follows:
import s3fs
import pandas as pd
fs = s3fs.S3FileSystem(anon=False)
# CSV
with fs.open('mybucket/path/to/object/foo.csv') as f:
    df = pd.read_csv(f)

# Pickle
with fs.open('mybucket/path/to/object/foo.pkl') as f:
    df = pd.read_pickle(f)
Solution 5
For Python 3.6+, Amazon now has a really nice library for using pandas with its services, called awswrangler.
import awswrangler as wr
import boto3
# Boto3 session
session = boto3.session.Session(aws_access_key_id='XXXX',
                                aws_secret_access_key='XXXX')

# awswrangler passes forward all pd.read_csv() function args
df = wr.s3.read_csv(path='s3://bucket/path/',
                    boto3_session=session,
                    skiprows=2,
                    sep=';',
                    decimal=',',
                    na_values=['--'])
To install awswrangler: pip install awswrangler
alpalalpal
Updated on September 23, 2021

Comments
-
alpalalpal over 2 years
I have a text file saved on S3 which is a tab delimited table. I want to load it into pandas but cannot save it first because I am running on a heroku server. Here is what I have so far.
import io
import boto3
import os
import pandas as pd

os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxx"

s3_client = boto3.client('s3')
response = s3_client.get_object(Bucket="my_bucket", Key="filename.txt")
file = response["Body"]
pd.read_csv(file, header=14, delimiter="\t", low_memory=False)
the error is
OSError: Expected file path name or file-like object, got <class 'bytes'> type
How do I convert the response body into a format pandas will accept?
pd.read_csv(io.StringIO(file), header=14, delimiter="\t", low_memory=False) returns TypeError: initial_value must be str or None, not StreamingBody

pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False) returns TypeError: 'StreamingBody' does not support the buffer interface
UPDATE - Using the following worked
file = response["Body"].read()
and
pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)
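For reference, the same BytesIO conversion can be exercised locally with an in-memory bytes payload standing in for response["Body"].read(); the sample data below is made up for illustration:

```python
import io

import pandas as pd

# Made-up tab-delimited payload, standing in for response["Body"].read()
raw = b"col_a\tcol_b\n1\t2\n3\t4\n"

# Wrapping the raw bytes in BytesIO gives read_csv the file-like object it expects
df = pd.read_csv(io.BytesIO(raw), delimiter="\t")
print(df.shape)  # (2, 2)
```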
-
MaxU - stop genocide of UA almost 8 years
Try it this way: io.BytesIO(file) or io.StringIO(file) instead of file in the read_csv() call.
-
IanS almost 8 years
You could use io.StringIO as in this answer.
-
alpalalpal almost 8 years
Neither of these suggestions worked. You can see the errors in my post edit.
-
Wim Berchmans over 7 years
The UPDATE part worked for me. Thanks.
-
-
alpalalpal almost 8 years
Is there a way to use a URL without making it public to everyone? The file needs to stay private.
-
Stefan almost 8 years
The boto3 docs show how to configure authentication so that you can access private files as well: boto3.readthedocs.io/en/latest/guide/quickstart.html
-
Sunil Rao about 7 years
It is throwing NoCredentialsError. How do I set S3 credentials for it? I am new to Python and boto.
-
user394430 almost 7 years
I found that I had to do the following on the last example with boto3:
df = pd.read_csv(io.BytesIO(obj['Body'].read()), encoding='utf8')
-
louis_guitton over 6 years
I think with s3fs you can even write
df = pd.read_csv('s3://mybucket/path/to/object/foo.pkl')
-
Sip over 5 years
@louis_guitton this seems to work with pd.read_csv but not with read_pickle.
-
Kyler Brown about 5 years
Beautiful. Works in Python 3.
-
Julio Villane about 5 years
Remember: 'S3 URLs are handled as well but require installing the S3Fs library'.
-
James Wierzba almost 5 years
How about authentication?
-
James Wierzba almost 5 years
What about authenticating?
-
Sam almost 5 years
@JamesWierzba, I added more details on authentication to my answer above.
-
Ivo Merchiers almost 5 years
When dealing with multiple AWS profiles, how can you select which profile should be used? s3fs has the profile_name option, but I'm not sure how that works with pandas.
-
Raveen Beemsingh almost 5 years
A URL with auth may be difficult unless the URL is exposed as public; not sure if simple/basic HTTP auth will work.
-
Tom N Tech over 4 years
This is the way to go. Not sure why pandas doesn't require and then import s3fs, though?
-
IanS over 4 years
@IvoMerchiers have you found an answer to your question?
-
Ivo Merchiers over 4 years
@IanS Not really. Currently, I first open the file object with s3fs (using the specified profile) and then I read it with pandas, like they do here: github.com/pandas-dev/pandas/issues/16692
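That workaround can be sketched as a small helper; the function name is made up for illustration, and depending on the s3fs version the keyword argument may be profile or profile_name:

```python
import pandas as pd


def read_csv_with_profile(path: str, profile: str) -> pd.DataFrame:
    """Read a CSV from S3 using a named AWS profile via s3fs, then pandas."""
    # Imported lazily so the module loads even without the optional s3fs dependency
    import s3fs

    # Selects the matching credentials section from ~/.aws/credentials
    fs = s3fs.S3FileSystem(profile=profile)
    with fs.open(path) as f:
        return pd.read_csv(f)
```

For example, read_csv_with_profile('mybucket/data.csv', 'analytics') would read with the (hypothetical) [analytics] profile's credentials.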
-
gerrit over 4 years
This answer is out of date. Please see Wesam's answer.
-
gecko655 over 3 years
> When dealing with multiple aws profiles
It will be supported as of 1.2.0: github.com/pandas-dev/pandas/blob/… github.com/pandas-dev/pandas/pull/35381
pd.read_csv('s3://hoge/piyo', storage_options={'aws_access_key_id': 'xxxxxxxxxx', 'aws_secret_access_key': 'yyyyyyyyyy'})
-
Chris over 3 years
@Stefan Thank you so much, I've been looking for days on how to simply read a csv from S3 into a pandas dataframe and this worked for me! (using Python 3.6)
-
andrey.shedko over 3 years
Please add some explanation to your code.
-
Brian Wylie about 3 years
If you're going to be using AWS/Python/Pandas/Glue/etc., AWS Wrangler is your new best friend.