How to read a single Parquet file in S3 into a pandas DataFrame using boto3?

Solution 1

For Python 3.6+, AWS provides a library called aws-data-wrangler (installed as awswrangler) that helps with the integration between pandas, S3, and Parquet.

To install it, run:

pip install awswrangler

To read a single Parquet file from S3 using awswrangler 1.x.x and above:

import awswrangler as wr
df = wr.s3.read_parquet(path="s3://my_bucket/path/to/data_folder/my-file.parquet")

Solution 2

There is info on using PyArrow to read a Parquet file from an S3 bucket into a Pandas dataframe here: https://arrow.apache.org/docs/python/parquet.html

import pyarrow.parquet as pq
import s3fs

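# Replace the path, column names, and values with your own;
# some_number is a placeholder for a numeric value.
dataset = pq.ParquetDataset(
    's3://<s3_path_to_folder_or_file>',
    filesystem=s3fs.S3FileSystem(),
    filters=[('colA', '=', 'some_value'), ('colB', '>=', some_number)],
)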
table = dataset.read()
df = table.to_pandas()

I prefer this way of reading Parquet from S3 because it encourages the use of Parquet partitions through the filters parameter. Note, however, that there is a bug affecting this approach: https://issues.apache.org/jira/browse/ARROW-2038.
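To make the meaning of the filters argument concrete, here is a minimal pure-Python sketch of how a flat list of (column, operator, value) tuples is combined: every tuple must hold for a row to be kept (a logical AND). This only illustrates the semantics and is not how PyArrow implements filtering; the helper name row_matches and the sample rows are made up for the example.

```python
import operator

# Comparison operators that may appear in a filter tuple, mapped to Python functions.
OPS = {
    "=": operator.eq,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def row_matches(row, filters):
    """Return True if the row satisfies every (column, op, value) tuple."""
    return all(OPS[op](row[col], value) for col, op, value in filters)


# Made-up sample rows standing in for the contents of a Parquet dataset.
rows = [
    {"colA": "some_value", "colB": 5},
    {"colA": "some_value", "colB": 1},
    {"colA": "other", "colB": 9},
]
filters = [("colA", "=", "some_value"), ("colB", ">=", 3)]

kept = [r for r in rows if row_matches(r, filters)]
print(kept)
```

Only the first sample row passes both conditions. With a partitioned dataset, a filter like this can let the reader skip whole partitions instead of loading them.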

Solution 3

I found a simple way to read a Parquet file into a DataFrame using the boto3 package.

import boto3
import io
import pandas as pd

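# Download the Parquet file from S3 into an in-memory buffer
buffer = io.BytesIO()
s3 = boto3.resource('s3')
obj = s3.Object('my-bucket-name', 'path/to/parquet/file')  # avoid shadowing the built-in `object`
obj.download_fileobj(buffer)
buffer.seek(0)  # rewind so the reader starts from the beginning of the data
df = pd.read_parquet(buffer)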

print(df.head())
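
The other solutions take an s3:// URI, while the boto3 approach needs the bucket name and object key separately. As a rough sketch, the boto3 steps can be wrapped in a function that accepts a URI. The names split_s3_uri and read_parquet_from_s3 are hypothetical helpers, not part of boto3 or pandas, and the function assumes AWS credentials are already configured.

```python
import io


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key). Hypothetical helper."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {uri!r}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"S3 URI must contain a bucket and a key: {uri!r}")
    return bucket, key


def read_parquet_from_s3(uri):
    """Download one Parquet object into memory and load it with pandas."""
    # Imported here so split_s3_uri can be used without boto3/pandas installed.
    import boto3
    import pandas as pd

    bucket, key = split_s3_uri(uri)
    buffer = io.BytesIO()
    boto3.resource("s3").Object(bucket, key).download_fileobj(buffer)
    buffer.seek(0)  # rewind before handing the buffer to pandas
    return pd.read_parquet(buffer)


bucket, key = split_s3_uri("s3://my-bucket-name/path/to/parquet/file")
print(bucket, key)
```

With credentials in place, the call would look like df = read_parquet_from_s3("s3://my-bucket-name/path/to/parquet/file"). Note that the whole object is held in memory, so this suits single files of moderate size.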

Author: oya163

Updated on June 05, 2022

Comments

  • oya163, almost 2 years

    I am trying to read a single Parquet file stored in an S3 bucket and convert it into a pandas DataFrame using boto3.

  • James O'Brien, almost 4 years

    That's a Parquet dataset, which I believe is a folder.

  • James O'Brien, almost 4 years

    Maybe simpler:

    ```
    import pyarrow.parquet as pq
    import s3fs

    s3 = s3fs.S3FileSystem()
    df = pq.read_table('s3://blah/blah.parquet', filesystem=s3).to_pandas()
    ```