How to read a single parquet file in S3 into pandas dataframe using boto3?
Solution 1
For Python 3.6+, AWS has a library called aws-data-wrangler that helps with the integration between Pandas, S3, and Parquet.
To install:
pip install awswrangler
To read a single parquet file from S3 using awswrangler 1.x.x and above:
import awswrangler as wr
df = wr.s3.read_parquet(path="s3://my_bucket/path/to/data_folder/my-file.parquet")
Solution 2
There is info on using PyArrow to read a Parquet file from an S3 bucket into a Pandas dataframe here: https://arrow.apache.org/docs/python/parquet.html
import pyarrow.parquet as pq
import s3fs
dataset = pq.ParquetDataset('s3://<s3_path_to_folder_or_file>',
                            filesystem=s3fs.S3FileSystem(),
                            filters=[('colA', '=', 'some_value'), ('colB', '>=', some_number)])
table = dataset.read()
df = table.to_pandas()
I prefer this way of reading Parquet from S3 because it encourages the use of Parquet partitions through the filters parameter, but note there is a bug affecting this approach: https://issues.apache.org/jira/browse/ARROW-2038.
Solution 3
Found a way to simply read a parquet file into a dataframe using the boto3 package.
import boto3
import io
import pandas as pd
# Download the parquet object into an in-memory buffer
buffer = io.BytesIO()
s3 = boto3.resource('s3')
obj = s3.Object('my-bucket-name', 'path/to/parquet/file')  # obj avoids shadowing the built-in object
obj.download_fileobj(buffer)
buffer.seek(0)  # rewind the buffer before handing it to pandas
df = pd.read_parquet(buffer)
print(df.head())
oya163
Updated on June 05, 2022
Comments
-
oya163 almost 2 years: I am trying to read a single parquet file stored in an S3 bucket and convert it into a pandas dataframe using boto3.
-
James O'Brien almost 4 years: That's a parquet dataset, which I believe is a folder.
-
James O'Brien almost 4 years: Maybe simpler:
import pyarrow.parquet as pq
import s3fs
s3 = s3fs.S3FileSystem()
df = pq.read_table('s3://blah/blah.parquet', filesystem=s3).to_pandas()