Read Parquet file stored in S3 with AWS Lambda (Python 3)

Solution 1

AWS has a project (AWS Data Wrangler) that handles this, with full Lambda Layers support.

The docs include a step-by-step guide for setting it up.

Code example:

import awswrangler as wr

# Write
wr.s3.to_parquet(
    df=df,
    path="s3://...",
    dataset=True,
    database="my_database",  # Optional, only if you want it available in the Athena/Glue Catalog
    table="my_table",
    partition_cols=["PARTITION_COL_NAME"])

# Read
df = wr.s3.read_parquet(path="s3://...")

Reference
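
The same calls work inside a Lambda handler once the Data Wrangler layer is attached. A minimal sketch (the S3 path here is a placeholder, not from the original answer):

import awswrangler as wr  # provided by the AWS Data Wrangler Lambda Layer


def lambda_handler(event, context):
    # Placeholder path; point this at your own bucket/prefix
    df = wr.s3.read_parquet(path="s3://my-bucket/my-prefix/", dataset=True)
    # ... process df ...
    return {"rows": len(df)}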

Solution 2

I was able to write Parquet files to S3 using fastparquet. It's a little tricky, but the breakthrough came when I realized that to put together all of the dependencies, I had to use the exact same Linux that Lambda uses.

Here's how I did it:

1. Spin up an EC2 instance using the Amazon Linux image that Lambda uses

Source: https://docs.aws.amazon.com/lambda/latest/dg/current-supported-versions.html

Linux image: https://console.aws.amazon.com/ec2/v2/home#Images:visibility=public-images;search=amzn-ami-hvm-2017.03.1.20170812-x86_64-gp2

Note: you might need to install many packages and change the Python version to 3.6, as this Linux image is not meant for development. Here's how I looked for packages:

sudo yum list | grep python3

I installed:

python36.x86_64
python36-devel.x86_64
python36-libs.x86_64
python36-pip.noarch
python36-setuptools.noarch
python36-tools.x86_64

2. Use the instructions from here to build a zip file with all of the dependencies your script will use: dump them all into a folder and then zip them with these commands:

mkdir parquet
cd parquet
pip install -t . fastparquet
pip install -t . (any other dependencies)
# copy your python file into this folder
# zip and upload into Lambda

Note: there are some constraints I had to work around: Lambda doesn't let you upload a zip larger than 50 MB, and the unzipped package must stay under 250 MB. If anyone knows a better way to get dependencies into Lambda, please do share.
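
Once the zip is deployed, the handler itself is short. The sketch below is a hedged example, assuming fastparquet and s3fs were both bundled into the zip; the bucket and key names are placeholders:

import s3fs
from fastparquet import ParquetFile, write

# s3fs supplies the open() callable fastparquet needs for S3 paths
s3 = s3fs.S3FileSystem()


def lambda_handler(event, context):
    # Read a Parquet file from S3 (placeholder bucket/key)
    pf = ParquetFile("my-bucket/input/data.parquet", open_with=s3.open)
    df = pf.to_pandas()

    # ... transform df here ...

    # Write the result back to S3
    write("my-bucket/output/data.parquet", df, open_with=s3.open)
    return {"rows": len(df)}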

Source: Write parquet from AWS Kinesis firehose to AWS S3

Solution 3

This was an environment issue (the Lambda function in a VPC was not getting access to the bucket). Pyarrow is now working.
Hopefully the question itself gives a good enough overview of how to make all of that work.
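
For completeness, reading Parquet from S3 with pyarrow and s3fs generally looks like the sketch below; the bucket and prefix are placeholders, and depending on the pyarrow version the path may need to be given without the s3:// scheme when a filesystem object is passed explicitly:

import s3fs
import pyarrow.parquet as pq

# Read a Parquet dataset from S3 through an s3fs filesystem (placeholder path)
fs = s3fs.S3FileSystem()
dataset = pq.ParquetDataset("my-bucket/path/to/data/", filesystem=fs)
df = dataset.read().to_pandas()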

Comments

  • Ptah
    Ptah over 1 year

    I am trying to load, process and write Parquet files in S3 with AWS Lambda. My testing / deployment process is:

    It seems that there are two possible approaches, which both work locally in the Docker container:

    1. fastparquet with s3fs: Unfortunately the unzipped size of the package is bigger than 256MB and therefore I can't update the Lambda code with it.
    2. pyarrow with s3fs: I followed https://github.com/apache/arrow/pull/916 and when executed with the lambda function I get either:

      • If I prefix the URI with S3 or S3N (as in the code example): In the Lambda environment OSError: Passed non-file path: s3://mybucket/path/to/myfile in pyarrow/parquet.py, line 848. Locally I get IndexError: list index out of range in pyarrow/parquet.py, line 714
      • If I don't prefix the URI with S3 or S3N: It works locally (I can read the parquet data). In the Lambda environment, I get the same OSError: Passed non-file path: s3://mybucket/path/to/myfile in pyarrow/parquet.py, line 848.

    My questions are :

    • why do I get a different result in my docker container than I do in the Lambda environment?
    • what is the proper way to give the URI?
    • is there an accepted way to read Parquet files in S3 through AWS Lambda?

    Thanks!

  • vertigokidd
    vertigokidd almost 6 years
    Could you provide any more info on how you got this working? I keep running into an ImportError even though the pyarrow package is in my zip. It keeps saying pyarrow is required for parquet support. I am running on Python 2.7, so that could be the issue.
  • Ptah
    Ptah almost 6 years
    It might be, but it's difficult to diagnose without context. Your error does not look like a ModuleNotFoundError or ImportError, though. I followed the links I gave to create my env. Roughly: docker run -it lambci/lambda:build-python3.6 bash; mkdir lambda; cd lambda; virtualenv ~/lambda; source ~/lambda/bin/activate; pip install pyarrow; pip install pandas; pip install s3fs; cd $VIRTUAL_ENV/lib/python3.6/site-packages; zip -r9 ~/lambda.zip .; [get the lambda.zip locally]; zip -ur ../lambda.zip lambda_function.py
  • vertigokidd
    vertigokidd almost 6 years
    Thanks for the reply @Ptah. I took a guess that it was an incompatibility with the Python 2.7 runtime in AWS Lambda and I was right. Once I upgraded the code to run on 3.6 and built the upgraded zip file pyarrow worked without a problem. Hopefully this helps others who run into this.
  • phoenix
    phoenix over 5 years
    Could you give us a sample of the code? I wonder how you are writing the files and using the s3fs object... I'm trying with pyarrow.parquet.write_table() but it's not working for me. Thanks!
  • nitinr708
    nitinr708 over 3 years
    For the manylinux variant of pyarrow: if you keep only the *16 files, get rid of the other, larger copy of each lib, and then repackage them into a zip file, the Lambda package size comes in under 250 MB, which AWS Lambda accepts.
  • Powers
    Powers about 3 years
    Can you run these commands on mac OS or does this approach need to be run on a Linux machine?
  • Miguel Trejo
    Miguel Trejo about 3 years
    @Powers, they work on Mac, Linux and Windows. Are you getting any errors?