Can I test AWS Glue code locally?


Solution 1

As of Aug 28, 2019, Amazon lets you download the binaries and develop, compile, debug, and single-step Glue ETL scripts and complex Spark applications in Scala and Python locally.

Check out this link: https://aws.amazon.com/about-aws/whats-new/2019/08/aws-glue-releases-binaries-of-glue-etl-libraries-for-glue-jobs/
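With the local libraries installed, a job script can run both on Glue and on your machine. One common pattern is to fall back to a small stand-in for `getResolvedOptions` when `awsglue` is not importable. A minimal sketch — the fallback parser below is a deliberately simplified, hypothetical replacement, not the real `awsglue` implementation:

```python
import sys

try:
    # Available on Glue workers and when aws-glue-libs is installed locally
    from awsglue.utils import getResolvedOptions
except ImportError:
    # Hypothetical local stand-in: handles only simple "--NAME value" pairs,
    # unlike the real awsglue.utils.getResolvedOptions
    def getResolvedOptions(argv, options):
        resolved = {}
        for opt in options:
            flag = "--" + opt
            if flag in argv:
                resolved[opt] = argv[argv.index(flag) + 1]
        return resolved

if __name__ == "__main__":
    # On Glue, sys.argv carries the job parameters; locally you pass them yourself
    args = getResolvedOptions(sys.argv, ["JOB_NAME", "input_path"])
    print(args)
```

This keeps the script's argument handling identical in both environments, so the rest of the job code doesn't need to know where it is running.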

Solution 2

I spoke to an AWS sales engineer, and they said no, you can only test Glue code by running a Glue transform (in the cloud). They mentioned that AWS was testing something called Outposts to allow on-prem operations, but that it wasn't publicly available yet. So this seems like a solid "no", which is a shame, because Glue otherwise seems pretty nice. But without unit tests, it's a no-go for me.

Solution 3

You can keep the Glue and PySpark code in separate files and unit-test the PySpark code locally. To zip the dependency files, we wrote a shell script that zips the files, uploads them to an S3 location, and then applies a CloudFormation template to deploy the Glue job. To detect dependencies, we created a (glue job)_dependency.txt file.
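A hedged sketch of the separation idea (file and function names here are made up): keep the Glue boilerplate in the job script, and put the transformation rules in a plain module that operates on simple Python values, so it can be unit-tested with no Spark or Glue installed at all:

```python
# transforms.py - pure business logic, no Glue or Spark imports,
# so it can be unit-tested locally and imported from the Glue job script.

def normalize_record(record):
    """Lower-case the keys and strip string values of one input row (a dict)."""
    return {
        key.lower(): value.strip() if isinstance(value, str) else value
        for key, value in record.items()
    }

def keep_active(records):
    """Keep only the rows whose 'status' field is 'active'."""
    return [r for r in records if r.get("status") == "active"]
```

In the Glue job script, the same functions can then be applied to the real data, for example via `dynamic_frame.toDF().rdd.map(normalize_record)` or a UDF, while the unit tests exercise them directly on plain dicts.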

Solution 4

There is now an official Docker image from AWS, so you can execute Glue locally: https://aws.amazon.com/blogs/big-data/building-an-aws-glue-etl-pipeline-locally-without-an-aws-account/

There's a nice step-by-step guide on that page as well.
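A rough sketch of the workflow from that post; the image tag and the paths inside the container are the ones current at the time of writing and may well have changed, so verify them against the blog post before use:

```shell
# Pull the official Glue local-development image (tag may have changed)
docker pull amazon/aws-glue-libs:glue_libs_1.0.0_image_01

# Start a container with your job scripts mounted into it
docker run -itd -p 8888:8888 -p 4040:4040 \
    -v "$PWD":/workspace --name glue_local \
    amazon/aws-glue-libs:glue_libs_1.0.0_image_01

# Run a job script with the Glue-flavoured spark-submit wrapper shipped in the image
docker exec glue_local /home/aws-glue-libs/bin/gluesparksubmit /workspace/my_job.py
```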

Solution 5

Not that I know of, and if you have a lot of remote assets, it will be tricky. On Windows, I normally run a development endpoint and a local Zeppelin notebook while I am authoring my job, and I shut the endpoint down each day.

You could also use the job editor > script editor to edit, save, and run the job. I'm not sure of the cost difference.

Author: lfk

Updated on December 01, 2021

Comments

  • lfk
    lfk over 2 years

After reading the Amazon docs, my understanding is that the only way to run/test a Glue script is to deploy it to a dev endpoint and debug remotely if necessary. At the same time, if the (Python) code consists of multiple files and packages, everything except the main script needs to be zipped. All this gives me the feeling that Glue is not suitable for any complex ETL task, as development and testing are cumbersome. By contrast, with plain Spark I could test my code locally without having to upload it to S3 every time, and verify the tests on a CI server without paying for a development Glue endpoint.

  • Kyle
    Kyle over 5 years
Worth noting that when Glue compiles your Scala job, it may behave a little differently from the Spark shell in a dev endpoint (i.e., at the very least, warnings are treated as fatal, which is not the case in the spark-shell).
  • lfk
    lfk over 5 years
    It doesn't seem to be suitable for production, business-critical tasks. I think it's mainly aimed at data scientists to run ad-hoc jobs and analytics. Nevertheless our AWS consultant tried really hard to convince us to use Glue instead of Spark on EMR.
  • SirKometa
    SirKometa over 4 years
    Did you have any luck using it?
  • Brian
    Brian over 4 years
Yes, but only after disabling Hive support (as in the non-accepted answer here: stackoverflow.com/a/45545595/3080611 ). Then I reran bin/setup.py from the aws-glue repo to build the jars with Maven.
  • Adriaan
    Adriaan over 4 years
    Next time, when linking to your own blog, make it very, very clear it is your blog. Otherwise you run the risk of it being deleted as spam.
  • Servadac
    Servadac almost 4 years
Could you explain how Docker can be used to launch local Glue scripts? Or maybe point us to some documentation about it? Thanks!
  • selle
    selle over 3 years
Those are unofficial Docker images. There's an official one as well: aws.amazon.com/blogs/big-data/…
  • BRad
    BRad over 2 years
    outdated answer