How can I use an external python library in AWS Glue?


Solution 1

It depends on whether the job is a Spark job or a Python Shell job. For a Spark job, you just need to zip the library and point the job to the zip's S3 path; the job will then be able to import it. Make sure the zip contains the package's __init__.py file.

For example, for the library you are trying to import, download openpyxl-3.0.0.tar.gz from https://pypi.org/project/openpyxl/#files, extract it, zip the openpyxl folder found inside, and store the zip in S3.
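The zipping step can also be scripted. The following is a minimal sketch, not an official Glue tool: the function name zip_package and the paths are hypothetical, and it only assumes that the package folder itself (with its __init__.py) should sit at the root of the archive.

```python
import os
import zipfile


def zip_package(package_dir, zip_path):
    """Zip a package folder so that the folder itself sits at the archive root."""
    package_dir = os.path.abspath(package_dir)
    if not os.path.isfile(os.path.join(package_dir, "__init__.py")):
        raise ValueError(f"{package_dir} has no __init__.py, so it is not an importable package")

    parent = os.path.dirname(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                if name.endswith(".pyc"):
                    continue  # skip compiled bytecode
                full_path = os.path.join(root, name)
                # Store paths relative to the parent, e.g. "openpyxl/__init__.py".
                zf.write(full_path, os.path.relpath(full_path, parent))
    return zip_path
```

For the example above, a call might look like zip_package("openpyxl-3.0.0/openpyxl", "openpyxl.zip"), assuming the archive was extracted into the current directory. Write the zip outside the package folder so it does not end up inside itself.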


On the other hand, if it is a Python Shell job, a zip file will not work. You need to create an egg file from the library. For openpyxl-3.0.0, download the archive from the same page, extract everything, and run the command python setup.py bdist_egg (use python3 instead of python if that is how you invoke Python 3).

This generates an egg file inside a dist folder, which the command also creates. Put that egg file in S3 and point the Glue job's Python Libraries setting to that path.

If you already have the library but for some reason have no setup.py, you must create one before you can run the command that generates the egg file. See http://www.blog.pythonlibrary.org/2012/07/12/python-101-easy_install-or-how-to-create-eggs/ for an example.
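As a rough illustration of what such a setup.py can contain, here is a small sketch that writes a bare-bones one into a project folder. The helper name write_minimal_setup_py and the name/version values are hypothetical; the generated file only uses setuptools' setup() and find_packages(), and a real library may need more metadata.

```python
from pathlib import Path

# Template for a minimal setup.py; {name!r} and {version!r} are filled in below.
SETUP_TEMPLATE = """\
from setuptools import setup, find_packages

setup(
    name={name!r},
    version={version!r},
    # find_packages() picks up every folder that contains an __init__.py
    packages=find_packages(),
)
"""


def write_minimal_setup_py(project_dir, name, version):
    """Write a bare-bones setup.py into project_dir and return its path."""
    setup_path = Path(project_dir) / "setup.py"
    setup_path.write_text(SETUP_TEMPLATE.format(name=name, version=version))
    return setup_path
```

After the file exists next to the package folder, running python setup.py bdist_egg in that directory should produce the egg under dist, as described above.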

Solution 2

As of Glue version 2.0, you can add external libraries directly with the --additional-python-modules job parameter.

For example, to add scikit-learn or change its version, use the following key/value pair:

"--additional-python-modules", "scikit-learn==0.21.3"
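If several modules are needed, the value can be built programmatically. This is a hedged sketch: the helper name is hypothetical, and it assumes the parameter accepts a comma-separated list of pip-style specifiers, so check the Glue docs for your version. The resulting dict is the kind of mapping that could be supplied as the job's default arguments (for instance through the console or an SDK call, not shown here).

```python
def additional_python_modules_argument(modules):
    """Build the job-parameter dict for --additional-python-modules."""
    cleaned = [m.strip() for m in modules if m.strip()]
    if not cleaned:
        raise ValueError("at least one module specifier is required")
    # Assumption: multiple modules are passed as one comma-separated string.
    return {"--additional-python-modules": ",".join(cleaned)}


default_arguments = additional_python_modules_argument(
    ["scikit-learn==0.21.3", "openpyxl==3.0.0"]
)
print(default_arguments)
# {'--additional-python-modules': 'scikit-learn==0.21.3,openpyxl==3.0.0'}
```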

More details can be found in the docs.


Author: Marlon Holland

Updated on June 04, 2022

Comments

  • Marlon Holland
    Marlon Holland almost 2 years

    First stack overflow question here. Hope I do this correctly:

    I need to use an external Python library in AWS Glue. "Openpyxl" is the name of the library.

    I follow these directions: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html

    However, after I have my zip file saved in the correct S3 location and point my Glue job to that location, I'm not sure what to actually write in the script.

    I tried the usual import openpyxl, but that just returns the following error:

    ImportError: No module named openpyxl
    

    Obviously I don't know what to do here - also relatively new to programming so I'm not sure if this is a noob question or what. Thanks in advance!

    • Sandeep Fatangare
      Sandeep Fatangare over 4 years
      Is it a Spark job or a Python Shell job?
  • Sandeep Fatangare
    Sandeep Fatangare over 4 years
    For a Python Shell job, there is no need to download the library and bundle it in an egg file. You can put install_requires=['openpyxl==3.0.0'] in setup.py, and Glue will download and install it during execution.
  • Aakash Basu
    Aakash Basu over 2 years
    It is not working for me; it still gives the "No module named" error. Any help?
  • Amit Naidu
    Amit Naidu about 2 years
    As Sandeep said, this build process is only needed for custom user libraries. Right now wheels work fine, so no eggs needed. I am still trying to understand their rationale for requiring different formats with Spark vs. Shell though. Would it make things too easy?
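
For the "No module named" error reported above, a small diagnostic at the top of the job script may help show whether the library is visible to the interpreter at all. This is only a sketch using the standard library; describe_module is a hypothetical helper name, and what appears on sys.path inside a Glue job depends on the job type and configuration.

```python
import importlib.util
import sys


def describe_module(name):
    """Report whether a top-level module can be found, without importing it."""
    spec = importlib.util.find_spec(name)
    return {
        "module": name,
        "found": spec is not None,
        "origin": getattr(spec, "origin", None),
    }


# Print the lookup result and the search path so they show up in the job logs.
print(describe_module("openpyxl"))
for entry in sys.path:
    print("sys.path entry:", entry)
```

If "found" is False and the uploaded zip, egg, or wheel does not appear among the sys.path entries, the library path configured on the job is a likely place to look.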