AWS Glue predicate push down condition has no effect

11,791

Solution 1

Pushdown predicate works for partitioning columns only. In other words, your data files should be placed in hierarchically structured folders. For example, if data is located in s3://bucket/dataset/ and partitioned by year, month and day then the structure should be following:

s3://bucket/dataset/year=2018/month=7/day=18/<data-files-here>

In such case pushdown predicate would work for columns year, month and day only:

datasource = glueContext.create_dynamic_frame_from_catalog(
    database = source_catalog_db, 
    table_name = source_catalog_tbl, 
    push_down_predicate = "year = 2017 and month > 6 and day between 3 and 10", 
    transformation_ctx = "datasource")

Besides that you have to keep in mind that pushdown predicates work with s3 data sources only.

Here is a nice blog post written by AWS Glue devs about data partitioning.

Solution 2

This is great! I was able to use it to obtain the last 30 days of data using my "dt" partition column:

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "my_db",
    table_name = "my_table",
    push_down_predicate = "to_date(dt) >= date_sub(current_date, 30)", 
    transformation_ctx = "datasource0"
)

I'm using Glue 1.0 - Spark 2.4 - Python 2.

Share:
11,791

Related videos on Youtube

Anas Ismail
Author by

Anas Ismail

Updated on September 16, 2022

Comments

  • Anas Ismail
    Anas Ismail about 1 year

    I have a MySQL source from which I am creating a Glue Dynamic Frame with predicate push down condition as follows

    datasource = glueContext.create_dynamic_frame_from_catalog(
        database = source_catalog_db, 
        table_name = source_catalog_tbl, 
        push_down_predicate = "id > 1531812324", 
        transformation_ctx = "datasource")
    

    I am always getting all the records in 'datasource' whatever the condition I put in 'push_down_predicate'. What am I missing?

  • Anas Ismail
    Anas Ismail over 5 years
    Thanks @Yuriy, It completely makes sense. I am now using Filter operation of Glue for narrowing down my results. It is not efficient since it loads the complete table in memory and then apply filters. But, this is the only option we have with Glue right now I believe.
  • Sujai Sivasamy
    Sujai Sivasamy almost 5 years
    Is there any way that I could apply push_down_predicate on RDS data sources?
  • Yuriy Bondaruk
    Yuriy Bondaruk almost 5 years
    @Sujai No, unfortunately it works with s3 source only
  • Patrick Bray
    Patrick Bray over 2 years
    Is this being pushed down into the SQL where? Trying to understand whether or not push down predicates are still not supported for JDBC
  • Patrick Bray
    Patrick Bray over 2 years
    Is this still only supported for S3? Or can this be used for RDS now?