AWS Glue predicate push down condition has no effect
Solution 1
Pushdown predicate works for partitioning columns only. In other words, your data files should be placed in hierarchically structured folders. For example, if data is located in s3://bucket/dataset/
and partitioned by year, month and day then the structure should be following:
s3://bucket/dataset/year=2018/month=7/day=18/<data-files-here>
In such case pushdown predicate would work for columns year
, month
and day
only:
datasource = glueContext.create_dynamic_frame_from_catalog(
database = source_catalog_db,
table_name = source_catalog_tbl,
push_down_predicate = "year = 2017 and month > 6 and day between 3 and 10",
transformation_ctx = "datasource")
Besides that you have to keep in mind that pushdown predicates work with s3 data sources only.
Here is a nice blog post written by AWS Glue devs about data partitioning.
Solution 2
This is great! I was able to use it to obtain the last 30 days of data using my "dt" partition column:
datasource0 = glueContext.create_dynamic_frame.from_catalog(
database = "my_db",
table_name = "my_table",
push_down_predicate = "to_date(dt) >= date_sub(current_date, 30)",
transformation_ctx = "datasource0"
)
I'm using Glue 1.0 - Spark 2.4 - Python 2.
Related videos on Youtube
Anas Ismail
Updated on September 16, 2022Comments
-
Anas Ismail about 1 year
I have a MySQL source from which I am creating a Glue Dynamic Frame with predicate push down condition as follows
datasource = glueContext.create_dynamic_frame_from_catalog( database = source_catalog_db, table_name = source_catalog_tbl, push_down_predicate = "id > 1531812324", transformation_ctx = "datasource")
I am always getting all the records in 'datasource' whatever the condition I put in 'push_down_predicate'. What am I missing?
-
Anas Ismail over 5 yearsThanks @Yuriy, It completely makes sense. I am now using Filter operation of Glue for narrowing down my results. It is not efficient since it loads the complete table in memory and then apply filters. But, this is the only option we have with Glue right now I believe.
-
Sujai Sivasamy almost 5 yearsIs there any way that I could apply push_down_predicate on RDS data sources?
-
Yuriy Bondaruk almost 5 years@Sujai No, unfortunately it works with s3 source only
-
Patrick Bray over 2 yearsIs this being pushed down into the SQL where? Trying to understand whether or not push down predicates are still not supported for JDBC
-
Patrick Bray over 2 yearsIs this still only supported for S3? Or can this be used for RDS now?