AWS Glue write parquet with partitions


Solution 1

From AWS Support (paraphrasing a bit):

As of today, Glue does not support the partitionBy parameter when writing to parquet. This is in the pipeline to be worked on, though.

Using the Glue API to write to parquet is required for the job bookmarking feature to work with S3 sources.

So, as of today, it is not possible to both partition parquet files AND enable the job bookmarking feature.

Edit: today (3/23/18) I found this in the documentation:

glue_context.write_dynamic_frame.from_options(
    frame = projectedEvents,
    connection_type = "s3",
    connection_options = {"path": "$outpath", "partitionKeys": ["type"]},
    format = "parquet")

That option may have always been there and both the AWS support person and I missed it, or it was only added recently. Either way, it seems to be possible now.
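
For a fuller picture, here is a minimal sketch of a complete job using that option, assuming the source table lives in the Glue Data Catalog. The database, table, and output path are placeholders; the transformation_ctx arguments and job.commit() are what enable job bookmarking, which (per the support answer above) requires writing through the Glue API:

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args['JOB_NAME'], args)

# Hypothetical catalog source; transformation_ctx enables bookmarking
events = glue_context.create_dynamic_frame.from_catalog(
    database = "my_database",
    table_name = "my_events",
    transformation_ctx = "events")

# Write partitioned parquet through the Glue API
glue_context.write_dynamic_frame.from_options(
    frame = events,
    connection_type = "s3",
    connection_options = {"path": "s3://my-bucket/output/", "partitionKeys": ["type"]},
    format = "parquet",
    transformation_ctx = "write_events")

job.commit()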

Solution 2

I use some of the columns from my frame as the partitionKeys option:

glueContext.write_dynamic_frame \
    .from_options(
        frame = some_dynamic_dataframe, 
        connection_type = "s3", 
        connection_options =  {"path":"some_path", "partitionKeys": ["month", "day"]},
        format = "parquet")

Comments

  • stewart99 almost 2 years

    I am able to write in parquet format, partitioned by a column, like so:

    jobname = args['JOB_NAME']
    #header is a spark DataFrame
    header.repartition(1).write.parquet('s3://bucket/aws-glue/{}/header/'.format(jobname), 'append', partitionBy='date')
    

    But I am not able to do this with Glue's DynamicFrame.

    header_tmp = DynamicFrame.fromDF(header, glueContext, "header")
    glueContext.write_dynamic_frame.from_options(
        frame = header_tmp,
        connection_type = "s3",
        connection_options = {"path": 's3://bucket/output/header/'},
        format = "parquet")
    

    I have tried passing partitionBy as part of the connection_options dict, since the AWS docs say that for parquet, Glue does not support any format options, but that didn't work.

    Is this possible, and if so, how? As for my reason for doing it this way: I thought it was needed for job bookmarking to work, which is currently not working for me.

  • Alex Skorokhod about 6 years
    Here is the quote from the most recent Glue documentation: "Until recently the only way to write a DynamicFrame into partitions was to convert it to a Spark SQL DataFrame before writing. However, DynamicFrames now support native partitioning using a sequence of keys, using the partitionKeys option when creating a sink." So yes, it was added only recently.
  • Raj about 5 years
    Still "partitionKeys": ["type"] feature is useless ...you will get an empty folder if you specify a partitioning key in this option. And if you remove this option then only your DF will get written in S3 with default no of partitions i.e. 200.
  • whatsinthename over 2 years
    Does it write in append mode?
  • Omega over 2 years
    For me this creates a file called my_partition=my_partition_value_$folder$ in my output folder. So it doesn't have a file extension, but it's also not a folder and the size is 0B. How do I fix this?
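
Regarding Raj's point about the default number of partitions: one way to control how many files end up in each partition folder is to repartition before writing. A minimal sketch, assuming some_dynamic_frame and glue_context already exist (both names are placeholders):

from awsglue.dynamicframe import DynamicFrame

# Repartition the underlying Spark DataFrame, then convert back to a
# DynamicFrame so the Glue API (and bookmarking) can still be used.
df = some_dynamic_frame.toDF().repartition(4)
repartitioned = DynamicFrame.fromDF(df, glue_context, "repartitioned")

glue_context.write_dynamic_frame.from_options(
    frame = repartitioned,
    connection_type = "s3",
    connection_options = {"path": "s3://bucket/output/", "partitionKeys": ["type"]},
    format = "parquet")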