AWS Glue: How to handle nested JSON with varying schemas

22,449

Solution 1

I'm not sure you can do this with a table definition, but you can accomplish this with an ETL job by using a mapping function to cast the top level values as JSON strings. Documentation: [link]

import json

# Your mapping function
def flatten(rec):
    for key in rec:
        rec[key] = json.dumps(rec[key])
    return rec

old_df = glueContext.create_dynamic_frame.from_options(
    's3',
    {"paths": ['s3://...']},
    "json")

# Apply mapping function f to all DynamicRecords in DynamicFrame
new_df = Map.apply(frame=old_df, f=flatten)

From here you have the option of exporting to S3 (perhaps in Parquet or some other columnar format to optimize for querying) or directly into Redshift from my understanding, although I haven't tried it.

Solution 2

The procedure I found useful to shallow nested json:

  1. ApplyMapping for the first level as datasource0;

  2. Explode struct or array objects to get rid of element level df1 = datasource0.toDF().select(id,col1,col2,...,explode(coln).alias(coln), where explode requires from pyspark.sql.functions import explode;

  3. Select the JSON objects that you would like to keep intact by intact_json = df1.select(id, itct1, itct2,..., itctm);

  4. Transform df1 back to dynamicFrame and Relationalize the dynamicFrame as well as drop the intact columns by dataframe.drop_fields(itct1, itct2,..., itctm);

  5. Join relationalized table with the intact table based on 'id' column.

Solution 3

As of 12/20/2018, I was able to manually define a table with first level json fields as columns with type STRING. Then in the glue script the dynamicframe has the column as a string. From there, you can do an Unbox operation of type json on the fields. This will json parse the fields and derive the real schema. Combining Unbox with Filter allows you to loop through and process heterogeneous json schemas from the same input if you can loop through a list of schemas.

However, one word of caution, this is incredibly slow. I think that glue is downloading the source files from s3 during each iteration of the loop. I've been trying to find a way to persist the initial source data but it looks like .toDF derives the schema of the string json fields even if you specify them as glue StringType. I'll add a comment here if I can figure out a solution with better performance.

Share:
22,449
ehelander
Author by

ehelander

Updated on August 08, 2020

Comments

  • ehelander
    ehelander over 3 years

    Objective: We're hoping to use the AWS Glue Data Catalog to create a single table for JSON data residing in an S3 bucket, which we would then query and parse via Redshift Spectrum.

    Background: The JSON data is from DynamoDB Streams and is deeply nested. The first level of JSON has a consistent set of elements: Keys, NewImage, OldImage, SequenceNumber, ApproximateCreationDateTime, SizeBytes, and EventName. The only variation is that some records do not have a NewImage and some don't have an OldImage. Below this first level, though, the schema varies widely.

    Ideally, we would like to use Glue to only parse this first level of JSON, and basically treat the lower levels as large STRING objects (which we would then parse as needed with Redshift Spectrum). Currently, we're loading the entire record into a single VARCHAR column in Redshift, but the records are nearing the maximum size for a data type in Redshift (maximum VARCHAR length is 65535). As a result, we'd like to perform this first level of parsing before the records hit Redshift.

    What we've tried/referenced so far:

    • Pointing the AWS Glue Crawler to the S3 bucket results in hundreds of tables with a consistent top level schema (the attributes listed above), but varying schemas at deeper levels in the STRUCT elements. We have not found a way to create a Glue ETL Job that would read from all of these tables and load it into a single table.
    • Creating a table manually has not been fruitful. We tried setting each column to a STRING data type, but the job did not succeed in loading data (presumably since this would involve some conversion from STRUCTs to STRINGs). When setting columns to STRUCT, it requires a defined schema - but this is precisely what varies from one record to another, so we are not able to provide a generic STRUCT schema that works for all the records in question.
    • The AWS Glue Relationalize transform is intriguing, but not what we're looking for in this scenario (since we want to keep some of the JSON intact, rather than flattening it entirely). Redshift Spectrum supports scalar JSON data as of a couple weeks ago, but this does not work with the nested JSON we're dealing with. Neither of these appear to help with handling the hundreds of tables created by the Glue Crawler.

    Question: How would we use Glue (or some other method) to allow us to parse just the first level of these records - while ignoring the varying schemas below the elements at the top level - so that we can access it from Spectrum or load it physically into Redshift?

    I'm new to Glue. I've spent quite a bit of time in the Glue documentation and looking through (the somewhat sparse) info on forums. I could be missing something obvious - or perhaps this is a limitation of Glue in its current form. Any recommendations are welcome.

    Thanks!