DynamicFrame vs DataFrame


Solution 1

DynamicFrame is safer when handling memory-intensive jobs. "The executor memory with AWS Glue dynamic frames never exceeds the safe threshold," while a Spark DataFrame, on the other hand, can hit an "Out of memory" issue on executors. (https://docs.aws.amazon.com/glue/latest/dg/monitor-profile-debug-oom-abnormalities.html)

DynamicFrames are designed to provide maximum flexibility when dealing with messy data that may lack a declared schema. Records are represented in a flexible self-describing way that preserves information about schema inconsistencies in the data.

For example, with changing requirements, an address column stored as a string in some records might be stored as a struct in later rows. Rather than failing or falling back to a string, DynamicFrames track both types and give users a number of options for resolving these inconsistencies, providing fine-grained resolution options via the ResolveChoice transforms.
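As a minimal sketch of what that resolution looks like in a Glue job (database, table, and column names here are placeholders, not from the original post):

```python
# Sketch only -- assumes an AWS Glue job environment.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Suppose 'address' is a string in some records and a struct in others;
# the DynamicFrame records both possibilities as a "choice" type.
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_db", table_name="my_table"
)

# Option 1: force every value to a single type.
casted = dyf.resolveChoice(specs=[("address", "cast:string")])

# Option 2: split the choice into separate columns
# (e.g. address_string and address_struct).
split_cols = dyf.resolveChoice(specs=[("address", "make_cols")])
```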

DynamicFrames also provide a number of powerful high-level ETL operations that are not found in DataFrames. For example, the Relationalize transform can be used to flatten and pivot complex nested data into tables suitable for transfer to a relational database. In addition, the ApplyMapping transform supports complex renames and casting in a declarative fashion.
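A rough sketch of both transforms, assuming `dyf` is an existing DynamicFrame and the S3 path and field names are placeholders:

```python
# Sketch only -- requires an AWS Glue job environment.
# Relationalize flattens nested fields and pivots arrays out into child tables.
frames = dyf.relationalize("root", "s3://my-temp-bucket/tmp/")
root = frames.select("root")  # the flattened top-level table

# ApplyMapping renames and casts columns declaratively:
# (source field, source type, target field, target type)
mapped = dyf.apply_mapping([
    ("id", "long", "customer_id", "string"),
    ("address.city", "string", "city", "string"),
])
```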

DynamicFrames are also integrated with the AWS Glue Data Catalog, so creating frames from tables is a simple operation. Writing to databases can be done through connections without specifying the password. Moreover, DynamicFrames are integrated with job bookmarks, so running these scripts in the job system allows the script to implicitly keep track of what was read and written. (https://github.com/aws-samples/aws-glue-samples/blob/master/FAQ_and_How_to.md)
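A sketch of both points, with placeholder database, table, and connection names (the `transformation_ctx` argument is what ties a read to the job bookmark):

```python
# Sketch only -- requires an AWS Glue job environment with bookmarks enabled.
# Reading from a Data Catalog table; transformation_ctx enables job bookmarks,
# so later runs pick up only data not yet processed.
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_db",
    table_name="my_table",
    transformation_ctx="read_my_table",
)

# Writing through a catalog connection -- credentials come from the
# connection definition, so no password appears in the script.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="my_jdbc_connection",
    connection_options={"dbtable": "target_table", "database": "target_db"},
)
```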

Solution 2

You can refer to the documentation here: DynamicFrame Class. It says,

A DynamicFrame is similar to a DataFrame, except that each record is self-describing, so no schema is required initially.

You want to use a DynamicFrame when:

  • Your data does not conform to a fixed schema.

Note: You can also convert the DynamicFrame to DataFrame using toDF()
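A minimal sketch of converting in both directions, assuming `glueContext` and a DynamicFrame `dyf` already exist (the `amount` column is a placeholder):

```python
# Sketch only -- requires an AWS Glue job environment.
from awsglue.dynamicframe import DynamicFrame

df = dyf.toDF()                    # Spark DataFrame, for Spark-native operations
df = df.filter(df["amount"] > 0)   # any regular DataFrame transformation

# ...and back to a DynamicFrame to use Glue writers, bookmarks, etc.
dyf2 = DynamicFrame.fromDF(df, glueContext, "dyf2")
```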



Author: Alex Oh

Updated on July 16, 2022

Comments

  • Alex Oh
    Alex Oh over 1 year

    What is the difference? I know that DynamicFrame was created for AWS Glue, but AWS Glue also supports DataFrame. When should DynamicFrame be used in AWS Glue?

    • Kyle
      Kyle over 5 years
      You may also want to use a dynamic frame just for the ability to load from the supported sources such as S3 and use job bookmarking to capture only new data each time a job runs.
  • Davos
    Davos over 4 years
    Great summary, I'd also add that DyF are a high level abstraction over Spark DF and are a great place to start. Most of the generated code will use the DyF. When something advanced is required then you can convert to Spark DF easily and continue and back to DyF if required. The biggest downside is that it is a proprietary API and you can't pick up your code and run it easily on another vendor Spark cluster like Databricks, Cloudera, Azure etc.
  • ruifgmonteiro
    ruifgmonteiro over 3 years
    I noticed that applying the toDF() method to a dynamic frame takes several minutes when the amount of data is large. What can we do to make it faster besides adding more workers to the job?
  • SantiagoRodriguez
    SantiagoRodriguez over 3 years
    In my case, I bypassed this by discarding DynamicFrames, because data type integrity was guarateed, so just used spark.read interface.
  • Brian
    Brian over 2 years
    sounds like marketing to me.
  • Brian
    Brian about 2 years
    it would be better to avoid back and forth conversions as much as possible.
  • Brian
Brian about 2 years
    A dataframe will have a set schema (schema on read). Your data can be nested, but it must be schema on read. In a case where you can't do schema on read, a dataframe will not work. I have yet to see a case where someone is using dynamic frames for something that varies so much that it can't be schema on read... I guess the only option then for non-Glue users is to use RDDs. I would love to see a benchmark of dynamic frames vs dataframes ;-) given all those cool additions made to dataframes that reduce shuffle, etc.
    A dataframe will have a set schema (schema on read). Your data can be nested, but it must be schema on read. In the case where you can't do schema on read a dataframe will not work. I have yet to see a case where someone is using dynamic frames for something that varies so much that it can't be schema on read... I guess the only option then for non glue users is to then use RDD's. I would love to see a benchmark of dynamic frames vrs dataframes.. ;-) all those cool additions made to dataframes that reduce shuffle ect..