AWS Glue export to parquet issue using glueContext.write_dynamic_frame.from_options


Part 1: identifying the problem

The way to find what was causing the problem was to switch the output from .parquet to .csv and drop ResolveChoice and DropNullFields (both of which Glue suggests automatically for .parquet):

datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": "s3://xxxx"}, format = "csv", transformation_ctx = "datasink2")
job.commit()

This produced a more detailed error message:

An error occurred while calling o120.pyWriteDynamicFrame. Job aborted due to stage failure: Task 5 in stage 0.0 failed 4 times, most recent failure: Lost task 5.3 in stage 0.0 (TID 182, ip-172-31-78-99.ec2.internal, executor 15): com.amazonaws.services.glue.util.FatalException: Unable to parse file: xxxx1.csv.gz

The file xxxx1.csv.gz mentioned in the error message at first appeared to be simply too big for Glue (~100 MB as .csv.gz, ~350 MB as uncompressed .csv), but as Part 2 shows, size was not the cause.
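A side note: when the error does not name the offending object, one possible tracing aid (an assumption on my side, not part of the original workflow) is Spark's input_file_name(), applied to the datasource0 frame from the original job:

from pyspark.sql.functions import input_file_name

# Hypothetical tracing aid: tag each row with the S3 object it was read
# from, so a bad record can be matched to a specific .csv.gz file.
df = datasource0.toDF().withColumn("source_file", input_file_name())
df.select("source_file").distinct().show(truncate=False)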

Part 2: the true source of the problem and the fix

As mentioned in Part 1, the export to .csv made it possible to identify the offending file.

Further investigation, by loading the .csv into R, revealed that one of the columns contained a single string record, while all other values in that column were long or NULL.

After dropping this value in R and re-uploading the data to S3, the problem vanished.
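A minimal pandas sketch of the same check in Python (the affected column is not named above, so advertiser_id is assumed here purely for illustration):

import pandas as pd

# Read the extract that Glue failed on
df = pd.read_csv("xxxx1.csv.gz", compression="gzip")

col = "advertiser_id"  # hypothetical column name

# Rows where a value is present but cannot be parsed as a number
bad = df[pd.to_numeric(df[col], errors="coerce").isna() & df[col].notna()]
print(bad)

# Drop the offending rows and write a cleaned file for re-upload to S3
df.drop(bad.index).to_csv("xxxx1_clean.csv.gz", index=False, compression="gzip")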

Note #1: the column was declared string in Athena, so I consider this behaviour a bug.

Note #2: the nature of the problem was not the size of the data. I have successfully processed files up to 200 MB .csv.gz, which correspond to roughly 600 MB of .csv.
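If editing the source data externally is undesirable, an explicit per-column cast via ResolveChoice might settle the type ambiguity inside the job itself. This is only a sketch under the same assumption that the affected column is advertiser_id; values that cannot be cast may end up as null:

# Hypothetical in-Glue alternative: force the mixed column to a single
# type instead of the blanket make_struct choice.
resolvechoice2 = ResolveChoice.apply(
    frame = applymapping1,
    specs = [("advertiser_id", "cast:long")],
    transformation_ctx = "resolvechoice2")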


Comments

  • Alex Skorokhod (almost 2 years ago)

    I have the following problem.

    The code below is auto-generated by AWS Glue.

    Its mission is to read data from Athena (backed by .csv files on S3) and transform the data into Parquet.

    The code works for the reference flight dataset and for some relatively big tables (~100 GB).

    However, in most cases it returns an error which does not tell me much.

    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.conf import SparkConf
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    
    ## @params: [JOB_NAME]
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    
    # Raise the driver result size limit above the default
    conf = (SparkConf()
        .set("spark.driver.maxResultSize", "8g"))
    
    sc = SparkContext(conf=conf)
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)
    
    # Read the source table from the Glue Data Catalog (.csv on S3)
    datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "XXX", table_name = "csv_impressions", transformation_ctx = "datasource0")
    
    # Rename columns and declare their target types
    applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("event time", "long", "event_time", "long"), ("user id", "string", "user_id", "string"), ("advertiser id", "long", "advertiser_id", "long"), ("campaign id", "long", "campaign_id", "long")], transformation_ctx = "applymapping1")
    
    # Resolve ambiguous column types by wrapping them in structs
    resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
    
    # Drop fields whose values are all null
    dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
    
    # Write the result to S3 as Parquet
    datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "s3://xxxx"}, format = "parquet", transformation_ctx = "datasink4")
    job.commit()
    

    The error message identified by AWS Glue is:

    An error occurred while calling o72.pyWriteDynamicFrame

    The log file also contains:

    Job aborted due to stage failure: ... Task failed while writing rows

    Any idea how to find out the reason for failure?

    Or what it could be?

  • Krish (almost 6 years ago)

    So how did you go about resolving it?