Decompress a zip file in AWS Glue


Solution 1

Glue can handle decompression, but it won't be optimal: the gzip format is not splittable, which means only one executor will work on the file. More info about that here.

You can instead try decompressing the file with a Lambda function and then invoking a Glue crawler on the new folder, as sketched below.
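A minimal sketch of that approach, assuming a Lambda function triggered by S3 ObjectCreated events; the 'decompressed/' prefix and the crawler name are placeholders. Note that if the upload is actually a .zip or .tar.gz archive holding multiple files, you would need zipfile or tarfile instead of gzip:

import gzip
import boto3

s3 = boto3.client('s3')
glue = boto3.client('glue')

def handler(event, context):
    # Triggered by the S3 ObjectCreated event for the uploaded .gz file
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']

    # Download and decompress the object in memory
    compressed = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
    data = gzip.decompress(compressed)

    # Write the uncompressed data to a new prefix for the crawler to pick up
    target_key = 'decompressed/' + key.rsplit('.gz', 1)[0]
    s3.put_object(Bucket=bucket, Key=target_key, Body=data)

    # Kick off the crawler on the new folder (crawler name is hypothetical)
    glue.start_crawler(Name='decompressed-data-crawler')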

Solution 2

Use glueContext.create_dynamic_frame.from_options and specify the compression type in the connection options. Similarly, the output can be compressed while writing to S3 (see the write sketch after the snippet below). The snippet below worked for bzip; change the compression value to gzip and try.

I tried the Target Location in the Glue console UI and found that bzip and gzip are supported when writing dynamic frames to S3, then changed the generated code to read a compressed file from S3. This is not directly covered in the docs.

I am not sure about the efficiency. It took around 180 seconds of execution time to read, apply a Map transform, convert to a DataFrame, and convert back to a DynamicFrame for a 400 MB bzip-compressed CSV file. Please note that execution time is different from the start_time and end_time shown in the console.

# Read a bzip-compressed CSV from S3 into a DynamicFrame
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type='s3',
    connection_options={
        'paths': ['s3://bucketname/folder/filename_20180218_004625.bz2'],
        'compression': 'bzip'
    },
    format='csv',
    format_options={'separator': ';'}
)
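For the write side, a hedged sketch along the same lines, assuming the snippet above has populated datasource0; the output path is a placeholder:

# Write the frame back to S3 as gzip-compressed CSV
datasink0 = glueContext.write_dynamic_frame.from_options(
    frame=datasource0,
    connection_type='s3',
    connection_options={
        'path': 's3://bucketname/output/',
        'compression': 'gzip'
    },
    format='csv',
    format_options={'separator': ';'}
)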
Author: Yuva

A tech-savvy professional with more than 20 years of development, project management, and team-building experience. Currently exploring AWS services and big-data components for real-time streaming with batch processing of data. My hobbies include watching cricket, listening to instrumental music, and watching western-classical fusion concerts.

Updated on December 04, 2022

Comments

  • Yuva
    Yuva over 1 year

I have a compressed gzip file in an S3 bucket. The client will upload the files to the S3 bucket daily. When uncompressed, the gzip will contain 10 files in CSV format, all with the same schema. I need to uncompress the gzip file and, using a Glue data crawler, create a schema before running an ETL script on a dev endpoint.

Is Glue capable of decompressing the zip file and creating a data catalog? Is there any Glue library available that we can use directly in the Python ETL script? Or should I opt for a Lambda/any other utility, so that as soon as the zip file is uploaded I run a utility to decompress it and provide it as input to Glue?

    Appreciate any replies.

  • Yuva
    Yuva about 6 years
Thanks Natalia, I accept your answer, as I was looking for a yes/no confirmation on decompression in Glue, and you answered it with a yes. Do you have any code snippet or procedure for applying decompression in Glue? I have a use case and am looking for solutions, so I can try whatever helps.
  • Laerte Junior
    Laerte Junior over 5 years
Hey @Arun, I am in a situation where I need to use gzip, but it is not a CSV file and the delimiter is a space, not ;. I guess your snippet can help me, but to be honest I don't know where to put it. I am just starting with Glue. Is it some option in the crawlers in the UI?
  • Arun Ramachandran
    Arun Ramachandran over 5 years
@LaerteJunior For space, use separator: ' '; for tab, separator: '\t'. There is a UI in which a script will be autogenerated, or you can write your own script. In that script you can replace the datasource code with the given snippet, for example as sketched below.
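    A sketch of that change, reusing the snippet from the answer with a space separator (the path is a placeholder):

    # Read a gzip-compressed, space-delimited file from S3
    datasource0 = glueContext.create_dynamic_frame.from_options(
        connection_type='s3',
        connection_options={
            'paths': ['s3://bucketname/folder/file.gz'],
            'compression': 'gzip'
        },
        format='csv',
        format_options={'separator': ' '}
    )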
  • Laerte Junior
    Laerte Junior over 5 years
Thanks my friend, I will try to find it. Just a question: is it in the crawlers?
  • Arun Ramachandran
    Arun Ramachandran over 5 years
1. The code snippet in the answer can be used by creating a job: select ETL in the left menu and fill in the necessary fields. After that, a code snippet will be generated. In that snippet you need to change the datasource code to read from the compressed file and use space as the separator. 2. The crawler is only used to create metadata about the data in S3. A crawler without any special configuration for compressed files can still crawl them and create metadata.