AWS Glue: ETL to read S3 CSV files


I believe the issue is that your testing-csv folder contains subfolders, and because you did not set recurse to True, Glue cannot find the files in the 2018-09-26 subfolder (or in any other subfolder).

You need to add the recurse option as follows:

inputGDF = glueContext.create_dynamic_frame_from_options(connection_type = "s3", connection_options = {"paths": ["s3://pinfare-glue/testing-csv"], "recurse": True}, format = "csv")
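
For readability, the same arguments can be built in a small helper and unpacked into the call. This is only a sketch: the helper name s3_csv_read_kwargs is made up for illustration and is not part of the Glue API, and the Glue call itself is left as a comment because glueContext only exists inside a Glue job.

```python
def s3_csv_read_kwargs(path, recurse=True):
    """Build keyword arguments for reading CSV files under an S3 path.

    Hypothetical helper (not part of awsglue). The keys mirror the
    arguments used in the call above.
    """
    return {
        "connection_type": "s3",
        # "recurse": True asks Glue to also read files in subfolders of each path.
        "connection_options": {"paths": [path], "recurse": recurse},
        "format": "csv",
    }


kwargs = s3_csv_read_kwargs("s3://pinfare-glue/testing-csv")

# Inside a Glue job, where glueContext has already been created:
# inputGDF = glueContext.create_dynamic_frame_from_options(**kwargs)
```

Note that inside a dict literal the option is written with a colon ("recurse": True), not an equals sign.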

Also, regarding your question about crawlers in the comments: they help infer the schema of your data files. In your case a crawler does nothing, since you are creating the DynamicFrame directly from S3.

Author by

Jiew Meng


Updated on June 12, 2022

Comments

  • Jiew Meng
    Jiew Meng almost 2 years

    I want to use an ETL job to read data from S3, since with ETL jobs I can set the DPU count to hopefully speed things up.

    But how do I do it? I tried

    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    
    ## @params: [JOB_NAME]
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)
    
    inputGDF = glueContext.create_dynamic_frame_from_options(connection_type = "s3", connection_options = {"paths": ["s3://pinfare-glue/testing-csv"]}, format = "csv")
    outputGDF = glueContext.write_dynamic_frame.from_options(frame = inputGDF, connection_type = "s3", connection_options = {"path": "s3://pinfare-glue/testing-output"}, format = "parquet")
    

    But it appears there is nothing written. My folder looks like:

    (Screenshot of the S3 folder listing; image not available.)

    What's incorrect? My output S3 path only has a file like: testing_output_$folder$

  • Shawnzam
    Shawnzam over 3 years
    "recurse": True}