What is the fastest way to read JSON files from S3 in Spark?

If your JSON is uniformly structured, I would advise giving Spark the schema for your JSON files; this should speed up processing tremendously.

When you don't supply a schema, Spark first reads all of the lines in the file to infer one, which, as you have observed, can take a while.

See this documentation for how to create a schema: http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
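For illustration, here is a minimal sketch of building such a schema programmatically; the field names (eventTime, userId, payload) are hypothetical placeholders, so substitute the actual fields of your JSON:

    import org.apache.spark.sql.types._

    // Hypothetical schema -- replace the field names and types with
    // the ones that actually appear in your JSON records.
    val mySchema = StructType(Seq(
      StructField("eventTime", StringType, nullable = true),
      StructField("userId", LongType, nullable = true),
      StructField("payload", StructType(Seq(
        StructField("action", StringType, nullable = true),
        StructField("value", DoubleType, nullable = true)
      )), nullable = true)
    ))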

Then you'd just have to add the schema you created to the jsonFile call:

    val df = sqlContext.jsonFile("s3://testData/*/*/*", mySchema)

At this time (I'm using Spark 1.6.2) it seems as if jsonFile has been deprecated, so switching to sqlContext.read.schema(mySchema).json(myJsonRDD) (where myJsonRDD is of type RDD[String]) might be preferable.
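For example, a sketch of the same read through the non-deprecated DataFrameReader API, applied to the S3 path from the question (the .json overload also accepts an RDD[String]):

    // Supplying the schema up front skips the expensive inference pass.
    val df = sqlContext.read
      .schema(mySchema)
      .json("s3://testData/*/*/*")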

Comments

  • Admin
    Admin almost 2 years

    I have a directory with folders, and each folder contains a compressed JSON file (.gz). Currently I am doing something like:

    val df = sqlContext.jsonFile("s3://testData/*/*/*")
    df.show()
    

    E.g.:

    testData/May/01/00/File.json.gz
    

    Each compressed file is about 11 to 17 GB.

    I have:

    1. Master: 1 c3.4xlarge
    2. Core: 19 c3.4xlarge
    3. Spark 1.5.2
    4. emr-4.2.0

    The compressed files contain multiple JSON objects per file. This process takes a huge amount of time just to read the data (just the two statements above). Is there any faster way to do this? The schema is a little complex as well. I am planning to write some queries to analyze the data set, but I am worried about the time it takes to read the data from S3.

    The maximum load can be 10 TB. I am planning to use caching to process the queries later, as sketched below.
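    A minimal sketch of that caching step, assuming the df from the read above (the table name events is a hypothetical placeholder):

    // Mark the DataFrame for in-memory caching so later queries reuse
    // the parsed data instead of re-reading the gzipped files from S3.
    df.cache()
    df.count()  // an action, to materialize the cache
    df.registerTempTable("events")  // Spark 1.5 API for running SQL queries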

  • satoukum
    satoukum over 7 years
    I created a schema to see if using one would speed up the process of reading a file, and my query went from executing in 44s to taking 47m to execute. There could be some confounding variable, such as the load on the cluster at that time, but I thought it was interesting that it took so much longer. I am running Spark version 1.5.1.
  • ADAM
    ADAM about 7 years
    For those using newer versions of Spark, sqlContext.jsonFile("...") is deprecated. Use sqlContext.read.json("...") instead.