Which is the fastest way to read Json Files from S3 : Spark
If your JSON is uniformly structured, I would advise giving Spark the schema for your JSON files; this should speed up processing tremendously.
When you don't supply a schema, Spark first reads through the input to infer one, which, as you have observed, can take a while.
See this documentation for how to create a schema: http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
Then you'd just have to add the schema you created to the jsonFile call:
val df = sqlContext.jsonFile("s3://testData/*/*/*", mySchema)
At this time (I'm using Spark 1.6.2) it seems as if jsonFile has been deprecated, so switching to sqlContext.read.schema(mySchema).json(myJsonRDD) (where myJsonRDD is of type RDD[String]) might be preferable.
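As an illustrative sketch of the approach above (the field names are hypothetical; only the S3 path comes from the question), the schema could be built programmatically with StructType and passed to the DataFrame reader:

```scala
import org.apache.spark.sql.types._

// Hypothetical schema matching JSON records such as:
// {"timestamp": "...", "userId": "...", "bytes": 123}
val mySchema = StructType(Seq(
  StructField("timestamp", StringType, nullable = true),
  StructField("userId",    StringType, nullable = true),
  StructField("bytes",     LongType,   nullable = true)
))

// With an explicit schema, Spark skips the inference pass over the data.
val df = sqlContext.read.schema(mySchema).json("s3://testData/*/*/*")
```

The key point is that every field Spark should parse must appear in the StructType; fields present in the JSON but absent from the schema are simply dropped.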
Admin
Updated on June 22, 2022

Comments
-
Admin almost 2 years
I have a directory with folders, and each folder contains a compressed JSON file (.gz). Currently I am doing:
val df = sqlContext.jsonFile("s3://testData/*/*/*")
df.show()
E.g.:
testData/May/01/00/File.json.gz
Each compressed file is about 11 to 17 GB.
I have:
- Master: 1 c3.4xlarge
- Core: 19 c3.4xlarge
- Spark 1.5.2
- emr-4.2.0
The compressed files contain multiple JSON objects per file. This process takes a huge amount of time just to read (just the two statements above). Is there any faster way to do this? The schema is a little complex as well. I am planning to write some queries to analyze the data set, but I am worried about the time it takes to read data from S3.
The maximum load can be 10 TB. I am planning to cache the data to process queries later.
-
satoukum over 7 years
I created a schema to see if using one would speed up reading a file, and my query went from executing in 44 s to taking 47 min. There could be some confounding variable, such as the load on the cluster at the time, but I thought it was interesting that it took so much longer. I am running Spark version 1.5.1.
-
ADAM about 7 years
For those using newer versions of Spark, sqlContext.jsonFile("...") is deprecated. Use sqlContext.read.json("...") instead.
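-
For Spark 2.x and later, the entry point is SparkSession rather than SQLContext. A minimal sketch of the non-deprecated call, combined with the explicit-schema advice from the answer (the app name and field names are assumptions, not from the original post):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("read-json-s3").getOrCreate()

// Hypothetical schema; replace with the fields of your own records.
val mySchema = StructType(Seq(
  StructField("timestamp", StringType, nullable = true),
  StructField("bytes",     LongType,   nullable = true)
))

// Same idea as in the answer: supplying the schema avoids the inference scan.
val df = spark.read.schema(mySchema).json("s3://testData/*/*/*")
df.show()
```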