How to use Spark SQL to parse the JSON array of objects
From your given JSON data, you can view the schema of your DataFrame with printSchema:
appActiveTime.printSchema()
root
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- activetime: long (nullable = true)
| | |-- package: string (nullable = true)
Since data is an array of structs, you need to explode it and then select the struct fields, as below:
import org.apache.spark.sql.functions._
appActiveTime.withColumn("data", explode($"data"))
.select("data.*")
.show(false)
Output:
+----------+------------+
|activetime| package|
+----------+------------+
| 60000|com.browser1|
| 1205000|com.browser6|
| 1205000|com.browser7|
| 60000|com.browser1|
| 1205000|com.browser6|
+----------+------------+
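Since the stated goal is the total activation time per app, the exploded rows can then be aggregated; in Spark this would be something like `appActiveTime.withColumn("data", explode($"data")).select("data.*").groupBy("package").agg(sum("activetime"))`. As a hedged illustration (not the Spark API itself), the same flatten-then-aggregate logic with plain Scala collections, using sample rows copied from the question:

```scala
// Plain-Scala sketch (no Spark) of the flatten-then-aggregate logic.
// The sample data mirrors the two JSON records from the question.
case class Usage(pkg: String, activeTime: Long)

val rows: Seq[Seq[Usage]] = Seq(
  Seq(Usage("com.browser1", 60000L), Usage("com.browser6", 1205000L), Usage("com.browser7", 1205000L)),
  Seq(Usage("com.browser1", 60000L), Usage("com.browser6", 1205000L))
)

// flatten plays the role of explode: one output row per array element
val exploded: Seq[Usage] = rows.flatten

// groupBy + sum mirrors groupBy("package").agg(sum("activetime"))
val totals: Map[String, Long] =
  exploded.groupBy(_.pkg).map { case (p, us) => p -> us.map(_.activeTime).sum }
```

Here `totals` maps each package name to its summed activation time across all records.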
Hope this helps!
Author: gonefuture
Updated on July 09, 2022

Comments
-
gonefuture almost 2 years
I now have JSON data as follows:
{"Id":11,"data":[{"package":"com.browser1","activetime":60000},{"package":"com.browser6","activetime":1205000},{"package":"com.browser7","activetime":1205000}]}
{"Id":12,"data":[{"package":"com.browser1","activetime":60000},{"package":"com.browser6","activetime":1205000}]}
......
Each JSON record holds the activation times of apps, and the purpose is to analyze the total activation time of each app. I use Spark SQL to parse the JSON:
val sqlContext = sc.sqlContext
val behavior = sqlContext.read.json("behavior-json.log")
behavior.cache()
behavior.createOrReplaceTempView("behavior")
val appActiveTime = sqlContext.sql("SELECT data FROM behavior")  // SQL query
appActiveTime.show(100, 100)  // print the DataFrame
appActiveTime.rdd.foreach(println)  // print the RDD
But the printed DataFrame looks like this:

+--------------------------------------------------------------------------+
|data                                                                      |
+--------------------------------------------------------------------------+
|[[60000, com.browser1], [12870000, com.browser]]                          |
|[[60000, com.browser1], [120000, com.browser]]                            |
|[[60000, com.browser1], [120000, com.browser]]                            |
|[[60000, com.browser1], [1207000, com.browser]]                           |
|[[120000, com.browser]]                                                   |
|[[60000, com.browser1], [1204000, com.browser5]]                          |
|[[60000, com.browser1], [12075000, com.browser]]                          |
|[[60000, com.browser1], [120000, com.browser]]                            |
|[[60000, com.browser1], [1204000, com.browser]]                           |
|[[60000, com.browser1], [120000, com.browser]]                            |
|[[60000, com.browser1], [1201000, com.browser]]                           |
|[[1200400, com.browser5]]                                                 |
|[[60000, com.browser1], [1200400, com.browser]]                           |
|[[60000, com.browser1], [1205000, com.browser6], [1205000, com.browser7]] |
+--------------------------------------------------------------------------+
The RDD looks like this:

[WrappedArray([60000, com.browser1], [60000, com.browser1])]
[WrappedArray([120000, com.browser])]
[WrappedArray([60000, com.browser1], [1204000, com.browser5])]
[WrappedArray([12075000, com.browser], [12075000, com.browser])]
And I want to turn the data into:

com.browser1 60000
com.browser1 60000
com.browser 12075000
com.browser 12075000
...
I want to turn the array elements of each line of the RDD into individual rows. Of course, it could also be another structure that is easy to analyze. Because I have only just started learning Spark and Scala, I have tried for a long time without success, so I hope you can guide me.
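For reference, the reshaping asked for here is exactly what flatMap does on an RDD (and what explode does on a DataFrame). A minimal plain-Scala sketch of that flattening, using rows copied from the RDD output above:

```scala
// Plain-Scala sketch (no Spark): each input row holds an array of
// (activetime, package) structs; flatten so every element is its own row.
val arrays: Seq[Seq[(Long, String)]] = Seq(
  Seq((60000L, "com.browser1"), (60000L, "com.browser1")),
  Seq((120000L, "com.browser"))
)

// On an RDD this would be rdd.flatMap(identity); on a DataFrame, explode.
val flat: Seq[(Long, String)] = arrays.flatten

flat.foreach { case (time, pkg) => println(s"$pkg $time") }
```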
-
gonefuture about 6 years: thank you very much! I got it. I will learn more about Spark SQL.
-
Anand almost 3 years: @koiralo this explode seems to take a lot of time on large data.. is there any other way to do that?