How to use Spark SQL to parse the JSON array of objects

Given your JSON data, you can first inspect the schema of your DataFrame with printSchema:

appActiveTime.printSchema()
root
 |-- data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- activetime: long (nullable = true)
 |    |    |-- package: string (nullable = true)

Since data is an array, you need to explode it and then select the fields of the struct, as below:

import org.apache.spark.sql.functions._

appActiveTime.withColumn("data", explode($"data")) // one output row per array element
  .select("data.*")                                // flatten the struct into columns
  .show(false)
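
If you prefer to stay in SQL, as in your original query, the same flattening can be written with LATERAL VIEW explode. This is a sketch assuming the behavior temp view you registered (t and d are arbitrary aliases); it produces the same output:

sqlContext.sql(
  """SELECT d.activetime, d.package
    |FROM behavior
    |LATERAL VIEW explode(data) t AS d""".stripMargin)
  .show(false)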

Output:

+----------+------------+
|activetime|     package|
+----------+------------+
|     60000|com.browser1|
|   1205000|com.browser6|
|   1205000|com.browser7|
|     60000|com.browser1|
|   1205000|com.browser6|
+----------+------------+
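
Since your stated goal is the total active time of each app, you can take this one step further with a groupBy aggregation. A minimal sketch building on the exploded DataFrame above (sum comes from the functions import already shown; total_activetime is just an illustrative column name):

appActiveTime.withColumn("data", explode($"data"))
  .select("data.*")
  .groupBy("package")
  .agg(sum("activetime").as("total_activetime"))
  .show(false)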

Hope this helps!


Comments

  • gonefuture (almost 2 years)

    I have JSON data as follows:

    {"Id":11,"data":[{"package":"com.browser1","activetime":60000},{"package":"com.browser6","activetime":1205000},{"package":"com.browser7","activetime":1205000}]}
    {"Id":12,"data":[{"package":"com.browser1","activetime":60000},{"package":"com.browser6","activetime":1205000}]} 
    ......
    

    This JSON records the active time of apps; the goal is to analyze the total active time of each app.

    I use Spark SQL (in Scala) to parse the JSON:

    val sqlContext = spark.sqlContext // spark is the SparkSession in spark-shell
    val behavior = sqlContext.read.json("behavior-json.log")
    behavior.cache()
    behavior.createOrReplaceTempView("behavior")
    val appActiveTime = sqlContext.sql("SELECT data FROM behavior") // SQL query
    appActiveTime.show(100, 100)       // print the DataFrame
    appActiveTime.rdd.foreach(println) // print the RDD
    

    But the printed DataFrame looks like this:

    +--------------------------------------------------------------------------+
    |data                                                                      |
    +--------------------------------------------------------------------------+
    |[[60000, com.browser1], [12870000, com.browser]]                          |
    |[[60000, com.browser1], [120000, com.browser]]                            |
    |[[60000, com.browser1], [120000, com.browser]]                            |
    |[[60000, com.browser1], [1207000, com.browser]]                           |
    |[[120000, com.browser]]                                                   |
    |[[60000, com.browser1], [1204000, com.browser5]]                          |
    |[[60000, com.browser1], [12075000, com.browser]]                          |
    |[[60000, com.browser1], [120000, com.browser]]                            |
    |[[60000, com.browser1], [1204000, com.browser]]                           |
    |[[60000, com.browser1], [120000, com.browser]]                            |
    |[[60000, com.browser1], [1201000, com.browser]]                           |
    |[[1200400, com.browser5]]                                                 |
    |[[60000, com.browser1], [1200400, com.browser]]                           |
    |[[60000, com.browser1], [1205000, com.browser6], [1205000, com.browser7]] |
    +--------------------------------------------------------------------------+

    The RDD looks like this:

    [WrappedArray([60000, com.browser1], [60000, com.browser1])]
    [WrappedArray([120000, com.browser])]
    [WrappedArray([60000, com.browser1], [1204000, com.browser5])]
    [WrappedArray([12075000, com.browser], [12075000, com.browser])]

    I want to turn the data into:

    com.browser1 60000
    com.browser1 60000
    com.browser 12075000
    com.browser 12075000
    ...

    I want to turn each element of the array in every RDD row into a row of its own. Of course, it could also be some other structure that is easy to analyze.

    Because I have only just started learning Spark and Scala, I have tried for a long time without success, so I hope you can guide me.
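
    For reference, the same flattening can also be done at the RDD level. This is a minimal sketch, assuming the appActiveTime DataFrame from the answer above; the struct fields are read by the names shown in the schema:

    import org.apache.spark.sql.Row

    appActiveTime.rdd
      .flatMap(row => row.getSeq[Row](0)) // unwrap the WrappedArray of structs
      .map(r => (r.getAs[String]("package"), r.getAs[Long]("activetime")))
      .foreach(println)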

  • gonefuture (about 6 years)
    Thank you very much! I got it. I will learn more about Spark SQL.
  • Anand (almost 3 years)
    @koiralo this explode seems to take a lot of time on large data. Is there another way to do it?