Spark from_json - StructType and ArrayType

Solution 1

The problem is that you don't have fully valid JSON for that schema. Your JSON is missing a couple of things:

  • First, you are missing the surrounding {} that turn the string into a JSON object
  • Second, your schema declares a field named "", but the JSON never wraps the array in that key
  • Lastly, you are missing the closing ]

Try replacing it with:

val dfExample2= spark.sql("""select "{\"\":[{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }, { \"FirstName\":\"Franky\", \"Surname\":\"Man\" }]}" as theJson""")
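
Then re-apply from_json with the schemaExample2 from your question (a minimal sketch, assuming dfExample2 and schemaExample2 above are still in scope; the "" field of the schema now matches the empty key that wraps the array):

import org.apache.spark.sql.functions.from_json

// the "" field of schemaExample2 lines up with the {"":[...]} wrapper
val dfICanWorkWith = dfExample2.select(from_json($"theJson", schemaExample2))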

and you will get:

scala> dfICanWorkWith.collect()
res12: Array[org.apache.spark.sql.Row] = Array([[WrappedArray([Johnny,Boy], [Franky,Man])]])

Solution 2

As of Spark 2.4, the schema_of_json function helps:

> SELECT schema_of_json('[{"col":0}]');
  array<struct<col:int>>
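
The same helper is also available from the Scala API (a minimal sketch, assuming Spark 2.4+ in a spark-shell session):

import org.apache.spark.sql.functions.schema_of_json

// derive the DDL schema string from a one-row sample document
spark.range(1).select(schema_of_json("""[{"col":0}]""")).show(false)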

In your case you can then use the code below to parse that array of JSON objects:

scala> spark.sql("""select from_json("[{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }, { \"FirstName\":\"Franky\", \"Surname\":\"Man\" }]", 'array<struct<FirstName:string,Surname:string>>' ) as theJson""").show(false)
+------------------------------+
|theJson                       |
+------------------------------+
|[[Johnny, Boy], [Franky, Man]]|
+------------------------------+
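
Equivalently, in the DataFrame API you can pass an ArrayType directly as the from_json schema, with no wrapping struct around an empty field name (a sketch assuming Spark 2.2+ and a spark-shell session):

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._ // pre-imported in spark-shell; needed in standalone code for toDF and $

val elementSchema = new StructType()
  .add("FirstName", StringType)
  .add("Surname", StringType)

val df = Seq("""[{ "FirstName":"Johnny", "Surname":"Boy" }, { "FirstName":"Franky", "Surname":"Man" }]""")
  .toDF("theJson")

// each JSON object becomes a struct inside the resulting array column
df.select(from_json($"theJson", ArrayType(elementSchema)) as "parsed").show(false)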

Comments

  • Dragick almost 2 years

    I have a data set that comes in as XML, and one of the nodes contains JSON. Spark is reading this in as a StringType, so I am trying to use from_json() to convert the JSON to a DataFrame.

    I am able to convert a string of JSON, but how do I write the schema to work with an Array?

    String without Array - Working nicely

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._
    
    val schemaExample = new StructType()
              .add("FirstName", StringType)
              .add("Surname", StringType)
    
    val dfExample = spark.sql("""select "{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }" as theJson""")
    
    val dfICanWorkWith = dfExample.select(from_json($"theJson", schemaExample))
    
    dfICanWorkWith.collect()
    
    // Results \\
    res19: Array[org.apache.spark.sql.Row] = Array([[Johnny,Boy]])
    

    String with an Array - Can't figure this one out

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._
    
    val schemaExample2 = new StructType()
        .add("", ArrayType(new StructType()
            .add("FirstName", StringType)
            .add("Surname", StringType)))
    
    val dfExample2= spark.sql("""select "[{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }, { \"FirstName\":\"Franky\", \"Surname\":\"Man\" }" as theJson""")
    
    val dfICanWorkWith = dfExample2.select(from_json($"theJson", schemaExample2))
    
    dfICanWorkWith.collect()
    
    // Result \\
    res22: Array[org.apache.spark.sql.Row] = Array([null])