SparkSQL and explode on DataFrame in Java


Solution 1

It seems possible to use a combination of org.apache.spark.sql.functions.explode(Column col) and DataFrame.withColumn(String colName, Column col) to replace the array column with its exploded version.
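A minimal sketch of that combination, using the Dataset&lt;Row&gt; API of Spark 2.x (DataFrame in Spark 1.x); the column names and sample data are illustrative:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;

public class ExplodeWithColumn {

    // Replace an array column with its exploded version: one output row per array element.
    static Dataset<Row> explodeColumn(Dataset<Row> df, String colName) {
        return df.withColumn(colName, explode(col(colName)));
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[1]")
                .appName("explode-sketch")
                .getOrCreate();

        // Hypothetical data: one row with an array of two tags.
        Dataset<Row> df = spark.sql("SELECT 'alice' AS name, array('x', 'y') AS tags");

        // Produces two rows: (alice, x) and (alice, y).
        explodeColumn(df, "tags").show();

        spark.stop();
    }
}
```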

Solution 2

I solved it in this manner: say you have an array column named "positions" containing job descriptions, one array per person identified by "fullName".

Then you get from the initial schema:

root
 |-- fullName: string (nullable = true)
 |-- positions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- companyName: string (nullable = true)
 |    |    |-- title: string (nullable = true)
...

to schema:

root
 |-- personName: string (nullable = true)
 |-- companyName: string (nullable = true)
 |-- positionTitle: string (nullable = true)

by doing:

    DataFrame personPositions = persons.select(
        persons.col("fullName").as("personName"),
        org.apache.spark.sql.functions.explode(persons.col("positions")).as("pos"));

    DataFrame test = personPositions.select(
        personPositions.col("personName"),
        personPositions.col("pos").getField("companyName").as("companyName"),
        personPositions.col("pos").getField("title").as("positionTitle"));
Author: JiriS

Updated on July 22, 2022

Comments

  • JiriS
    JiriS almost 2 years

Is there an easy way to use explode on an array column of a SparkSQL DataFrame? It's relatively simple in Scala, but this function seems to be unavailable (as mentioned in the javadoc) in Java.

An option is to use SQLContext.sql(...) with the explode function inside the query, but I'm looking for a better and, especially, cleaner way. The DataFrames are loaded from parquet files.

  • SPS
    SPS almost 3 years
In Java 8 and Spark 2.4.7, the explode(..) method returns a Column and not a DataFrame. Can you please specify the versions in which explode returns a DataFrame for you?
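For reference, with the Column-returning explode of Spark 2.x you wrap it in select, exactly as Solution 2 does; a minimal sketch mirroring that solution's schema (the sample data is illustrative, not from the original post):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;

public class ExplodeSelect {

    // Flatten an array-of-structs column: one output row per array element,
    // with the struct fields pulled up as top-level columns.
    static Dataset<Row> flattenPositions(Dataset<Row> persons) {
        return persons
                .select(col("fullName").as("personName"),
                        explode(col("positions")).as("pos"))
                .select(col("personName"),
                        col("pos").getField("companyName").as("companyName"),
                        col("pos").getField("title").as("positionTitle"));
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[1]")
                .appName("explode-select")
                .getOrCreate();

        // Hypothetical data matching the schema from Solution 2.
        Dataset<Row> persons = spark.sql(
                "SELECT 'Jane Doe' AS fullName, "
                + "array(named_struct('companyName', 'Acme', 'title', 'Engineer')) AS positions");

        flattenPositions(persons).show();
        spark.stop();
    }
}
```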