SparkSQL and explode on DataFrame in Java


Solution 1

It seems possible to use a combination of org.apache.spark.sql.functions.explode(Column col) and DataFrame.withColumn(String colName, Column col) to replace the array column with its exploded version.
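A minimal sketch of that combination, using the Dataset&lt;Row&gt; API of Spark 2.x (DataFrame in Spark 1.x); the column names and sample data are illustrative:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;

public class ExplodeWithColumn {

    // Replace an array column with its exploded version: one output row per array element.
    static Dataset<Row> explodeColumn(Dataset<Row> df, String colName) {
        return df.withColumn(colName, explode(col(colName)));
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[1]")
                .appName("explode-sketch")
                .getOrCreate();

        // Hypothetical data: one row with an array of two tags.
        Dataset<Row> df = spark.sql("SELECT 'alice' AS name, array('x', 'y') AS tags");

        // Produces two rows: (alice, x) and (alice, y).
        explodeColumn(df, "tags").show();

        spark.stop();
    }
}
```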

Solution 2

I solved it in this manner: say you have an array column named "positions" containing job descriptions, one array per person identified by "fullName".

Then you get from the initial schema:

root
 |-- fullName: string (nullable = true)
 |-- positions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- companyName: string (nullable = true)
 |    |    |-- title: string (nullable = true)
...

to schema:

root
 |-- personName: string (nullable = true)
 |-- companyName: string (nullable = true)
 |-- positionTitle: string (nullable = true)

by doing:

    DataFrame personPositions = persons.select(
        persons.col("fullName").as("personName"),
        org.apache.spark.sql.functions.explode(persons.col("positions")).as("pos"));

    DataFrame test = personPositions.select(
        personPositions.col("personName"),
        personPositions.col("pos").getField("companyName").as("companyName"),
        personPositions.col("pos").getField("title").as("positionTitle"));
Author: JiriS

Updated on July 22, 2022

Comments

  • JiriS
    JiriS almost 2 years

Is there an easy way to use explode on an array column of a SparkSQL DataFrame? It's relatively simple in Scala, but this function seems to be unavailable (as mentioned in the javadoc) in Java.

An option is to use SQLContext.sql(...) with the explode function inside the query, but I'm looking for a better and, especially, cleaner way. The DataFrames are loaded from parquet files.

  • SPS
    SPS almost 3 years
In Java 8 and Spark 2.4.7, the explode(..) method returns a Column and not a DataFrame. Can you please specify the versions in which explode returns a DataFrame for you?
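For reference, with the Column-returning explode of Spark 2.x you wrap it in select, exactly as Solution 2 does; a minimal sketch mirroring that solution's schema (the sample data is illustrative, not from the original post):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;

public class ExplodeSelect {

    // Flatten an array-of-structs column: one output row per array element,
    // with the struct fields pulled up as top-level columns.
    static Dataset<Row> flattenPositions(Dataset<Row> persons) {
        return persons
                .select(col("fullName").as("personName"),
                        explode(col("positions")).as("pos"))
                .select(col("personName"),
                        col("pos").getField("companyName").as("companyName"),
                        col("pos").getField("title").as("positionTitle"));
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[1]")
                .appName("explode-select")
                .getOrCreate();

        // Hypothetical data matching the schema from Solution 2.
        Dataset<Row> persons = spark.sql(
                "SELECT 'Jane Doe' AS fullName, "
                + "array(named_struct('companyName', 'Acme', 'title', 'Engineer')) AS positions");

        flattenPositions(persons).show();
        spark.stop();
    }
}
```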