How to explode columns?


Solution 1

A solution that doesn't convert to and from an RDD:

df.select($"Col1", $"Col2"(0) as "Col2", $"Col2"(1) as "Col3", $"Col2"(2) as "Col3")

Or, arguably nicer:

val nElements = 3
df.select(($"Col1" +: Range(0, nElements).map(idx => $"Col2"(idx) as "Col" + (idx + 2)):_*))

The size of a Spark array column is not fixed; you could, for instance, have:

+----+------------+
|Col1|        Col2|
+----+------------+
|   1|   [2, 3, 4]|
|   1|[2, 3, 4, 5]|
+----+------------+

So there is no general way to know how many columns to create up front. If you know the size is always the same, you can set nElements like this:

val nElements = df.select("Col2").first.getList(0).size

Solution 2

Just to give the PySpark version of sgvd's answer. If the array column is in Col2, then this select statement will move the first nElements elements of each array in Col2 to their own columns:

from pyspark.sql import functions as F

# nElements must already be defined, e.g. nElements = 3 for the example data
df.select([F.col('Col2').getItem(i) for i in range(nElements)])
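
To keep Col1 as well, prepend F.col('Col1') to the list, as shown in Shane Halloran's comment below.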

Solution 3

Just to add on to sgvd's solution:

If the size is not always the same, you can set nElements like this:

import org.apache.spark.sql.functions.{max, size}
import spark.implicits._

val nElements = df.select(size('Col2).as("Col2_count"))
                  .select(max("Col2_count"))
                  .first.getInt(0)
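
If some arrays are shorter than the maximum, element lookups past their end come back as null under Spark's default (non-ANSI) settings, so the same select from Solution 1 still works. A quick sketch combining the two (not from the original answers; assumes the same df):

import org.apache.spark.sql.functions.{col, max, size}

// Use the longest array to decide how many columns to create.
val nElements = df.select(max(size(col("Col2")))).first.getInt(0)

// Rows whose array is shorter than nElements get null in the extra columns.
df.select(col("Col1") +: (0 until nElements).map(i => col("Col2")(i).as(s"Col${i + 2}")): _*)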

Comments

  • Jorge Machado almost 2 years

    After running:

    val df = Seq((1, Vector(2, 3, 4)), (1, Vector(2, 3, 4))).toDF("Col1", "Col2")
    

    I have this DataFrame in Apache Spark:

    +------+---------+
    | Col1 | Col2    |
    +------+---------+
    |  1   |[2, 3, 4]|
    |  1   |[2, 3, 4]|
    +------+---------+
    

    How do I convert this into:

    +------+------+------+------+
    | Col1 | Col2 | Col3 | Col4 |
    +------+------+------+------+
    |  1   |  2   |  3   |  4   |
    |  1   |  2   |  3   |  4   |
    +------+------+------+------+
    
  • Jorge Machado almost 8 years
    What if I don't know how many columns I have in the array? I would like to have something like explodeToColumns.
  • Mohammad Adnan over 7 years
    Nice answer. Is there a way to use the more advanced Dataset APIs to achieve the above in a more type-safe way? For example, how can I get rid of $"Col1"?
  • sgvd over 7 years
    What do you mean exactly? I guess not getting rid of it by just not adding it to the select :) Or do you mean you don't want to name it explicitly? I'm not sure how type safety comes into play here, though. It is probably worth creating a new question with a minimal example of what you want to do.
  • Mohammad Adnan over 7 years
    Sure, I'll add another question. But what I mean is that Spark's Dataset provides good compile-time safety, whereas $"col1" is only evaluated at run time. So if col1 is not present in the Dataset, I'd get a compile-time error in the select statement rather than a runtime one. (A sketch of this idea appears after these comments.)
  • user422930 over 5 years
    How would you add "Col1" to the select statement?
  • Shane Halloran over 5 years
    @user422930, thank you for your question, sorry I only saw it now. Try doing something like:

    from pyspark.sql import functions as F
    df.select([F.col('Col1')] + [F.col('Col2').getItem(i) for i in range(nElements)])

    I haven't tested it, but let me know if it works or not.
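
On Mohammad Adnan's type-safety question above: a rough sketch of one possible direction using the typed Dataset API (the case classes and the fixed arity of 3 are made up for illustration; this is not from the original thread):

import spark.implicits._

case class In(Col1: Int, Col2: Seq[Int])
case class Out(Col1: Int, Col2: Int, Col3: Int, Col4: Int)

// Field accesses inside the lambda are checked at compile time;
// the as[In] conversion itself is still validated at run time.
val ds = df.as[In].map(r => Out(r.Col1, r.Col2(0), r.Col2(1), r.Col2(2)))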