How to explode columns?


Solution 1

A solution that doesn't convert to and from an RDD:

df.select($"Col1", $"Col2"(0) as "Col2", $"Col2"(1) as "Col3", $"Col2"(2) as "Col3")

Or, arguably nicer:

val nElements = 3
df.select(($"Col1" +: Range(0, nElements).map(idx => $"Col2"(idx) as "Col" + (idx + 2)):_*))

The size of a Spark array column is not fixed; you could, for instance, have:

+----+------------+
|Col1|        Col2|
+----+------------+
|   1|   [2, 3, 4]|
|   1|[2, 3, 4, 5]|
+----+------------+

So there is no general way to know how many columns to create up front. If you know the size is always the same, you can set nElements like this:

val nElements = df.select("Col2").first.getList(0).size

Solution 2

Just to give the PySpark version of sgvd's answer. If the array column is in Col2, then this select statement will move the first nElements elements of each array in Col2 to their own columns:

from pyspark.sql import functions as F

# nElements must already be defined, e.g. nElements = 3 for the example data
df.select([F.col('Col2').getItem(i) for i in range(nElements)])
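
To keep Col1 as well, prepend F.col('Col1') to the list, as shown in Shane Halloran's comment below.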

Solution 3

Just to add on to sgvd's solution:

If the size is not always the same, you can set nElements like this:

import org.apache.spark.sql.functions.{max, size}
import spark.implicits._

val nElements = df.select(size('Col2).as("Col2_count"))
                  .select(max("Col2_count"))
                  .first.getInt(0)
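
If some arrays are shorter than the maximum, element lookups past their end come back as null under Spark's default (non-ANSI) settings, so the same select from Solution 1 still works. A quick sketch combining the two (not from the original answers; assumes the same df):

import org.apache.spark.sql.functions.{col, max, size}

// Use the longest array to decide how many columns to create.
val nElements = df.select(max(size(col("Col2")))).first.getInt(0)

// Rows whose array is shorter than nElements get null in the extra columns.
df.select(col("Col1") +: (0 until nElements).map(i => col("Col2")(i).as(s"Col${i + 2}")): _*)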

Comments

  • Jorge Machado almost 2 years

    After running:

    val df = Seq((1, Vector(2, 3, 4)), (1, Vector(2, 3, 4))).toDF("Col1", "Col2")
    

    I have this DataFrame in Apache Spark:

    +------+---------+
    | Col1 | Col2    |
    +------+---------+
    |  1   |[2, 3, 4]|
    |  1   |[2, 3, 4]|
    +------+---------+
    

    How do I convert this into:

    +------+------+------+------+
    | Col1 | Col2 | Col3 | Col4 |
    +------+------+------+------+
    |  1   |  2   |  3   |  4   |
    |  1   |  2   |  3   |  4   |
    +------+------+------+------+
    
  • Jorge Machado almost 8 years
    What if I don't know how many columns I have in the array? I would like to have something like explodeToColumns.
  • Mohammad Adnan over 7 years
    Nice answer. Is there a way to use the more advanced Dataset APIs to achieve the above in a more type-safe way? For example, how can I get rid of $"Col1"?
  • sgvd over 7 years
    What do you mean exactly? I guess not getting rid of it by just not adding it to the select :) Or do you mean you don't want to name it explicitly? I'm not sure how type safety comes into play here, though. It is probably worth creating a new question with a minimal example of what you want to do.
  • Mohammad Adnan over 7 years
    Sure, I'll add another question. But what I mean is that Spark's Dataset provides good compile-time safety, whereas $"col1" is only evaluated at run time. So if col1 is not present in the Dataset, I'd get a compile-time error in the select statement rather than a runtime one. (A sketch of this idea appears after these comments.)
  • user422930 over 5 years
    How would you add "Col1" to the select statement?
  • Shane Halloran over 5 years
    @user422930, thank you for your question, sorry I only saw it now. Try doing something like:

    from pyspark.sql import functions as F
    df.select([F.col('Col1')] + [F.col('Col2').getItem(i) for i in range(nElements)])

    I haven't tested it, but let me know if it works or not.
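
On Mohammad Adnan's type-safety question above: a rough sketch of one possible direction using the typed Dataset API (the case classes and the fixed arity of 3 are made up for illustration; this is not from the original thread):

import spark.implicits._

case class In(Col1: Int, Col2: Seq[Int])
case class Out(Col1: Int, Col2: Int, Col3: Int, Col4: Int)

// Field accesses inside the lambda are checked at compile time;
// the as[In] conversion itself is still validated at run time.
val ds = df.as[In].map(r => Out(r.Col1, r.Col2(0), r.Col2(1), r.Col2(2)))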