How to explode columns?
Solution 1
A solution that doesn't convert to and from RDD:
df.select($"Col1", $"Col2"(0) as "Col2", $"Col2"(1) as "Col3", $"Col2"(2) as "Col4")
Or arguably nicer:
val nElements = 3
df.select($"Col1" +: Range(0, nElements).map(idx => $"Col2"(idx) as s"Col${idx + 2}"): _*)
The size of a Spark array column is not fixed, you could for instance have:
+----+------------+
|Col1| Col2|
+----+------------+
| 1| [2, 3, 4]|
| 1|[2, 3, 4, 5]|
+----+------------+
So there is no general way to know the number of columns up front and generate them. If you know the size is always the same, you can set nElements like this:
val nElements = df.select("Col2").first.getList(0).size
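To see the end result this is aiming for, here is a minimal plain-Python sketch (no Spark; the row data is hypothetical) of what the generated select produces:

```python
# Plain-Python sketch of the explode-to-columns idea: each row holds Col1
# plus an array in Col2, and the array is fanned out into columns
# Col2..Col(nElements + 1). Data here is hypothetical.
rows = [(1, [2, 3, 4]), (1, [2, 3, 4])]

# like df.select("Col2").first.getList(0).size
n_elements = len(rows[0][1])

header = ["Col1"] + [f"Col{i + 2}" for i in range(n_elements)]
exploded = [(col1, *arr[:n_elements]) for col1, arr in rows]

print(header)    # ['Col1', 'Col2', 'Col3', 'Col4']
print(exploded)  # [(1, 2, 3, 4), (1, 2, 3, 4)]
```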
Solution 2
Just to give the PySpark version of sgvd's answer. If the array column is in Col2, then this select statement will move the first nElements of each array in Col2 to their own columns:
from pyspark.sql import functions as F
df.select([F.col('Col2').getItem(i) for i in range(nElements)])
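As a rough illustration of what getItem(i) does per row, here is a plain-Python stand-in (not Spark; the data is hypothetical):

```python
# Plain-Python stand-in for F.col('Col2').getItem(i): each index i yields
# one generated column holding element i of every row's array.
# Data here is hypothetical.
col2_values = [[2, 3, 4], [2, 3, 4]]
n_elements = 3

# one inner list per generated column
columns = [[arr[i] for arr in col2_values] for i in range(n_elements)]
print(columns)  # [[2, 2], [3, 3], [4, 4]]
```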
Solution 3
Just to add on to sgvd's solution:
If the size is not always the same, you can set nElements like this:
import org.apache.spark.sql.functions.{size, max}

val nElements = df.select(size('Col2).as("Col2_count"))
  .select(max("Col2_count"))
  .first.getInt(0)
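A plain-Python sketch of this idea (no Spark; data is hypothetical): take the longest array as nElements, and pad shorter arrays with None, mirroring how Spark returns null when getItem indexes past the end of an array:

```python
# Plain-Python sketch of Solution 3: nElements is the longest array's
# length; shorter arrays are padded with None (Spark's getItem returns
# null past the end of an array). Data here is hypothetical.
rows = [(1, [2, 3, 4]), (1, [2, 3, 4, 5])]

n_elements = max(len(arr) for _, arr in rows)  # like max(size('Col2))
padded = [
    (col1, *(arr[i] if i < len(arr) else None for i in range(n_elements)))
    for col1, arr in rows
]

print(n_elements)  # 4
print(padded)      # [(1, 2, 3, 4, None), (1, 2, 3, 4, 5)]
```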
Comments
-
Jorge Machado almost 2 years
After:
val df = Seq((1, Vector(2, 3, 4)), (1, Vector(2, 3, 4))).toDF("Col1", "Col2")
I have this DataFrame in Apache Spark:
+------+---------+
| Col1 |  Col2   |
+------+---------+
|  1   |[2, 3, 4]|
|  1   |[2, 3, 4]|
+------+---------+
How do I convert this into:
+------+------+------+------+
| Col1 | Col2 | Col3 | Col4 |
+------+------+------+------+
|  1   |  2   |  3   |  4   |
|  1   |  2   |  3   |  4   |
+------+------+------+------+
-
Jorge Machado almost 8 years What if I don't know how many columns I have in the array? I would like to have something like explodeToColumns.
-
Mohammad Adnan over 7 years Nice answer. Is there a way to use the advanced DataSet APIs to achieve the above in a more type-safe way? E.g. how can I get rid of $"Col1"?
-
sgvd over 7 years What do you mean exactly? I guess not getting rid of it by just not adding it to the select :) I guess you don't want to name it explicitly? But I'm not sure how you see type safety coming into play. It is probably worth creating a new question with a minimal example of what you want to do.
-
Mohammad Adnan over 7 years Sure, I'll add another question. But what I mean is that Spark's DataSet provides good compile-time safety, whereas $"col1" is only evaluated at run time. So if col1 is not present in the dataset, I'd like a compile-time error in the select statement rather than a runtime one.
-
user422930 over 5 years How would you add "Col1" to the select statement?
-
Shane Halloran over 5 years @user422930, thank you for your question, sorry I only saw it now. Try doing something like:
from pyspark.sql import functions as F
df.select([F.col('Col1')] + [F.col('Col2').getItem(i) for i in range(nElements)])
I haven't tested it, but let me know whether it works.