How to explode multiple columns of a dataframe in pyspark

Solution 1

This works:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(['Bob'], [16], ['Maths', 'Physics', 'Chemistry'], ['A', 'B', 'C'])],
    ['Name', 'Age', 'Subjects', 'Grades'])
df.show()

+-----+----+--------------------+---------+
| Name| Age|            Subjects|   Grades|
+-----+----+--------------------+---------+
|[Bob]|[16]|[Maths, Physics, ...|[A, B, C]|
+-----+----+--------------------+---------+

Use a udf with zip: the columns that need to be exploded have to be merged into a single column before exploding.

# Zip the two arrays element-wise into a single array of (subs, grades) structs
combine = F.udf(lambda x, y: list(zip(x, y)),
                ArrayType(StructType([StructField("subs", StringType()),
                                      StructField("grades", StringType())])))

df = df.withColumn("new", combine("Subjects", "Grades"))\
       .withColumn("new", F.explode("new"))\
       .select("Name", "Age", F.col("new.subs").alias("Subjects"), F.col("new.grades").alias("Grades"))
df.show()


+-----+----+---------+------+
| Name| Age| Subjects|Grades|
+-----+----+---------+------+
|[Bob]|[16]|    Maths|     A|
|[Bob]|[16]|  Physics|     B|
|[Bob]|[16]|Chemistry|     C|
+-----+----+---------+------+
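
Note that Name and Age are still single-element arrays in this output. If, as in the question, each of them always holds exactly one value, a short follow-up (a minimal sketch, assuming single-element arrays) unwraps them into plain columns:

# Assuming Name and Age each contain exactly one element,
# take the first element to get scalar columns
df = df.withColumn("Name", F.col("Name")[0]) \
       .withColumn("Age", F.col("Age")[0])
df.show()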

Solution 2

PySpark added an arrays_zip function in 2.4, which eliminates the need for a Python UDF to zip the arrays.

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(['Bob'], [16], ['Maths', 'Physics', 'Chemistry'], ['A', 'B', 'C'])],
    ['Name', 'Age', 'Subjects', 'Grades'])
df = df.withColumn("new", F.arrays_zip("Subjects", "Grades"))\
       .withColumn("new", F.explode("new"))\
       .select("Name", "Age", F.col("new.Subjects").alias("Subjects"), F.col("new.Grades").alias("Grades"))
df.show()

+-----+----+---------+------+
| Name| Age| Subjects|Grades|
+-----+----+---------+------+
|[Bob]|[16]|    Maths|     A|
|[Bob]|[16]|  Physics|     B|
|[Bob]|[16]|Chemistry|     C|
+-----+----+---------+------+

Solution 3

Arriving late to the party :-)

The simplest way is to use inline, which doesn't have a Python API but is supported by selectExpr.

df.selectExpr('Name[0] as Name','Age[0] as Age','inline(arrays_zip(Subjects,Grades))').show()

+----+---+---------+------+
|Name|Age| Subjects|Grades|
+----+---+---------+------+
| Bob| 16|    Maths|     A|
| Bob| 16|  Physics|     B|
| Bob| 16|Chemistry|     C|
+----+---+---------+------+
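
As an aside, newer PySpark releases (3.4 and later, as far as I know) also expose inline directly in the Python API as pyspark.sql.functions.inline, so the same result can be written without selectExpr. A sketch under that version assumption:

import pyspark.sql.functions as F

# F.inline is only available in the Python API from Spark 3.4 onwards
df.select(F.col("Name")[0].alias("Name"),
          F.col("Age")[0].alias("Age"),
          F.inline(F.arrays_zip("Subjects", "Grades"))).show()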

Solution 4

Have you tried this?

from pyspark.sql.functions import col, explode

df.select(explode(col("Subjects")).alias("Subjects")).show()

Alternatively, you can convert the DataFrame to an RDD.

For an RDD you can use a flatMap function to separate the Subjects while keeping each row's other fields, as sketched below.
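
A minimal sketch of that idea, assuming Subjects and Grades are equal-length arrays that should stay paired (zipping in the single-element Name and Age columns directly would truncate the output, since zip stops at the shortest iterable):

# Emit one output row per (subject, grade) pair; Name and Age are
# repeated explicitly because zip() would truncate on them
rows = df.rdd.flatMap(
    lambda row: [(row.Name[0], row.Age[0], s, g)
                 for s, g in zip(row.Subjects, row.Grades)])
rows.toDF(['Name', 'Age', 'Subjects', 'Grades']).show()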

Comments

  • Visualisation App almost 2 years

    I have a dataframe whose columns contain lists, similar to the following. The lists are not the same length in every column.

    Name  Age  Subjects                  Grades
    [Bob] [16] [Maths,Physics,Chemistry] [A,B,C]
    

    I want to explode the dataframe in such a way that I get the following output:

    Name Age Subjects  Grades
    Bob  16  Maths     A
    Bob  16  Physics   B
    Bob  16  Chemistry C
    

    How can I achieve this?

  • Visualisation App almost 6 years
    I've tried a flatMap as df.rdd.flatMap(lambda x: zip(*[x[c] for c in dcols])).toDF(dcols), but it only gives me the first row and ignores the remaining rows: |16 |A |Bob |Maths |
  • Naveen Srikanth over 5 years
    What should I do if I need to put A, B and C in different columns rather than rows?
  • ARCrow about 2 years
    UDFs are not efficient and performant. They should be avoided if a PySpark API solution exists.