pyspark - merge 2 columns of sets

Solution 1

Let's say df has

+--------------------+--------------------+
|              Fruits|                Meat|
+--------------------+--------------------+
|[Pear, Orange, Ap...|[Chicken, Pork, B...|
+--------------------+--------------------+

then

import itertools
df.rdd.map(lambda x: list(itertools.chain(x.Fruits, x.Meat))).collect()

combines Fruits and Meat into one list per row, i.e.

[[u'Pear', u'Orange', u'Apple', u'Chicken', u'Pork', u'Beef']]
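
If you need the merged result deduplicated like a true set, you can wrap the chained values in Python's set(). A minimal sketch using the same df (sorted gives a stable order, since Python sets are unordered):

import itertools

# set() drops duplicates across both columns; sorted() restores a
# deterministic order, since Python sets are unordered.
df.rdd.map(lambda x: sorted(set(itertools.chain(x.Fruits, x.Meat)))).collect()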


Hope this helps!

Solution 2

I was also trying to do this in Python, so here is a port of Ramesh's solution (Solution 3 below):

df = spark.createDataFrame([(['Pear','Orange','Apple'], ['Chicken','Pork','Beef'])],
                           ("Fruits", "Meat"))
df.show(1, False)

from pyspark.sql.functions import col, udf
mergeCols = udf(lambda fruits, meat: fruits + meat)
df.withColumn("Food", mergeCols(col("Fruits"), col("Meat"))).show(1, False)

Output:

+---------------------+---------------------+
|Fruits               |Meat                 |
+---------------------+---------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|
+---------------------+---------------------+
+---------------------+---------------------+------------------------------------------+
|Fruits               |Meat                 |Food                                      |
+---------------------+---------------------+------------------------------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|[Pear, Orange, Apple, Chicken, Pork, Beef]|
+---------------------+---------------------+------------------------------------------+

Kudos to Ramesh!


EDIT: Note that you should specify the return type explicitly: udf defaults to StringType when no return type is given, so without the explicit ArrayType the merged column comes back as a plain string instead of an array.

from pyspark.sql.types import *
mergeCols = udf(lambda fruits, meat: fruits + meat, ArrayType(StringType()))
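
A quick way to verify this is to inspect the schema; a minimal check using the df from above:

# Should report Food as array<string> when the return type is given,
# and as string when it is omitted.
df.withColumn("Food", mergeCols(col("Fruits"), col("Meat"))).printSchema()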

Solution 3

Given that you have a dataframe as

+---------------------+---------------------+
|Fruits               |Meat                 |
+---------------------+---------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|
+---------------------+---------------------+

You can write a udf function to merge the sets of two columns into one.

import scala.collection.mutable
import org.apache.spark.sql.functions._

def mergeCols = udf((fruits: mutable.WrappedArray[String], meat: mutable.WrappedArray[String]) => fruits ++ meat)

And then call the udf function as

df.withColumn("Food", mergeCols(col("Fruits"), col("Meat"))).show(false)

You should get your desired final dataframe:

+---------------------+---------------------+------------------------------------------+
|Fruits               |Meat                 |Food                                      |
+---------------------+---------------------+------------------------------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|[Pear, Orange, Apple, Chicken, Pork, Beef]|
+---------------------+---------------------+------------------------------------------+
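
As a side note: on Spark 2.4+ you can avoid the udf entirely, since the built-in concat also accepts array columns and array_union additionally removes duplicates. A minimal PySpark sketch, assuming the same df as in Solution 2:

from pyspark.sql.functions import array_union, concat

# concat keeps duplicates from both arrays; array_union returns the
# distinct union instead.
df.withColumn("Food", concat("Fruits", "Meat")).show(1, False)
df.withColumn("Food", array_union("Fruits", "Meat")).show(1, False)
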
Comments

  • soulless over 1 year

    I have a Spark dataframe that has 2 columns formed from the function collect_set. I would like to combine these 2 columns of sets into 1 column of set. How should I do so? They are both sets of strings.

    For instance, I have 2 columns formed from calling collect_set:

    Fruits              | Meat
    [Apple,Orange,Pear] | [Beef, Chicken, Pork]
    

    How do I turn it into:

    Food
    
    [Apple,Orange,Pear, Beef, Chicken, Pork]
    

    Thank you very much for your help in advance

    • Avishek Bhattacharya over 6 years
      Please provide more information, like the structure of the dataframe with examples.
  • soulless over 6 years
    Is this with Python? I can't seem to find mutable.WrappedArray.
  • Ramesh Maharjan over 6 years
    This is all in Scala :)