PySpark: Add a new column with a tuple created from columns


Solution 1

I'm coming from Scala, but I believe there's a similar way in Python:

Using the sql.functions package methods:

If you want a StructType made of these three columns, use the struct(cols: Column*): Column method like this:

from pyspark.sql.functions import struct
df.withColumn("V_tuple", struct(df.V1, df.V2, df.V3))

If you want the result as a String instead, you can use the concat(exprs: Column*): Column method like this:

from pyspark.sql.functions import concat
df.withColumn("V_tuple", concat(df.V1, df.V2, df.V3))

With this second method you may need to cast the columns to Strings first.

I'm not sure about the Python syntax; feel free to edit the answer if there's a syntax error.

Hope this helps. Best regards.

Solution 2

Use struct:

from pyspark.sql.functions import struct

df.withColumn("V_tuple", struct(df.V1,df.V2,df.V3))
Author: Yuehan Lyu (Mathematics, Probability Theory, Stochastic Processes, Statistics, Data Science)

Updated on June 11, 2022

Comments

  • Yuehan Lyu, almost 2 years ago

    Here I have a DataFrame created as follows:

    df = spark.createDataFrame([('a',5,'R','X'),('b',7,'G','S'),('c',8,'G','S')], 
                           ["Id","V1","V2","V3"])
    

    It looks like

    +---+---+---+---+
    | Id| V1| V2| V3|
    +---+---+---+---+
    |  a|  5|  R|  X|
    |  b|  7|  G|  S|
    |  c|  8|  G|  S|
    +---+---+---+---+
    

    I'm looking to add a column that is a tuple consisting of V1,V2,V3.

    The result should look like

    +---+---+---+---+-------+
    | Id| V1| V2| V3|V_tuple|
    +---+---+---+---+-------+
    |  a|  5|  R|  X|(5,R,X)|
    |  b|  7|  G|  S|(7,G,S)|
    |  c|  8|  G|  S|(8,G,S)|
    +---+---+---+---+-------+
    

    I've tried to use similar syntax to plain Python, but it didn't work:

    df.withColumn("V_tuple",list(zip(df.V1,df.V2,df.V3)))
    

    TypeError: zip argument #1 must support iteration.

    Any help would be appreciated!