PySpark: Add a new column with a tuple created from columns


Solution 1

I'm coming from Scala, but I believe there's a similar way in Python:

Using the sql.functions package methods:

If you want a StructType made of these three columns, use the struct(cols: Column*): Column method like this:

from pyspark.sql.functions import struct
df.withColumn("V_tuple", struct(df.V1, df.V2, df.V3))

If you want the result as a String instead, you can use the concat(exprs: Column*): Column method like this:

from pyspark.sql.functions import concat
df.withColumn("V_tuple", concat(df.V1, df.V2, df.V3))

With this second method you may need to cast the columns to Strings first.

I'm not sure about the Python syntax; feel free to edit the answer if there's a syntax error.

Hope this helps. Best regards.

Solution 2

Use struct:

from pyspark.sql.functions import struct

df.withColumn("V_tuple", struct(df.V1,df.V2,df.V3))
Author: Yuehan Lyu (Mathematics, Probability Theory, Stochastic Processes, Statistics, Data Science)

Updated on June 11, 2022

Comments

  • Yuehan Lyu, almost 2 years ago

    Here I have a DataFrame created as follows:

    df = spark.createDataFrame([('a',5,'R','X'),('b',7,'G','S'),('c',8,'G','S')], 
                           ["Id","V1","V2","V3"])
    

    It looks like

    +---+---+---+---+
    | Id| V1| V2| V3|
    +---+---+---+---+
    |  a|  5|  R|  X|
    |  b|  7|  G|  S|
    |  c|  8|  G|  S|
    +---+---+---+---+
    

    I'm looking to add a column that is a tuple consisting of V1,V2,V3.

    The result should look like

    +---+---+---+---+-------+
    | Id| V1| V2| V3|V_tuple|
    +---+---+---+---+-------+
    |  a|  5|  R|  X|(5,R,X)|
    |  b|  7|  G|  S|(7,G,S)|
    |  c|  8|  G|  S|(8,G,S)|
    +---+---+---+---+-------+
    

    I've tried to use similar syntax to plain Python, but it didn't work:

    df.withColumn("V_tuple",list(zip(df.V1,df.V2,df.V3)))
    

    TypeError: zip argument #1 must support iteration.

    Any help would be appreciated!