Apply custom function to cells of selected columns of a data frame in PySpark
You have to use a udf (user-defined function) in Spark. Note that LongType must be imported from pyspark.sql.types, and that withColumn returns a new DataFrame rather than modifying the existing one:

from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

example_udf = udf(example, LongType())
df = df.withColumn('result', example_udf(df.address1, df.address2))
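A minimal end-to-end sketch, assuming the example function and the sample data from the question below (the local SparkSession setup and IntegerType return type are my own additions for illustration):

```python
def example(string1, string2):
    # Count the words the two address strings have in common.
    name_1 = string1.lower().split(' ')
    name_2 = string2.lower().split(' ')
    return len(set(name_1) & set(name_2))

def main():
    # pyspark imports are kept inside main() so the pure Python
    # function above can be reused without a Spark installation.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.master("local[1]").appName("udf-demo").getOrCreate()
    df = spark.createDataFrame(
        [(1, "address 1.1", "address 1.2"), (2, "address 2.1", "address 2.2")],
        ["id", "address1", "address2"],
    )
    # Wrap the plain Python function as a udf with an explicit return type.
    example_udf = udf(example, IntegerType())
    df.withColumn("result", example_udf(df.address1, df.address2)).show()
    spark.stop()

if __name__ == "__main__":
    main()
```

Executors call the plain Python function row by row, so anything that works on two strings works here; only the return type declared in udf() has to match what the function actually returns.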
Comments
Angie · almost 2 years ago
Let's say I have a data frame which looks like this:
+---+-----------+-----------+
| id|   address1|   address2|
+---+-----------+-----------+
|  1|address 1.1|address 1.2|
|  2|address 2.1|address 2.2|
+---+-----------+-----------+
I would like to apply a custom function directly to the strings in the address1 and address2 columns, for example:
def example(string1, string2):
    name_1 = string1.lower().split(' ')
    name_2 = string2.lower().split(' ')
    intersection_count = len(set(name_1) & set(name_2))
    return intersection_count
I want to store the result in a new column, so that my final data frame would look like:
+---+-----------+-----------+------+
| id|   address1|   address2|result|
+---+-----------+-----------+------+
|  1|address 1.1|address 1.2|     2|
|  2|address 2.1|address 2.2|     7|
+---+-----------+-----------+------+
I tried calling it the same way I once applied a built-in function to a whole column, but I got an error:
>>> df.withColumn('result', example(df.address1, df.address2))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in example
TypeError: 'Column' object is not callable
What am I doing wrong, and how can I apply a custom function to strings in selected columns?