Apply custom function to cells of selected columns of a data frame in PySpark


You have to use a udf (user-defined function) in Spark:

from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

# Wrap the plain Python function so Spark can apply it to Column values
example_udf = udf(example, LongType())
df = df.withColumn('result', example_udf(df.address1, df.address2))
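For reference, here is a minimal, self-contained sketch of the whole flow, assuming a local SparkSession and the example function from the question (IntegerType would work just as well as LongType for the return type):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()

# Reproduce the data frame from the question
df = spark.createDataFrame(
    [(1, 'address 1.1', 'address 1.2'), (2, 'address 2.1', 'address 2.2')],
    ['id', 'address1', 'address2'],
)

def example(string1, string2):
    # Count how many words the two address strings have in common
    name_1 = string1.lower().split(' ')
    name_2 = string2.lower().split(' ')
    return len(set(name_1) & set(name_2))

# Register the function as a udf and apply it column-wise
example_udf = udf(example, LongType())
df.withColumn('result', example_udf(df.address1, df.address2)).show()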

Comments

  • Angie almost 2 years

    Let's say I have a data frame which looks like this:

    +---+-----------+-----------+
    | id|   address1|   address2|
    +---+-----------+-----------+
    |  1|address 1.1|address 1.2|
    |  2|address 2.1|address 2.2|
    +---+-----------+-----------+
    

    I would like to apply a custom function directly to the strings in the address1 and address2 columns, for example:

    def example(string1, string2):
        name_1 = string1.lower().split(' ')
        name_2 = string2.lower().split(' ')
        intersection_count = len(set(name_1) & set(name_2))
    
        return intersection_count
    

    I want to store the result in a new column, so that my final data frame would look like:

    +---+-----------+-----------+------+
    | id|   address1|   address2|result|
    +---+-----------+-----------+------+
    |  1|address 1.1|address 1.2|     1|
    |  2|address 2.1|address 2.2|     1|
    +---+-----------+-----------+------+
    

    I tried calling it the same way I once applied a built-in function to a whole column, but I got an error:

    >>> df.withColumn('result', example(df.address1, df.address2))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<stdin>", line 2, in example
    TypeError: 'Column' object is not callable
    

    What am I doing wrong, and how can I apply a custom function to strings in selected columns?