How to pass a constant value to Python UDF?

11,448

Everything that is passed to an UDF is interpreted as a column / column name. If you want to pass a literal you have two options:

  1. Pass argument using currying:

    def comparatorUDF(n):
        return udf(lambda c: c == n, BooleanType())
    
    df.where(comparatorUDF("Bonsanto")(col("name")))
    

    This can be used with an argument of any type as long as it is serializable.

  2. Use a SQL literal and the current implementation:

    from pyspark.sql.functions import lit
    
    df.where(comparatorUDF(col("name"), lit("Bonsanto")))
    

    This works only with supported types (strings, numerics, booleans). For non-atomic types see How to add a constant column in a Spark DataFrame?

Share:
11,448

Related videos on Youtube

Alberto Bonsanto
Author by

Alberto Bonsanto

Computer Scientist with a passion for Artificial Intelligence and Machine Learning. I have a background in Software Development, Big Data and Distributed Computing and +4 years of experience in the field. I hold Bachelor's degrees in computer engineering and electrical engineering, and currently work as a Data Scientist (ML Engineer) at Rappi -- one of the highest valued tech company in Latin America.

Updated on July 13, 2022

Comments

  • Alberto Bonsanto
    Alberto Bonsanto over 1 year

    I was thinking if it was possible to create an UDF that receives two arguments a Column and another variable (Object,Dictionary, or any other type), then do some operations and return the result.

    Actually, I attempted to do this but I got an exception. Therefore, I was wondering if there was any way to avoid this problem.

    df = sqlContext.createDataFrame([("Bonsanto", 20, 2000.00), 
                                     ("Hayek", 60, 3000.00), 
                                     ("Mises", 60, 1000.0)], 
                                    ["name", "age", "balance"])
    
    comparatorUDF = udf(lambda c, n: c == n, BooleanType())
    
    df.where(comparatorUDF(col("name"), "Bonsanto")).show()
    

    And I get the following error:

    AnalysisException: u"cannot resolve 'Bonsanto' given input columns name, age, balance;"

    So it's obvious that the UDF "sees" the string "Bonsanto" as a column name, and actually I'm trying to compare a record value with the second argument.

    On the other hand, I know that it's possible to use some operators inside a where clause (but actually I want to know if it is achievable using an UDF), as follows:

    df.where(col("name") == "Bonsanto").show()
    
    #+--------+---+-------+
    #|    name|age|balance|
    #+--------+---+-------+
    #|Bonsanto| 20| 2000.0|
    #+--------+---+-------+