How to pass a constant value to Python UDF?
Everything that is passed to an UDF is interpreted as a column / column name. If you want to pass a literal you have two options:
-
Pass argument using currying:
def comparatorUDF(n): return udf(lambda c: c == n, BooleanType()) df.where(comparatorUDF("Bonsanto")(col("name")))
This can be used with an argument of any type as long as it is serializable.
-
Use a SQL literal and the current implementation:
from pyspark.sql.functions import lit df.where(comparatorUDF(col("name"), lit("Bonsanto")))
This works only with supported types (strings, numerics, booleans). For non-atomic types see How to add a constant column in a Spark DataFrame?
Related videos on Youtube
Alberto Bonsanto
Computer Scientist with a passion for Artificial Intelligence and Machine Learning. I have a background in Software Development, Big Data and Distributed Computing and +4 years of experience in the field. I hold Bachelor's degrees in computer engineering and electrical engineering, and currently work as a Data Scientist (ML Engineer) at Rappi -- one of the highest valued tech company in Latin America.
Updated on July 13, 2022Comments
-
Alberto Bonsanto over 1 year
I was thinking if it was possible to create an
UDF
that receives two arguments aColumn
and another variable (Object
,Dictionary
, or any other type), then do some operations and return the result.Actually, I attempted to do this but I got an exception. Therefore, I was wondering if there was any way to avoid this problem.
df = sqlContext.createDataFrame([("Bonsanto", 20, 2000.00), ("Hayek", 60, 3000.00), ("Mises", 60, 1000.0)], ["name", "age", "balance"]) comparatorUDF = udf(lambda c, n: c == n, BooleanType()) df.where(comparatorUDF(col("name"), "Bonsanto")).show()
And I get the following error:
AnalysisException: u"cannot resolve 'Bonsanto' given input columns name, age, balance;"
So it's obvious that the
UDF
"sees" thestring
"Bonsanto" as a column name, and actually I'm trying to compare a record value with the second argument.On the other hand, I know that it's possible to use some operators inside a
where
clause (but actually I want to know if it is achievable using anUDF
), as follows:df.where(col("name") == "Bonsanto").show() #+--------+---+-------+ #| name|age|balance| #+--------+---+-------+ #|Bonsanto| 20| 2000.0| #+--------+---+-------+