How to create a udf in PySpark which returns an array of strings?
You need to initialize a StringType
instance:
label_udf = udf(my_udf, ArrayType(StringType()))
# ^^
df.withColumn('subset', label_udf(df.col1)).show()
+------------+------+
| col1|subset|
+------------+------+
| oculunt|[s, n]|
|predistposed|[s, n]|
| incredulous|[s, n]|
+------------+------+
makansij
I'm a PhD Student at University of Southern California.
Updated on January 10, 2020Comments
-
makansij over 4 years
I have a udf which returns a list of strings. this should not be too hard. I pass in the datatype when executing the udf since it returns an array of strings:
ArrayType(StringType)
.Now, somehow this is not working:
the dataframe i'm operating on is
df_subsets_concat
and looks like this:df_subsets_concat.show(3,False)
+----------------------+ |col1 | +----------------------+ |oculunt | |predistposed | |incredulous | +----------------------+ only showing top 3 rows
and the code is
from pyspark.sql.types import ArrayType, FloatType, StringType my_udf = lambda domain: ['s','n'] label_udf = udf(my_udf, ArrayType(StringType)) df_subsets_concat_with_md = df_subsets_concat.withColumn('subset', label_udf(df_subsets_concat.col1))
and the result is
/usr/lib/spark/python/pyspark/sql/types.py in __init__(self, elementType, containsNull) 288 False 289 """ --> 290 assert isinstance(elementType, DataType), "elementType should be DataType" 291 self.elementType = elementType 292 self.containsNull = containsNull AssertionError: elementType should be DataType
It is my understanding that this was the correct way to do this. Here are some resources: pySpark Data Frames "assert isinstance(dataType, DataType), "dataType should be DataType" How to return a "Tuple type" in a UDF in PySpark?
But neither of these have helped me resolve why this is not working. i am using pyspark 1.6.1.
How to create a udf in pyspark which returns an array of strings?