How to create a udf in PySpark which returns an array of strings?

ArrayType expects a DataType instance, not the class itself, so you need to instantiate StringType:

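from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

label_udf = udf(my_udf, ArrayType(StringType()))
#                                           ^^ instantiate StringType
df.withColumn('subset', label_udf(df.col1)).show()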
+------------+------+
|        col1|subset|
+------------+------+
|     oculunt|[s, n]|
|predistposed|[s, n]|
| incredulous|[s, n]|
+------------+------+
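
To make the class-versus-instance distinction concrete, here is a minimal sketch that runs without Spark. The classes DataType, StringType and ArrayType below are simplified stand-ins written for illustration only, and try_build is a hypothetical helper. They are not PySpark's real implementations and only mirror the isinstance check quoted in the question's traceback.

```python
# Stand-in classes that mimic the check shown in the traceback.
# These are NOT PySpark's real implementations (assumption for illustration).
class DataType:
    """Hypothetical base class for all type descriptors."""


class StringType(DataType):
    """Hypothetical string type descriptor."""


class ArrayType(DataType):
    def __init__(self, elementType, containsNull=True):
        # Equivalent to the assert quoted in the question's traceback.
        if not isinstance(elementType, DataType):
            raise AssertionError("elementType should be DataType")
        self.elementType = elementType
        self.containsNull = containsNull


def try_build(element):
    """Return 'ok' if ArrayType accepts the argument, else the error message."""
    try:
        ArrayType(element)
        return "ok"
    except AssertionError as exc:
        return str(exc)


class_result = try_build(StringType)       # the class itself, as in the question
instance_result = try_build(StringType())  # an instance, as in the answer
print(class_result)
print(instance_result)
```

Passing the class fails because StringType is an instance of type, not of DataType. Calling StringType() produces the instance that the check accepts.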
Author: makansij

I'm a PhD Student at University of Southern California.

Updated on January 10, 2020

Comments

  • makansij, over 4 years ago

    I have a UDF which returns a list of strings. This should not be too hard. I pass in the data type when creating the UDF, since it returns an array of strings: ArrayType(StringType).

    Now, somehow this is not working:

    The DataFrame I'm operating on is df_subsets_concat and looks like this:

    df_subsets_concat.show(3,False)
    
    +----------------------+
    |col1                  |
    +----------------------+
    |oculunt               |
    |predistposed          |
    |incredulous           |
    +----------------------+
    only showing top 3 rows
    

    The code is:

    from pyspark.sql.types import ArrayType, FloatType, StringType
    
    my_udf = lambda domain: ['s','n']
    label_udf = udf(my_udf, ArrayType(StringType))
    df_subsets_concat_with_md = df_subsets_concat.withColumn('subset', label_udf(df_subsets_concat.col1))
    

    The result is:

    /usr/lib/spark/python/pyspark/sql/types.py in __init__(self, elementType, containsNull)
        288         False
        289         """
    --> 290         assert isinstance(elementType, DataType), "elementType should be DataType"
        291         self.elementType = elementType
        292         self.containsNull = containsNull
    
    AssertionError: elementType should be DataType
    

    It is my understanding that this is the correct way to do it. Here are the resources I looked at:

      pySpark Data Frames "assert isinstance(dataType, DataType), "dataType should be DataType"
      How to return a "Tuple type" in a UDF in PySpark?

    But neither of these has helped me work out why this is not working. I am using PySpark 1.6.1.

    How to create a UDF in PySpark which returns an array of strings?