Pyspark: Using UDF with argument(s) to create a new column

15,768

Solution 1

I have found three options for achieving this:

Setup reproducible example

import pandas as pd 
import datetime 

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

from pyspark.sql.types import DateType
from pyspark.sql.functions import expr, lit

sc = SparkContext.getOrCreate()
spark = SparkSession(sc) 

def to_date_formatted(date_str, format):
    if date_str == '' or date_str is None:
        return None
    try:
        dt = datetime.datetime.strptime(date_str, format)
    except:
        return None
    return dt.date()

data = {}
data["date_str_1"] = ["01-Dec-17", "05-Jan-12", "08-Mar-15"]
data["date_str_2"] = ["01/12/17", "05/01/12", "08/03/15"]

df = pd.DataFrame(data)
df = spark.createDataFrame(df)
df.registerTempTable('df')

Option 1

from pyspark.sql.functions import udf
to_date_udf = udf(to_date_formatted, DateType())
df = df.withColumn("parsed_date", to_date_udf('date_str_1', lit('%d-%b-%y')))
df.show()

Option 2

spark.udf.register("to_date_udf", to_date_formatted, DateType())
ex = "to_date_udf(date_str_1, '%d-%b-%y') as d"
df = df.withColumn("parsed_date", expr(ex))

df.show()

Option 3

Option 3 is just to curry the to_date_formatted function:

from functools import partial
curried_to_date = partial(to_date_formatted, format="%d-%b-%y")

curried_to_date = udf(curried_to_date, DateType())
df.withColumn("parsed_date", curried_to_date('date_str_1'))

Solution 2

Just use to_date:

from pyspark.sql.functions import to_date

df.withColumn("date_str_1_", to_date("date_str_1", "dd-MMM-yy")).show()
# +----------+----------+-----------+
# |date_str_1|date_str_2|date_str_1_|
# +----------+----------+-----------+
# | 01-Dec-17|  01/12/17| 2017-12-01|
# | 05-Jan-12|  05/01/12| 2012-01-05|
# | 08-Mar-15|  08/03/15| 2015-03-08|
# +----------+----------+-----------+

df.withColumn("date_str_2_", to_date("date_str_2", "dd/MM/yy")).show()
# +----------+----------+-----------+
# |date_str_1|date_str_2|date_str_2_|
# +----------+----------+-----------+
# | 01-Dec-17|  01/12/17| 2017-12-01|
# | 05-Jan-12|  05/01/12| 2012-01-05|
# | 08-Mar-15|  08/03/15| 2015-03-08|
# +----------+----------+-----------+
Share:
15,768
RobinL
Author by

RobinL

Data scientist/engineer for UK government

Updated on June 19, 2022

Comments

  • RobinL
    RobinL almost 2 years

    I have a user defined function as follows which I want to use to derive new columns in my dataframe:

    def to_date_formatted(date_str, format):
        if date_str == '' or date_str is None:
            return None
        try:
            dt = datetime.datetime.strptime(date_str, format)
        except:
            return None
        return dt.date()
    
    spark.udf.register("to_date_udf", to_date_formatted, DateType())
    

    I can use this by running sql like select to_date_udf(my_date, '%d-%b-%y') as date. Note the ability to pass a custom format as an argument to the function

    However, I'm struggling to use it using pyspark column expression syntax, rather than sql

    I want to write something like:

    df.with_column("date", to_date_udf('my_date', %d-%b-%y')
    

    But this results in an error. How can I do this?

    [Edit: In this specific example, in Spark 2.2+ you can provide an optional format argument with the built in to_date function. I'm on Spark 2.0 at the moment, so this is not possible for me. Also worth noting I provided this as an example, but I'm interested in the general syntax for providing arguments to UDFs, rather than the specifics of date conversion]

    • RobinL
      RobinL about 6 years
      Thank you - the answer to that question is useful and confirms my findings. I spent a of time Google trying to find an answer for this one, but I was searching more broadly for thing like 'pyspark udf arguments', and the title of that other question only indirectly relates to this.
  • RobinL
    RobinL about 6 years
    Thank you. I'm on Spark 2.0, and unfortunately to_date didn't take a date formater until Spark 2.2 - until then, the string had to be in iso format
  • Alper t. Turker
    Alper t. Turker about 6 years