Calculate the time between two dates in PySpark


Solution 1

OK, figured it out:

from pyspark.sql.types import StructType, StructField, DateType
import pyspark.sql.functions as funcs
import datetime

today = datetime.date(2017, 2, 15)

# build a one-row DataFrame with a DateType column
schema = StructType([StructField("foo", DateType(), True)])
l = [(datetime.date(2017, 2, 14),)]
df = sqlContext.createDataFrame(l, schema)

# datediff(end, start) counts the days from start up to end
df = df.withColumn('daysBetween', funcs.datediff(funcs.lit(today), df.foo))
df.collect()

returns [Row(foo=datetime.date(2017, 2, 14), daysBetween=1)]
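Note that the argument order is datediff(end, start); swapping the arguments, as in this sketch reusing the DataFrame above, should give the negative count:

df.withColumn('daysBetween', funcs.datediff(df.foo, funcs.lit(today))).collect()
# [Row(foo=datetime.date(2017, 2, 14), daysBetween=-1)]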

Solution 2

You can simply do the following:

import pyspark.sql.functions as F

# current_date() returns today's date at query time
df = df.withColumn('daysSince', F.datediff(F.current_date(), df.foo))
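For context, here is a minimal end-to-end sketch of this approach, assuming a SparkSession bound to the name spark (the column name foo comes from the question):

import datetime
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, DateType

# one-row DataFrame with a DateType column, as in the question
schema = StructType([StructField("foo", DateType(), True)])
df = spark.createDataFrame([(datetime.date(2016, 12, 1),)], schema)

# current_date() is resolved when the query runs, so daysSince
# reflects the date of execution
df = df.withColumn('daysSince', F.datediff(F.current_date(), df.foo))
df.show()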
Author: jamiet
Updated on July 09, 2022
Comments

  • jamiet over 1 year

    Hoping this is fairly elementary. I have a Spark DataFrame containing a Date column, and I want to add a new column with the number of days since that date. Google-fu is failing me.

    Here's what I've tried:

    from pyspark.sql.types import StructType, StructField, DateType
    import datetime

    today = datetime.date.today()

    schema = StructType([StructField("foo", DateType(), True)])
    l = [(datetime.date(2016, 12, 1),)]
    df = sqlContext.createDataFrame(l, schema)
    df = df.withColumn('daysBetween', today - df.foo)  # raises the error below
    df.show()
    

    It fails with this error:

    u"cannot resolve '(17212 - foo)' due to data type mismatch: '(17212 - foo)' requires (numeric or calendarinterval) type, not date;"

    I've tried fiddling around but have gotten nowhere. I can't imagine this is that hard. Can anyone help?

  • gabra almost 6 years
    So others know: the differences are in days (see the sketch below). spark.apache.org/docs/2.1.0/api/python/…