Calculate time between two dates in PySpark
Solution 1
OK, figured it out:
from pyspark.sql.types import *
import pyspark.sql.functions as funcs
import datetime

# Reference date to measure against
today = datetime.date(2017, 2, 15)

# One-row DataFrame with a single DateType column
schema = StructType([StructField("foo", DateType(), True)])
l = [(datetime.date(2017, 2, 14),)]
df = sqlContext.createDataFrame(l, schema)

# datediff(end, start) returns the whole number of days from start to end
df = df.withColumn('daysBetween', funcs.datediff(funcs.lit(today), df.foo))
df.collect()
returns [Row(foo=datetime.date(2017, 2, 14), daysBetween=1)]
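Note the argument order: datediff(end, start) counts days from start to end, which is why 2017-02-14 measured against 2017-02-15 yields 1.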
Solution 2
You can simply do the following:
import pyspark.sql.functions as F

# current_date() is evaluated by Spark, so no Python-side date is needed
df = df.withColumn('daysSince', F.datediff(F.current_date(), df.foo))
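For completeness, here is a minimal end-to-end sketch of this approach (assuming Spark 2.x or later, where a SparkSession replaces the older sqlContext used above):

import datetime
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DateType
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# One-row DataFrame with a single DateType column
schema = StructType([StructField("foo", DateType(), True)])
df = spark.createDataFrame([(datetime.date(2016, 12, 1),)], schema)

# Days elapsed from the column's date up to today, as an integer
df = df.withColumn('daysSince', F.datediff(F.current_date(), df.foo))
df.show()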
Comments
- jamiet over 1 year: Hoping this is fairly elementary. I have a Spark DataFrame containing a Date column, and I want to add a new column with the number of days since that date. Google-fu is failing me.
Here's what I've tried:
from pyspark.sql.types import *
import datetime
today = datetime.date.today()
schema = StructType([StructField("foo", DateType(), True)])
l = [(datetime.date(2016, 12, 1),)]
df = sqlContext.createDataFrame(l, schema)
df = df.withColumn('daysBetween', today - df.foo)
df.show()
It fails with the error:
u"cannot resolve '(17212 - foo)' due to data type mismatch: '(17212 - foo)' requires (numeric or calendarinterval) type, not date;"
I've tried fiddling around but have gotten nowhere. Surely this can't be too hard. Can anyone help?
- gabra almost 6 years: So others can know: the differences are in days. spark.apache.org/docs/2.1.0/api/python/…
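As that comment notes, datediff only counts whole days. If finer granularity is needed, one option is to subtract Unix timestamps instead; this is a sketch, and start_ts and end_ts are hypothetical TimestampType columns, not ones from the question:

import pyspark.sql.functions as F

# unix_timestamp() converts a timestamp column to seconds since the epoch,
# so subtracting two of them yields the difference in seconds
df = df.withColumn(
    'secondsBetween',
    F.unix_timestamp(F.col('end_ts')) - F.unix_timestamp(F.col('start_ts'))
)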