Pyspark: Difference between two Dates (Cast TimestampType, Datediff)
This answer is basically a copy of https://stackoverflow.com/a/36985244/4219202. In your case the timeFmt would be "dd-MM-yyyy" for the columns first_booking_date_clean and today.
As of Spark 1.5 you can use unix_timestamp:
from pyspark.sql import functions as F
timeFmt = "yyyy-MM-dd'T'HH:mm:ss.SSS"
timeDiff = (F.unix_timestamp('EndDateTime', format=timeFmt)
            - F.unix_timestamp('StartDateTime', format=timeFmt))
df = df.withColumn("Duration", timeDiff)
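Applied to the columns from the question, the same idea might look like the sketch below. It is not taken from the linked answer: the helper names add_customer_since and days_between are hypothetical, and wrapping the parsed values in to_date and datediff (rather than dividing a difference in seconds by 86400) is an assumption made here so that daylight-saving shifts cannot change the day count. The plain-Python function is only a cross-check of the expected result.

```python
from datetime import datetime

SPARK_FMT = "dd-MM-yyyy"  # Spark pattern for first_booking_date_clean and today
PY_FMT = "%d-%m-%Y"       # the same layout in Python's strptime syntax


def add_customer_since(df, start_col="first_booking_date_clean", end_col="today"):
    """Hypothetical helper: add the number of days between two dd-MM-yyyy string columns."""
    # Imported here so the cross-check below also runs where Spark is not installed.
    from pyspark.sql import functions as F

    # Parse each string with an explicit format, then reduce it to a date.
    start = F.to_date(F.unix_timestamp(start_col, format=SPARK_FMT).cast("timestamp"))
    end = F.to_date(F.unix_timestamp(end_col, format=SPARK_FMT).cast("timestamp"))
    return df.withColumn("customer_since", F.datediff(end, start))


def days_between(start, end, fmt=PY_FMT):
    """Plain-Python cross-check of the value expected in customer_since."""
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).days


# The two rows shown in the question.
print(days_between("02-06-2011", "02-06-2011"))
print(days_between("03-06-2004", "02-06-2011"))
```

With the DataFrame from the question, first_bookings = add_customer_since(first_bookings) should fill customer_since with the same values the cross-check prints.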
Niklas Brauer
Updated on July 01, 2020
Niklas Brauer almost 4 years
There is a table with incidents and a specific timestamp. I struggle to calculate the number of days passed using the PySpark 2.0 API. I managed to do the same thing when the timestamp followed another format (yyyy-mm-dd).
+-------------------+------------------------+----------+--------------+
| first_booking_date|first_booking_date_clean|     today|customer_since|
+-------------------+------------------------+----------+--------------+
|02-06-2011 20:52:04|              02-06-2011|02-06-2011|          null|
|03-06-2004 18:15:10|              03-06-2004|02-06-2011|          null|
I tried the following (nothing worked):
- extract date with string manipulation and use datediff
- cast to timestamp and then extract dd:MM:yy (-> result null)
- I prefer to use pyspark commands over any additional transformation with sql
Help is highly appreciated, Best and thanks a lot!!!
EDIT: Here is an example that did not work:
import datetime
today = datetime.date(2011,2,1)
today = "02-06-2011"
first_bookings = first_bookings.withColumn("today",F.lit(today))
first_bookings = first_bookings.withColumn("first_booking_date_clean",F.substring(first_bookings.first_booking_date, 0, 10))
first_bookings = first_bookings.withColumn("customer_since",F.datediff(first_bookings.today,first_bookings.first_booking_date_clean))
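A likely reason for the null values: datediff is handed plain strings, and as far as I know Spark casts those to dates expecting a yyyy-MM-dd layout, so a dd-MM-yyyy string fails to parse and becomes null. The sketch below only imitates that behaviour in plain Python. It does not call Spark, and parse_or_none is a hypothetical helper introduced for illustration.

```python
from datetime import datetime


def parse_or_none(text, fmt):
    """Parse text with fmt; return None on a mismatch, imitating a null result."""
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None


# The layout a default string-to-date cast expects does not match the data.
print(parse_or_none("02-06-2011", "%Y-%m-%d"))
# Naming the layout explicitly lets the same string parse.
print(parse_or_none("02-06-2011", "%d-%m-%Y"))
```

The first call yields None and the second a proper date, which mirrors why passing an explicit format to unix_timestamp helps.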