Pyspark: Difference between two Dates (Cast TimestampType, Datediff)


This answer is basically a copy of https://stackoverflow.com/a/36985244/4219202. In your case, the timeFmt would be "dd-MM-yyyy" for the columns first_booking_date_clean and today.

As of Spark 1.5 you can use unix_timestamp:

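from pyspark.sql import functions as F

# Pattern describing how the timestamp strings are laid out
timeFmt = "yyyy-MM-dd'T'HH:mm:ss.SSS"
# unix_timestamp parses each string to epoch seconds, so the difference is in seconds
timeDiff = (F.unix_timestamp('EndDateTime', format=timeFmt)
            - F.unix_timestamp('StartDateTime', format=timeFmt))
df = df.withColumn("Duration", timeDiff)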
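Adapted to the columns in the question, the approach might look like the sketch below. It assumes a DataFrame whose first_booking_date_clean and today columns are strings in dd-MM-yyyy form. The helper name add_customer_since and the constant DATE_FMT are made up for illustration and are not part of PySpark. Because unix_timestamp returns seconds rather than days, the sketch casts the parsed values back to dates and lets datediff do the day counting.

```python
DATE_FMT = "dd-MM-yyyy"  # assumed layout of both date columns


def add_customer_since(df, start_col="first_booking_date_clean", end_col="today"):
    """Return df with a customer_since column: days from start_col to end_col.

    Hypothetical helper; df is expected to be a pyspark.sql.DataFrame.
    """
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark.sql import functions as F

    # Parse the strings to epoch seconds, then convert to proper dates.
    start = F.to_date(F.unix_timestamp(start_col, format=DATE_FMT).cast("timestamp"))
    end = F.to_date(F.unix_timestamp(end_col, format=DATE_FMT).cast("timestamp"))

    # datediff(end, start) counts the days from start to end.
    return df.withColumn("customer_since", F.datediff(end, start))
```

With the DataFrame from the question, `first_bookings = add_customer_since(first_bookings)` should then fill customer_since with day counts instead of null, provided every value really matches the assumed pattern.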
Author by

Niklas Brauer

Updated on July 01, 2020

Comments

  • Niklas Brauer, almost 4 years ago

    I have a table of incidents, each with a timestamp. I am struggling to calculate the number of days elapsed using the PySpark 2.0 API. I managed to do the same thing when the timestamp was in another format (yyyy-MM-dd).

        +-------------------+------------------------+----------+--------------+
        | first_booking_date|first_booking_date_clean|     today|customer_since|
        +-------------------+------------------------+----------+--------------+
        |02-06-2011 20:52:04|              02-06-2011|02-06-2011|          null|
        |03-06-2004 18:15:10|              03-06-2004|02-06-2011|          null|
    

    I tried the following (nothing worked):

    - extracting the date with string manipulation and using datediff
    - casting to timestamp and then extracting dd:MM:yy (result: null)

    I would prefer to use PySpark functions rather than any additional transformation with SQL.

    Help is highly appreciated. Best, and thanks a lot!

    EDIT: Here is an example that did not work:

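    import datetime
    from pyspark.sql import functions as F

    today = datetime.date(2011, 2, 1)  # overwritten by the string on the next line
    today = "02-06-2011"
    first_bookings = first_bookings.withColumn("today", F.lit(today))
    # Keep the first 10 characters, i.e. the dd-MM-yyyy date part
    first_bookings = first_bookings.withColumn("first_booking_date_clean", F.substring(first_bookings.first_booking_date, 0, 10))
    # customer_since comes back null for every row
    first_bookings = first_bookings.withColumn("customer_since", F.datediff(first_bookings.today, first_bookings.first_booking_date_clean))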
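To check a Spark result, the expected customer_since values for the two sample rows in the table above can be worked out in plain Python. This is only a sketch: days_between is a hypothetical helper, and it assumes the strings are day-month-year, as the table suggests.

```python
from datetime import datetime

DATE_FMT = "%d-%m-%Y"  # strptime equivalent of the pattern "dd-MM-yyyy"


def days_between(start, end, fmt=DATE_FMT):
    """Return the number of whole days from start to end (both date strings)."""
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).days


# Sample rows from the table: (first_booking_date_clean, today)
rows = [("02-06-2011", "02-06-2011"), ("03-06-2004", "02-06-2011")]
expected = [days_between(start, end) for start, end in rows]
print(expected)  # [0, 2555]
```

If a Spark job reports different numbers for these rows, the date pattern or the argument order of datediff is the first thing worth checking.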