Convert date string to timestamp for sorting on sub-second precision


Solution 1

For Spark >= 2.2, an alternative to unix_timestamp() is to_timestamp():

from pyspark.sql.functions import col, to_timestamp

new_df = df.withColumn('converted_date', to_timestamp(col('date'), "yyyy-MM-dd'T'HH:mm:ss"))
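
As written, this pattern stops at the seconds, so sub-second ordering is still lost. On Spark 3.x, the newer DateTimeFormatter-style patterns should (assuming the default, non-legacy parser settings) let you keep the microseconds by spelling out the fractional digits. A minimal sketch:

from pyspark.sql.functions import col, to_timestamp

# Sketch for Spark 3.x (assumes the default, non-legacy datetime parser):
# six 'S' characters parse the six fractional digits, so the resulting
# TimestampType column keeps microsecond precision and sorts correctly.
new_df = df.withColumn(
    'converted_date',
    to_timestamp(col('date'), "yyyy-MM-dd'T'HH:mm:ss.SSSSSS")
)
new_df.sort('converted_date').show(truncate=False)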

Solution 2

Suppose you had the following DataFrame:

df = spark.createDataFrame(
    [('2018-02-01T13:13:12.023507', ), ('2018-02-01T13:13:12.323507', )], 
    ["date"]
)
df.show(truncate=False)
#+--------------------------+
#|date                      |
#+--------------------------+
#|2018-02-01T13:13:12.023507|
#|2018-02-01T13:13:12.323507|
#+--------------------------+

unix_timestamp() only supports second precision. If sorting to the nearest second is enough for you, you can do the following:

from pyspark.sql.functions import col, unix_timestamp
df.withColumn(
    'new_date',
    unix_timestamp(col('date'), "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp")
).sort('new_date').show(truncate=False)
#+--------------------------+---------------------+
#|date                      |new_date             |
#+--------------------------+---------------------+
#|2018-02-01T13:13:12.323507|2018-02-01 13:13:12.0|
#|2018-02-01T13:13:12.023507|2018-02-01 13:13:12.0|
#+--------------------------+---------------------+

But since these two example rows are identical down to the second, their relative order after sorting is indeterminate.

If the sub-second portion is important to you, you can handle it yourself. One way is to split the date column on the "." to get the microsecond digits, divide them by 1000000.0 to convert to fractional seconds, and add the result to the unix_timestamp() value to build a sortable order column:

from pyspark.sql.functions import split, unix_timestamp

df.withColumn(
    'order_column',
    unix_timestamp('date', "yyyy-MM-dd'T'HH:mm:ss") + split('date', r"\.")[1] / 1000000.0
).sort("order_column").show(truncate=False)
#+--------------------------+-------------------+
#|date                      |order_column       |
#+--------------------------+-------------------+
#|2018-02-01T13:13:12.023507|1.517508792023507E9|
#|2018-02-01T13:13:12.323507|1.517508792323507E9|
#+--------------------------+-------------------+
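
If you would rather end up with an actual timestamp column instead of a raw double, one option (a sketch, assuming Spark's numeric-to-timestamp cast treats the value as fractional seconds since the epoch; the order_ts column name is just illustrative) is to cast the combined value back to a timestamp:

from pyspark.sql.functions import split, unix_timestamp

# Sketch: whole seconds from unix_timestamp() plus the microsecond digits
# converted to a fraction, cast back to TimestampType for sorting/display.
# Assumes cast(double -> timestamp) interprets the double as epoch seconds.
df.withColumn(
    'order_ts',
    (
        unix_timestamp('date', "yyyy-MM-dd'T'HH:mm:ss")
        + split('date', r"\.")[1].cast("double") / 1000000.0
    ).cast("timestamp")
).sort('order_ts').show(truncate=False)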
Comments

  • Gerasimos almost 2 years

    I have a column date in a PySpark DataFrame with dates in the following format:

    2018-02-01T13:13:12.023507
    

    I want to convert the dates in that column from string to timestamp (or to something else I can sort by date). So far I have tried the following:

    new_df = df.withColumn(
        'date', 
        unix_timestamp("date", "YYYY-MM-DD'T'hh:mm:ss.s").cast("double").cast("timestamp")
    )
    

    and this

    new_df = df.withColumn(
        'date',
        from_unixtime(
            unix_timestamp(col(('date')), "yyyy-MM-dd'T'hh:mm:ss.SSS"), 
            "yyyy-MM-dd'T'HH:mm:ss.SSS"
        )
    )
    

    and this

    df_new = df.withColumn(
        'date1',
        F.to_timestamp("date", "yyyy-dd-MM'T'hh:mm:ss.s")
    )
    

    I tried everything I found in other similar questions, but so far nothing is working. I have also tried a different format, yyyy-MM-dd'T'HH:mm:ss.ssssss, with no success. What am I missing?

  • Triamus almost 5 years
    I think it should be from pyspark.sql.functions import col, unix_timestamp