TimestampType in PySpark with tz-aware datetime objects

TimestampType in PySpark is not tz-aware the way it is in pandas; internally, Spark stores timestamps as long integers (an epoch-based value) and, by default, displays them according to your machine's local time zone.

That said, you can change your Spark session time zone using 'spark.sql.session.timeZone':

from datetime import datetime
from dateutil import tz
from pyspark.sql import Row

# `spark` is an existing SparkSession (e.g., the one the pyspark shell provides)
utc_now = datetime.now().replace(tzinfo=tz.tzutc())
print(utc_now)

# Render timestamps in the 'Europe/Paris' time zone
spark.conf.set('spark.sql.session.timeZone', 'Europe/Paris')
data_df = spark.createDataFrame([Row(date=utc_now)])
data_df.show(10, False)
print(data_df.collect())

    2018-02-12 20:41:16.270386+00:00
    +--------------------------+
    |date                      |
    +--------------------------+
    |2018-02-12 21:41:16.270386|
    +--------------------------+

    [Row(date=datetime.datetime(2018, 2, 12, 21, 41, 16, 270386))]
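To verify that Spark really stores a plain epoch value and that the session time zone only affects how it is rendered, you can cast the column to long. A minimal sketch, reusing data_df from above (the cast yields seconds since the Unix epoch):

from pyspark.sql import functions as F

# The underlying value is time-zone independent: casting the timestamp to
# long returns seconds since 1970-01-01 00:00:00 UTC, whatever the session
# time zone is set to.
data_df.select(F.col('date').cast('long').alias('epoch_seconds')).show()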


# Switching the session time zone changes what show() renders
spark.conf.set('spark.sql.session.timeZone', 'UTC')
data_df2 = spark.createDataFrame([Row(date=utc_now)])
data_df2.show(10, False)
print(data_df2.collect())

    +--------------------------+
    |date                      |
    +--------------------------+
    |2018-02-12 20:41:16.270386|
    +--------------------------+

    [Row(date=datetime.datetime(2018, 2, 12, 21, 41, 16, 270386))]

As you can see, show() now renders the timestamp in UTC, but collect() still serves it back shifted into the local time zone, since the Python driver's time zone is still 'Europe/Paris'. Aligning Python's own time zone fixes that:

import os, time

# Align the Python process's time zone with UTC as well
os.environ['TZ'] = 'UTC'
time.tzset()
utc_now = datetime.now()
spark.conf.set('spark.sql.session.timeZone', 'UTC')
data_df2 = spark.createDataFrame([Row(date=utc_now)])
data_df2.show(10, False)
print(data_df2.collect())

    +--------------------------+
    |date                      |
    +--------------------------+
    |2018-02-12 20:41:16.807757|
    +--------------------------+

    [Row(date=datetime.datetime(2018, 2, 12, 20, 41, 16, 807757))]
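Note that even now collect() hands back naive datetime objects; Spark drops the zone info entirely. If you need tz-aware objects downstream, a minimal sketch (assuming, as configured above, that both Spark and the driver run on UTC) is to re-attach the zone yourself:

# collect() returns naive datetimes; re-attach UTC explicitly if needed
rows = data_df2.collect()
aware = [r.date.replace(tzinfo=tz.tzutc()) for r in rows]
print(aware)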

Moreover, the pyspark.sql.functions module provides two functions for shifting a timestamp between UTC and another zone's wall-clock time (from_utc_timestamp, to_utc_timestamp), although I don't think you want to alter your datetimes here.
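For completeness, a short sketch of those two functions, reusing data_df2 from above:

from pyspark.sql import functions as F

# from_utc_timestamp: treat the value as UTC and shift it to Paris wall-clock time
# to_utc_timestamp:   treat the value as Paris wall-clock time and shift it to UTC
data_df2.select(
    F.from_utc_timestamp('date', 'Europe/Paris').alias('paris_wall_clock'),
    F.to_utc_timestamp('date', 'Europe/Paris').alias('utc_from_paris'),
).show(10, False)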


Comments

  • Apostolos (almost 2 years ago)

    I have the following issue that I cannot fully understand in PySpark. I have the following datetime object:

    utc_now = datetime.now().replace(tzinfo=tz.tzutc())
    utc_now # datetime.datetime(2018, 2, 12, 13, 9, 52, 785007, tzinfo=tzutc())
    

    and I create a Spark DataFrame

    data_df = spark.createDataFrame([Row(date=utc_now)])
    

    when I try to show the dataframe

    data_df.show(10, False)
    

    the column containing the data is shown in local time, which is two hours ahead:

    >>> data_df.show(10, False)
    +--------------------------+
    |date                      |
    +--------------------------+
    |2018-02-12 15:09:52.785007|
    +--------------------------+
    

    and collecting the data returns a datetime object shifted two hours ahead:

    >>> data_df.collect()
    [Row(date=datetime.datetime(2018, 2, 12, 15, 9, 52, 785007))]
    

    Zone info is also removed. Can this behavior be altered when casting to TimestampType?

  • Apostolos (about 6 years ago)
    Where did you find this information about the conf strings? I couldn't find them anywhere in the documentation.
  • MaFF (about 6 years ago)
    It's not documented, but there is an issue for it: issues.apache.org/jira/browse/SPARK-18936. I needed it to handle data without daylight saving time properly, and there are a few threads on SO about setting this configuration, namely stackoverflow.com/questions/45434538/…