Date and Interval Addition in SparkSQL

Solution 1

This is probably a matter of conversion to local time. The INTERVAL arithmetic casts the DATE to TIMESTAMP and then back to DATE:

scala> spark.sql("SELECT CAST('1997-09-19' AS DATE) + INTERVAL 1 weeks").explain
== Physical Plan ==
*Project [10130 AS CAST(CAST(CAST(1997-09-19 AS DATE) AS TIMESTAMP) + interval 1 weeks AS DATE)#19]
+- Scan OneRowRelation[]

(note the second and third CASTs), and Spark is known to be inconsistent when handling timestamps.
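
If you are stuck on Spark 2.x, one workaround (assuming the one-day shift really does come from that local-time round trip, which depends on your session time zone) is to pin the session time zone to UTC before running the query; spark.sql.session.timeZone is a standard Spark SQL setting:

scala> // sketch: with UTC there are no DST transitions to shift the date during the round trip
scala> spark.conf.set("spark.sql.session.timeZone", "UTC")
scala> spark.sql("SELECT CAST('1997-09-19' AS DATE) + INTERVAL 1 week").show()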

DATE_ADD should exhibit more stable behavior:

scala> spark.sql("SELECT DATE_ADD(CAST('1997-09-19' AS DATE), 7)").explain
== Physical Plan ==
*Project [10130 AS date_add(CAST(1997-09-19 AS DATE), 7)#27]
+- Scan OneRowRelation[]
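
Running the same query with .show() instead of .explain (same spark-shell session assumed) should print 1997-09-26, since date_add works on the plain day count and no DATE-to-TIMESTAMP round trip is involved:

scala> spark.sql("SELECT DATE_ADD(CAST('1997-09-19' AS DATE), 7)").show()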

Solution 2

As of Spark 3, this bug has been fixed. Let's create a DataFrame with the dates you mentioned and add a week interval. First, create the DataFrame:

import java.sql.Date

import org.apache.spark.sql.functions.expr
import spark.implicits._ // for toDF on a local Seq (spark is the spark-shell SparkSession)

val df = Seq(
  Date.valueOf("1999-09-19"),
  Date.valueOf("1997-09-19")
).toDF("some_date")

Add a week interval:

df
  .withColumn("plus_one_week", expr("some_date + INTERVAL 1 week"))
  .show()
+----------+-------------+
| some_date|plus_one_week|
+----------+-------------+
|1999-09-19|   1999-09-26|
|1997-09-19|   1997-09-26|
+----------+-------------+
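
For reference, the plain SQL form from the original question works as well in a Spark 3 shell (assumed here) and returns 1997-09-26:

spark.sql("SELECT CAST('1997-09-19' AS DATE) + INTERVAL 1 week AS plus_one_week").show()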

You can also get this same result with the make_interval() SQL function:

// make_interval(years, months, weeks, days, hours, mins, secs): the third argument sets weeks = 1
df
  .withColumn("plus_one_week", expr("some_date + make_interval(0, 0, 1, 0, 0, 0, 0)"))
  .show()

We're working on getting make_interval() exposed as a Scala/PySpark function, so that it won't be necessary to use expr to access it.

date_add only works for adding days, so it's limited. make_interval() is a lot more powerful because it lets you add any combination of years / months / weeks / days / hours / minutes / seconds.
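
For example, reusing the df from above, here is a sketch that adds one month and two days in a single expression (the column name plus_1_month_2_days is just an illustrative choice):

df
  .withColumn("plus_1_month_2_days", expr("some_date + make_interval(0, 1, 0, 2, 0, 0, 0)"))
  .show()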

Author: yakout, updated on August 25, 2022

Comments

  • yakout, over 1 year ago

    I am trying to execute a simple SQL query on a DataFrame in spark-shell. The query adds an interval of 1 week to a date, as follows:

    The original query:

    scala> spark.sql("select Cast(table1.date2 as Date) + interval 1 week from table1").show()
    

    Now when I did some tests:

    scala> spark.sql("select Cast('1999-09-19' as Date) + interval 1 week from table1").show()
    

    I got the correct result:

    +----------------------------------------------------------------------------+
    |CAST(CAST(CAST(1999-09-19 AS DATE) AS TIMESTAMP) + interval 1 weeks AS DATE)|
    +----------------------------------------------------------------------------+
    |                                                                  1999-09-26|
    +----------------------------------------------------------------------------+
    

    (Just adding 7 days to 19 = 26)

    But when I just changed the year to 1997 instead of 1999, the result changed!

    scala> spark.sql("select Cast('1997-09-19' as Date) + interval 1 week from table1").show()
    
    +----------------------------------------------------------------------------+
    |CAST(CAST(CAST(1997-09-19 AS DATE) AS TIMESTAMP) + interval 1 weeks AS DATE)|
    +----------------------------------------------------------------------------+
    |                                                                  1997-09-25|
    +----------------------------------------------------------------------------+
    

    Why did the result change? Shouldn't it be 26, not 25?

    So, is this a bug in Spark SQL related to some kind of intermediate calculation loss, or am I missing something?

  • Rick Moritz, over 6 years ago
    Inconsistent too: If you have a cluster spanning two timezones, timestamp to date conversion completely falls apart (unless you use methods with explicit timezones every time).
  • Thiago Mata, almost 3 years ago
    I can confirm that the function makeInterval is available and working with Datasets in Spark 3.0.1: makeInterval(years = 0, months = 0, weeks = 1, days = 0, hours = 0, mins = 0, secs = Decimal(0))