How to filter a python Spark DataFrame by date between two date format columns

15,881

If you have a working SQL query, you can always register your DataFrame as a temp table and use spark.sql():

df.createOrReplaceTempView("MYTABLE")
spark.sql("SELECT * FROM MYTABLE WHERE '2018-12-31' BETWEEN start_dt AND end_dt").show()
#+-------+----------+----------+
#|ColumnA|  START_DT|    END_DT|
#+-------+----------+----------+
#|      1|2016-01-01|2020-02-04|
#+-------+----------+----------+

Another option is to pass an expression to where:

df.where("'2018-12-31' BETWEEN start_dt AND end_dt").show()
#+-------+----------+----------+
#|ColumnA|  START_DT|    END_DT|
#+-------+----------+----------+
#|      1|2016-01-01|2020-02-04|
#+-------+----------+----------+

One more way is to use pyspark.sql.Column.between with pyspark.sql.functions.lit, but you'll have to use pyspark.sql.functions.expr in order to use a column value as a parameter.

from pyspark.sql.functions import lit, expr

test_date = "2018-12-31"
df.where(lit(test_date).between(expr('start_dt'), expr('end_dt'))).show()
#+-------+----------+----------+
#|ColumnA|  START_DT|    END_DT|
#+-------+----------+----------+
#|      1|2016-01-01|2020-02-04|
#+-------+----------+----------+

Lastly, you can implement your own version of between:

from pyspark.sql.functions import col

df.where((col("start_dt") <= lit(test_date)) & (col("end_dt") >= lit(test_date))).show()
#+-------+----------+----------+
#|ColumnA|  START_DT|    END_DT|
#+-------+----------+----------+
#|      1|2016-01-01|2020-02-04|
#+-------+----------+----------+
Share:
15,881

Related videos on Youtube

Joaquin
Author by

Joaquin

Physics student (Master Degree) in Universidad de Buenos Aires. Coding in R and Cypher/Neo4j for my thesis: "Physical-statistical Analysis of Organization Patterns in Complex Systems, Complex Networks and Knowledge Extraction in Biology Systems".

Updated on June 04, 2022

Comments

  • Joaquin
    Joaquin almost 2 years

    I'm using pyspark 2.1 and i have a dataframe with two columns with date format like this:

    Column A ,  START_DT       ,  END_DT
    1        ,  2016-01-01     ,  2020-02-04
    16       ,  2017-02-23     ,  2017-12-24
    

    I want to filter for a certain date (for example 2018-12-31) between the date from START_DT and END_DT (in the example there, the second row would be filtered).

    Both START_DT and END_DT columns are already in date format, i was looking for a method like the sql:

    SELECT *
    FROM  MYTABLE  WHERE  '2018-12-31' BETWEEN start_dt AND end_dt
    
  • Ajak6
    Ajak6 almost 3 years
    Does it need the column to be of type column or StringType will work as well?