How to filter a python Spark DataFrame by date between two date format columns
If you have a working SQL query, you can always register your DataFrame as a temp table and use spark.sql()
:
df.createOrReplaceTempView("MYTABLE")
spark.sql("SELECT * FROM MYTABLE WHERE '2018-12-31' BETWEEN start_dt AND end_dt").show()
#+-------+----------+----------+
#|ColumnA| START_DT| END_DT|
#+-------+----------+----------+
#| 1|2016-01-01|2020-02-04|
#+-------+----------+----------+
Another option is to pass an expression to where
:
df.where("'2018-12-31' BETWEEN start_dt AND end_dt").show()
#+-------+----------+----------+
#|ColumnA| START_DT| END_DT|
#+-------+----------+----------+
#| 1|2016-01-01|2020-02-04|
#+-------+----------+----------+
One more way is to use pyspark.sql.Column.between
with pyspark.sql.functions.lit
, but you'll have to use pyspark.sql.functions.expr
in order to use a column value as a parameter.
from pyspark.sql.functions import lit, expr
test_date = "2018-12-31"
df.where(lit(test_date).between(expr('start_dt'), expr('end_dt'))).show()
#+-------+----------+----------+
#|ColumnA| START_DT| END_DT|
#+-------+----------+----------+
#| 1|2016-01-01|2020-02-04|
#+-------+----------+----------+
Lastly, you can implement your own version of between
:
from pyspark.sql.functions import col
df.where((col("start_dt") <= lit(test_date)) & (col("end_dt") >= lit(test_date))).show()
#+-------+----------+----------+
#|ColumnA| START_DT| END_DT|
#+-------+----------+----------+
#| 1|2016-01-01|2020-02-04|
#+-------+----------+----------+
Related videos on Youtube
Joaquin
Physics student (Master Degree) in Universidad de Buenos Aires. Coding in R and Cypher/Neo4j for my thesis: "Physical-statistical Analysis of Organization Patterns in Complex Systems, Complex Networks and Knowledge Extraction in Biology Systems".
Updated on June 04, 2022Comments
-
Joaquin almost 2 years
I'm using pyspark 2.1 and i have a dataframe with two columns with date format like this:
Column A , START_DT , END_DT 1 , 2016-01-01 , 2020-02-04 16 , 2017-02-23 , 2017-12-24
I want to filter for a certain date (for example
2018-12-31
) between the date fromSTART_DT
andEND_DT
(in the example there, the second row would be filtered).Both
START_DT
andEND_DT
columns are already in date format, i was looking for a method like the sql:SELECT * FROM MYTABLE WHERE '2018-12-31' BETWEEN start_dt AND end_dt
-
Ajak6 almost 3 yearsDoes it need the column to be of type column or StringType will work as well?