How to filter a column on values in a list in PySpark?
The function between is used to check whether a value lies between two values; its input is a lower bound and an upper bound. It cannot be used to check whether a column value is in a list. To do that, use isin:

import pyspark.sql.functions as f

df = dfRawData.where(f.col("X").isin(["CB", "CI", "CR"]))
Author: LKA
Updated on July 05, 2022

Comments
-
LKA almost 2 years
I have a dataframe dfRawData on which I have to apply a filter condition on column X with the values CB, CI, and CR. So I used the code below:

df = dfRawData.filter(col("X").between("CB", "CI", "CR"))

But I am getting the following error:

between() takes exactly 3 arguments (4 given)

Please let me know how I can resolve this issue.
-
bantmen over 4 years
Related: stackoverflow.com/a/58541958/3712254. I found the join implementation to be faster than where.
-
DataBach over 2 years
Extending on @Shaido's answer: if you want to negate the statement, you can use the ~ sign like so:

df = dfRawData.where(~f.col("X").isin(["CB", "CI", "CR"]))