How to filter column on values in list in pyspark?


The function between checks whether a value lies between two values; its inputs are a lower bound and an upper bound, so it takes exactly two arguments. It cannot check whether a column value is in a list. For that, use isin:

import pyspark.sql.functions as f
df = dfRawData.where(f.col("X").isin(["CB", "CI", "CR"]))
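The difference between the two semantics can be sketched in plain Python, without Spark (the rows and column values here are made up for illustration): between is a range check against a lower and upper bound, while isin is a membership check against a list.

```python
# Hypothetical sample rows standing in for a Spark dataframe.
rows = [{"X": "CB"}, {"X": "AA"}, {"X": "CR"}]

# between("CB", "CI") semantics: keep values in the closed range [lower, upper].
range_hits = [r for r in rows if "CB" <= r["X"] <= "CI"]

# isin(["CB", "CI", "CR"]) semantics: keep values that are members of the list.
member_hits = [r for r in rows if r["X"] in ["CB", "CI", "CR"]]

print([r["X"] for r in range_hits])   # "CR" sorts after "CI", so only "CB" passes
print([r["X"] for r in member_hits])  # both "CB" and "CR" are in the list
```

Note that "CR" falls outside the range ["CB", "CI"] under string ordering, which is why a between call can silently drop values you meant to keep even when it doesn't raise an error.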

Author: LKA
Updated on July 05, 2022

Comments

  • LKA, almost 2 years ago

    I have a dataframe, rawdata, on which I have to apply a filter condition on column X with the values CB, CI, and CR. So I used the code below:

    df = dfRawData.filter(col("X").between("CB","CI","CR"))
    

    But I am getting the following error:

    between() takes exactly 3 arguments (4 given)

    Please let me know how I can resolve this issue.

  • DataBach, over 2 years ago

    Extending @Shaido's answer: if you want to negate the condition, you can use the ~ operator, like so: df = dfRawData.where(~f.col("X").isin(["CB", "CI", "CR"]))