multiple conditions for filter in spark data frames

Solution 1

Instead of:

df2 = df1.filter("Status=2" || "Status =3")

Try:

df2 = df1.filter($"Status" === 2 || $"Status" === 3)

Solution 2

This question has been answered, but for future reference I would like to mention that, in the context of this question, the where and filter methods on a Dataset/DataFrame support two syntaxes. The SQL string parameters:

df2 = df1.filter("Status = 2 or Status = 3")

and Column-based parameters (mentioned by @David):

df2 = df1.filter($"Status" === 2 || $"Status" === 3)

It seems the OP combined these two syntaxes. Personally, I prefer the first syntax because it's cleaner and more generic.
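For completeness, a minimal runnable sketch (the local SparkSession and the sample data here are made up for illustration) showing that both syntaxes select the same rows:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative setup; in the question, df1 already exists.
val spark = SparkSession.builder().master("local[*]").appName("filter-demo").getOrCreate()
import spark.implicits._

// Hypothetical data with a Status column
val df1 = Seq((1, "a"), (2, "b"), (3, "c"), (4, "d")).toDF("Status", "name")

// SQL-string syntax
val df2Sql = df1.filter("Status = 2 or Status = 3")
// Column-based syntax
val df2Col = df1.filter($"Status" === 2 || $"Status" === 3)
```

Both filters keep exactly the rows with Status 2 and 3; mixing the two styles, as in the original question, is what fails.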

Solution 3

In Spark/Scala, it's pretty easy to filter with varargs.

val d = spark.read... // data contains a column named matid
val ids = Seq("BNBEL0608AH", "BNBEL00608H")
val filtered = d.filter($"matid".isin(ids:_*))
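A self-contained sketch of the varargs pattern above (the data and ids here are illustrative), also showing `isInCollection`, which on Spark 2.4+ accepts the collection directly without the `: _*` expansion:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Illustrative stand-in for the spark.read... DataFrame
val d = Seq("BNBEL0608AH", "XYZ", "BNBEL00608H").toDF("matid")
val ids = Seq("BNBEL0608AH", "BNBEL00608H")

val filtered  = d.filter($"matid".isin(ids: _*))       // expand the Seq to varargs
val filtered2 = d.filter($"matid".isInCollection(ids)) // Spark 2.4+: pass the collection as-is
```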

Solution 4

df2 = df1.filter("Status = 2 OR Status = 3")

Worked for me.

Solution 5

In a Java Spark Dataset it can be used as:

Dataset<Row> userFilter = user.filter(col("gender").isin("male", "female"));

Author: dheee

Updated on August 11, 2021

Comments

  • dheee
    dheee over 2 years

I have a data frame with four fields. One of the fields is named Status, and I am trying to use an OR condition in .filter on the dataframe. I tried the queries below, but no luck.

    df2 = df1.filter(("Status=2") || ("Status =3"))
    
    df2 = df1.filter("Status=2" || "Status =3")
    

Has anyone used this before? I have seen a similar question on Stack Overflow here. They used the code below for an OR condition, but that code is for pyspark.

    from pyspark.sql.functions import col

    numeric_filtered = df.where(
        (col('LOW')    != 'null') |
        (col('NORMAL') != 'null') |
        (col('HIGH')   != 'null'))
    numeric_filtered.show()
    
  • Boern
    Boern over 7 years
    The opposite of === is =!=
  • David Griffin
    David Griffin over 7 years
    Depends on version -- for pre-2.0, use !== but after version 2.0.0 !== does not have the same precedence as ===, use =!= instead
  • Omkar Puttagunta
    Omkar Puttagunta about 7 years
    How to filter DF on multiple columns in Java. I have something like this df.filter(df.col("name").equalTo("john")). I want to filter on multiple columns in a single line?
  • David Schuler
    David Schuler about 7 years
    You can just add another .filter after your current one. df.filter(df.col("name").equalTo("john")).filter(df.col("name").equalTo("tim"))
  • WestCoastProjects
    WestCoastProjects almost 7 years
    @DavidSchuler Do those chained filters get merged into a single worker stage in the Spark Analyzer?
  • biobirdman
    biobirdman over 4 years
    this is an AND filter, not an OR
  • Vicky
    Vicky about 4 years
    @DavidGriffin how to add "is null" condition with OR(||) here?
  • J. P
    J. P almost 4 years
    This condition first filters the dataset where Status=2 and then filters the resulting dataset where Status=3, and hence it is an AND condition, not OR
  • Joshua Chen
    Joshua Chen over 2 years
    I think you need to write as myList:_* to convert to varargs, or use isInCollection
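Pulling the comment thread's points together, a small sketch (illustrative data, Spark 2.0+ syntax) of `=!=` and of why chained filters behave as AND:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((2, "a"), (3, "b"), (4, "c")).toDF("Status", "name")

// =!= is "not equal" on Columns in Spark 2.0+ (pre-2.0 code used !==).
val notTwo = df.filter($"Status" =!= 2)

// Chaining .filter calls ANDs the predicates together...
val chained = df.filter($"Status" === 2).filter($"Status" === 3) // empty: AND, not OR
// ...so an OR must be expressed inside a single filter with ||.
val orBoth = df.filter($"Status" === 2 || $"Status" === 3)
```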