Multiple conditions for filter in Spark data frames
Solution 1
Instead of:
df2 = df1.filter("Status=2" || "Status =3")
Try:
df2 = df1.filter($"Status" === 2 || $"Status" === 3)
Solution 2
This question has been answered, but for future reference I would like to mention that, in the context of this question, the where and filter methods on Dataset/DataFrame support two syntaxes:
The SQL string parameters:
df2 = df1.filter("Status = 2 or Status = 3")
and Column-based parameters (mentioned by @David):
df2 = df1.filter($"Status" === 2 || $"Status" === 3)
It seems the OP combined these two syntaxes. Personally, I prefer the first syntax because it's cleaner and more generic.
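As a small illustrative sketch of why the SQL-string form is generic (the Amount column here is hypothetical, not from the question), it composes naturally with other SQL predicates:

```scala
// Hedged sketch: assumes df1 has a Status column and a hypothetical Amount column
val df2 = df1.filter("Status in (2, 3) and Amount > 0")
```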
Solution 3
In Spark/Scala, it's pretty easy to filter with varargs.
val d = spark.read... // data contains a column named matid
val ids = Seq("BNBEL0608AH", "BNBEL00608H")
val filtered = d.filter($"matid".isin(ids:_*))
Solution 4
df2 = df1.filter("Status = 2 OR Status = 3")
Worked for me.
Solution 5
In a Java Spark Dataset it can be used as:
Dataset<Row> userfilter = user.filter(col("gender").isin("male", "female"));
dheee
Updated on August 11, 2021
Comments
-
dheee over 2 years
I have a data frame with four fields. One of the fields is named Status, and I am trying to use an OR condition in .filter on a dataframe. I tried the queries below, but no luck.
df2 = df1.filter(("Status=2") || ("Status =3"))
df2 = df1.filter("Status=2" || "Status =3")
Has anyone used this before? I have seen a similar question on Stack Overflow. They used the code below for an OR condition, but that code is for PySpark.
from pyspark.sql.functions import col
numeric_filtered = df.where(
    (col('LOW') != 'null') |
    (col('NORMAL') != 'null') |
    (col('HIGH') != 'null'))
numeric_filtered.show()
-
Boern over 7 years: The opposite of === is =!=
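A minimal sketch of the point above, assuming a DataFrame df with a Status column (Spark 2.0+):

```scala
// === tests column equality; =!= is its negation in Spark 2.0+
val notTwo = df.filter($"Status" =!= 2)
// equivalently, via the SQL string syntax:
val notTwoSql = df.filter("Status != 2")
```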
-
David Griffin over 7 years: Depends on version -- for pre-2.0, use !==, but after version 2.0.0 !== does not have the same precedence as ===; use =!= instead.
-
Omkar Puttagunta about 7 years: How to filter a DF on multiple columns in Java? I have something like df.filter(df.col("name").equalTo("john")). I want to filter on multiple columns in a single line.
-
David Schuler about 7 years: You can just add another .filter after your current one: df.filter(df.col("name").equalTo("john")).filter(df.col("name").equalTo("tim"))
-
WestCoastProjects almost 7 years: @DavidSchuler Do those chained filters get merged into a single worker stage in the Spark analyzer?
-
biobirdman over 4 years: this is an AND filter, not an OR
-
Vicky about 4 years: @DavidGriffin how to add an "is null" condition with OR (||) here?
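One way that could look (a hedged sketch, not from the thread itself, assuming the same Status column):

```scala
// Column.isNull combines with || like any other predicate
val df2 = df1.filter($"Status".isNull || $"Status" === 3)
// or, in the SQL string syntax:
val df2Sql = df1.filter("Status is null or Status = 3")
```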
-
J. P almost 4 years: This condition first filters the dataset where Status=2 and then filters the resulting dataset where Status=3; hence it is an AND condition, not OR.
-
Joshua Chen over 2 years: I think you need to write myList:_* to convert to varargs, or use isInCollection
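For reference, both spellings from this comment could look like the following sketch against the data from Solution 3 (isInCollection is available from Spark 2.4 onward):

```scala
val ids = Seq("BNBEL0608AH", "BNBEL00608H")
// varargs expansion of the Seq:
val f1 = d.filter($"matid".isin(ids: _*))
// or, without the varargs conversion (Spark 2.4+):
val f2 = d.filter($"matid".isInCollection(ids))
```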