Filtering a Pyspark DataFrame with SQL-like IN clause
Solution 1
The string you pass to SQLContext is evaluated in the scope of the SQL environment. It doesn't capture the closure. If you want to pass a variable you'll have to do it explicitly, using string formatting:
df = sc.parallelize([(1, "foo"), (2, "x"), (3, "bar")]).toDF(("k", "v"))
df.registerTempTable("df")
# The tuple's Python repr, ('foo', 'bar'), happens to be valid SQL syntax here.
sqlContext.sql("SELECT * FROM df WHERE v IN {0}".format(("foo", "bar"))).count()
## 2
Obviously this is not something you would use in a "real" SQL environment due to security considerations, but it shouldn't matter here.
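If the values come from another job, here is a minimal sketch of building the same IN list explicitly, with single-quote escaping to blunt the injection concern (the values list below is hypothetical):
# Hypothetical: values extracted from another job.
values = ["foo", "bar"]
# Escape embedded single quotes and build the SQL list literal explicitly.
in_clause = ", ".join("'{}'".format(v.replace("'", "''")) for v in values)
sqlContext.sql("SELECT * FROM df WHERE v IN ({})".format(in_clause)).count()
## 2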
In practice, the DataFrame DSL is a much better choice when you want to create dynamic queries:
from pyspark.sql.functions import col
df.where(col("v").isin({"foo", "bar"})).count()
## 2
It is easy to build and compose, and it handles all the details of HiveQL / Spark SQL for you.
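As a small illustration of that composability (reusing df from above), expressions built with col are plain Column objects that can be stored, reused, and combined with & and |:
from pyspark.sql.functions import col

# Keep a reusable predicate and combine it with another condition.
keep = col("v").isin({"foo", "bar"}) & (col("k") > 1)
df.where(keep).count()
## 1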
Solution 2
Reiterating what @zero323 has mentioned above: we can do the same thing using a list as well (not only a set), like below:
from pyspark.sql.functions import col
df.where(col("v").isin(["foo", "bar"])).count()
Solution 3
Just a little addition/update:
choice_list = ["foo", "bar", "jack", "joan"]
If you want to filter your dataframe df so that you keep only the rows where the column v takes one of the values in choice_list, then:
from pyspark.sql.functions import col
df_filtered = df.where(col("v").isin(choice_list))
Solution 4
You can also do this for integer columns:
df_filtered = df.filter("field1 in (1,2,3)")
or this for string columns:
df_filtered = df.filter("field1 in ('a','b','c')")
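When the filter values are only known at runtime, the same SQL-string style can be assembled from a plain Python list; ids below is a hypothetical list produced by another job (the quoting caveat from Solution 1 applies):
# Hypothetical list of integer values coming from another job.
ids = [1, 2, 3]
df_filtered = df.filter("field1 in ({})".format(",".join(str(i) for i in ids)))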
Comments
-
mar tin almost 2 years
I want to filter a Pyspark DataFrame with a SQL-like IN clause, as in
sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.sql('SELECT * from my_df WHERE field1 IN a')
where a is the tuple (1, 2, 3). I am getting this error:
java.lang.RuntimeException: [1.67] failure: ``('' expected but identifier a found
which is basically saying it was expecting something like '(1, 2, 3)' instead of a. The problem is I can't manually write the values in a as it's extracted from another job.
How would I filter in this case?
-
mar tin about 8 years
For the second method, you can achieve the same by doing df.where(df.v.isin({"foo", "bar"})).count()
-
zero323 about 8 years
You can, but personally I don't like this approach. With col I can easily decouple the SQL expression from a particular DataFrame object. So you can, for example, keep a dictionary of useful expressions and just pick them when you need them. With an explicit DF object you'll have to put it inside a function and it doesn't compose that well.
-
Stefan Falk about 6 years
How can this be done with a list of tuples? If I have e.g. [(1,1), (1,2), (1,3)] where one is aid and the other is bid, for example. It would have to be something like col(['aid', 'bid']).isin([(1,1), (1,2)]) (see the sketch after these comments)
-
E B about 6 years
@zero323 is there a negation of isin, like NOT IN, in Spark SQL?
-
pissall almost 6 years
Yes, you can use ~ (see the sketch below).
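To make those last two threads concrete, here is a small sketch: the ~ negation is standard PySpark, while the aid/bid pair filter is one assumed way to emulate a multi-column IN by OR-ing per-pair equality tests (the pair data below is made up):
from functools import reduce
from operator import or_
from pyspark.sql.functions import col

# NOT IN: negate an isin filter with ~ (reusing df from above).
df.where(~col("v").isin(["foo", "bar"])).count()
## 1

# Multi-column membership, for the hypothetical aid/bid question above:
# build one boolean Column by OR-ing an equality test per allowed pair.
pairs_df = sc.parallelize([(1, 1), (2, 5), (1, 3)]).toDF(("aid", "bid"))
allowed = [(1, 1), (1, 2), (1, 3)]
cond = reduce(or_, [(col("aid") == a) & (col("bid") == b) for a, b in allowed])
pairs_df.where(cond).count()
## 2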