Filtering a Pyspark DataFrame with SQL-like IN clause


Solution 1

The string you pass to SQLContext is evaluated in the scope of the SQL environment. It doesn't capture the closure. If you want to pass a variable, you'll have to do it explicitly using string formatting:

df = sc.parallelize([(1, "foo"), (2, "x"), (3, "bar")]).toDF(("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT * FROM df WHERE v IN {0}".format(("foo", "bar"))).count()
## 2
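
If the values come from a variable (as in the question, where they are extracted from another job), the same formatting works; a minimal sketch using the integer column k:

a = (1, 2, 3)  # e.g. extracted from another job
sqlContext.sql("SELECT * FROM df WHERE k IN {0}".format(tuple(a))).count()
## 3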

Obviously this is not something you would use in a "real" SQL environment due to security considerations, but it shouldn't matter here.

In practice, the DataFrame DSL is a much better choice when you want to create dynamic queries:

from pyspark.sql.functions import col

df.where(col("v").isin({"foo", "bar"})).count()
## 2

It is easy to build and compose, and it handles all the details of HiveQL / Spark SQL for you.
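
Because isin returns an ordinary Column expression, it also composes with other conditions; a minimal sketch combining it with a second, purely illustrative predicate on k:

from pyspark.sql.functions import col

# combine the membership test with another condition using &
df.where(col("v").isin({"foo", "bar"}) & (col("k") > 1)).count()
## 1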

Solution 2

Reiterating what @zero323 mentioned above: we can do the same thing using a list as well (not only a set), like below:

from pyspark.sql.functions import col

df.where(col("v").isin(["foo", "bar"])).count()

Solution 3

Just a little addition/update:

choice_list = ["foo", "bar", "jack", "joan"]

If you want to filter your DataFrame "df" so that you keep only the rows where column "v" takes one of the values in choice_list, then:

from pyspark.sql.functions import col

df_filtered = df.where(col("v").isin(choice_list))

Solution 4

You can also do this for integer columns:

df_filtered = df.filter("field1 in (1,2,3)")

or this for string columns:

df_filtered = df.filter("field1 in ('a','b','c')")
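
If those values come from a Python list instead of being hard-coded, one way is to build the IN expression with string formatting; a minimal sketch, assuming values holds the integers to keep:

values = [1, 2, 3]  # e.g. produced by another job
# build the "1,2,3" part of the IN clause from the list
df_filtered = df.filter("field1 in ({0})".format(",".join(str(v) for v in values)))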

Comments

  • mar tin (almost 2 years ago)

    I want to filter a Pyspark DataFrame with a SQL-like IN clause, as in

    sc = SparkContext()
    sqlc = SQLContext(sc)
    df = sqlc.sql('SELECT * from my_df WHERE field1 IN a')
    

    where a is the tuple (1, 2, 3). I am getting this error:

    java.lang.RuntimeException: [1.67] failure: ``('' expected but identifier a found

    which is basically saying it was expecting something like '(1, 2, 3)' instead of a. The problem is I can't manually write the values in a, as they're extracted from another job.

    How would I filter in this case?

  • mar tin (about 8 years ago)
    For the second method, you can achieve the same by doing df.where(df.v.isin({"foo", "bar"})).count()
  • zero323 (about 8 years ago)
    You can, but personally I don't like this approach. With col I can easily decouple the SQL expression from a particular DataFrame object, so you can, for example, keep a dictionary of useful expressions and just pick them when you need them. With an explicit DF object you'd have to put it inside a function, and it doesn't compose as well.
  • Stefan Falk (about 6 years ago)
    How can this be done with a list of tuples? If I have e.g. [(1,1), (1,2), (1,3)] where one is aid and the other is bid, for example, it would have to be something like col(['aid', 'bid']).isin([(1,1), (1,2)])
  • E B (about 6 years ago)
    @zero323 is there a negation of isin, like NOT IN, in Spark SQL?
  • pissall (almost 6 years ago)
    Yes. You can use '~'
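
For reference, a minimal sketch of that negation, reusing the df from Solution 1:

from pyspark.sql.functions import col

# ~ negates the membership test, i.e. a NOT IN filter
df.where(~col("v").isin({"foo", "bar"})).count()
## 1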