Filter pyspark dataframe if contains a list of strings

10,407

IIUC, you want to return the rows in which column_a is "like" (in the SQL sense) any of the values in list_a.

One way is to use functools.reduce:

from functools import reduce

list_a = ['string', 'third']

df1 = df.where(
    reduce(lambda a, b: a|b, (df['column_a'].like('%'+pat+"%") for pat in list_a))
)
df1.show()
#+------------+-----+
#|    column_a|count|
#+------------+-----+
#| some_string|   10|
#|third_string|   30|
#+------------+-----+

Essentially you loop over all of the possible strings in list_a to compare in like and "OR" the results. Here is the execution plan:

df1.explain()
#== Physical Plan ==
#*(1) Filter (Contains(column_a#0, string) || Contains(column_a#0, third))
#+- Scan ExistingRDD[column_a#0,count#1]

Another option is to use pyspark.sql.Column.rlike instead of like.

df2 = df.where(
    df['column_a'].rlike("|".join(["(" + pat + ")" for pat in list_a]))
)

df2.show()
#+------------+-----+
#|    column_a|count|
#+------------+-----+
#| some_string|   10|
#|third_string|   30|
#+------------+-----+

Which has the corresponding execution plan:

df2.explain()
#== Physical Plan ==
#*(1) Filter (isnotnull(column_a#0) && column_a#0 RLIKE (string)|(third))
#+- Scan ExistingRDD[column_a#0,count#1]
Share:
10,407
Saeed Esmaili
Author by

Saeed Esmaili

Updated on June 22, 2022

Comments

  • Saeed Esmaili
    Saeed Esmaili almost 2 years

    Suppose that we have a pyspark dataframe that one of its columns (column_a) contains some string values, and also there is a list of strings (list_a).

    Dataframe:

    column_a      | count
    some_string   |  10
    another_one   |  20
    third_string  |  30
    

    list_a:

    ['string', 'third', ...]
    

    I want to filter this dataframe and only keep the rows if column_a's value contains one of list_a's items.

    This is the code that works to filter the column_a based on a single string:

    df['column_a'].like('%string_value%')
    

    But how can we get the same result for a list of strings? (Keep the rows that column_a's value is 'string', 'third', ...)