Filter pyspark dataframe if contains a list of strings
IIUC, you want to return the rows in which column_a is "like" (in the SQL sense) any of the values in list_a.

One way is to use functools.reduce:
from functools import reduce

list_a = ['string', 'third']

df1 = df.where(
    reduce(lambda a, b: a | b, (df['column_a'].like('%' + pat + '%') for pat in list_a))
)
df1.show()
#+------------+-----+
#| column_a|count|
#+------------+-----+
#| some_string| 10|
#|third_string| 30|
#+------------+-----+
Essentially, you loop over all of the patterns in list_a, compare each one with like, and "OR" the results together. Here is the execution plan:
df1.explain()
#== Physical Plan ==
#*(1) Filter (Contains(column_a#0, string) || Contains(column_a#0, third))
#+- Scan ExistingRDD[column_a#0,count#1]
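The OR-folding that reduce performs can be sketched in plain Python, with no Spark required. Here, substring checks stand in for like, since '%pat%' in SQL LIKE is effectively a substring match (the sample rows are taken from the question):

```python
from functools import reduce

list_a = ['string', 'third']
rows = ['some_string', 'another_one', 'third_string']

# For each row, build one boolean per pattern ("does the row contain it?")
# and fold them together with OR, exactly as reduce combines the
# Column expressions in the Spark version above.
kept = [
    row for row in rows
    if reduce(lambda a, b: a or b, (pat in row for pat in list_a))
]
print(kept)  # ['some_string', 'third_string']
```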
Another option is to use pyspark.sql.Column.rlike instead of like.
df2 = df.where(
    df['column_a'].rlike("|".join("(" + pat + ")" for pat in list_a))
)
df2.show()
#+------------+-----+
#| column_a|count|
#+------------+-----+
#| some_string| 10|
#|third_string| 30|
#+------------+-----+
Which has the corresponding execution plan:
df2.explain()
#== Physical Plan ==
#*(1) Filter (isnotnull(column_a#0) && column_a#0 RLIKE (string)|(third))
#+- Scan ExistingRDD[column_a#0,count#1]
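One caveat with the rlike approach: the joined patterns are interpreted as a regular expression, so if any entry in list_a could contain regex metacharacters, it is safer to escape each pattern before building the alternation. A minimal sketch using Python's re.escape (the extra 'a.b+c' entry is a made-up example to show the escaping):

```python
import re

list_a = ['string', 'third', 'a.b+c']  # 'a.b+c' contains regex metacharacters

# Escape each pattern so it matches literally, then build the alternation
# string that would be passed to rlike.
pattern = "|".join("(" + re.escape(pat) + ")" for pat in list_a)
print(pattern)  # (string)|(third)|(a\.b\+c)
```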
Saeed Esmaili
Updated on June 22, 2022

Comments

- Saeed Esmaili, almost 2 years ago:
Suppose that we have a pyspark dataframe in which one of the columns (column_a) contains string values, and there is also a list of strings (list_a).

Dataframe:

column_a     | count
some_string  | 10
another_one  | 20
third_string | 30

list_a:

['string', 'third', ...]

I want to filter this dataframe and only keep the rows where column_a's value contains one of list_a's items.
This is the code that works to filter column_a based on a single string:

df['column_a'].like('%string_value%')

But how can we get the same result for a list of strings? (Keep the rows where column_a's value contains 'string', 'third', ...)