How to filter Pandas Dataframe rows which contains any string from a list?

11,319

Solution 1

Setup

df = pd.DataFrame(dict(
    A=['I need avocado', 'something', 'useless', 'nothing'],
    B=['something', 'I eat margarina', 'eat apple', 'more nothing']
))
includeKeywords = ["apple", "avocado", "bannana"]

Problem

                A                B
0  I need avocado        something  # True 'avocado' in A
1       something  I eat margarina
2         useless        eat apple  # True 'apple' in B
3         nothing     more nothing

Solution


df[df.stack().str.contains('|'.join(includeKeywords)).any(level=0)]

                A          B
0  I need avocado  something
2         useless  eat apple

Details

This produces a regex search string. In regex, '|' means or. So for a regex search, this says match 'apple', 'avocado', or 'bannana'

kwstr = '|'.join(includeKeywords)
print(kwstr)

apple|avocado|bannana

Stacking will flatten our DataFrame

df.stack()

0  A     I need avocado
   B          something
1  A          something
   B    I eat margarina
2  A            useless
   B          eat apple
3  A            nothing
   B       more nothing
dtype: object

Fortunately, the pandas.Series.str.contains method can handle regex and it will produce a boolean Series

df.stack().str.contains(kwstr)

0  A     True
   B    False
1  A    False
   B    False
2  A    False
   B     True
3  A    False
   B    False
dtype: bool

At which point we can cleverly use pandas.Series.any by suggesting it only care about level=0

mask = df.stack().str.contains(kwstr).any(level=0)
mask

0     True
1    False
2     True
3    False
dtype: bool

By using level=0 we preserved the original index in the resulting Series. This makes it perfect for filtering df

df[mask]

                A          B
0  I need avocado  something
2         useless  eat apple

Solution 2

Take advantage of the any() function and use a list comprenesion in an df.apply()

df = pd.DataFrame(["I need avocado","I eat margarina"])
print(df)
#                  0
# 0   I need avocado
# 1  I eat margarina

includeKeywords = ["apple","avocado","bannana"]

print(df[df.apply(lambda r: any([kw in r[0] for kw in includeKeywords]), axis=1)])
#                 0
# 0  I need avocado

To make this a bit clearer, you basically need to make a mask that returns True/False for each row

mask = [any([kw in r for kw in includeKeywords]) for r in df[0]]
print(mask)

Then you can use that mask to print the selected rows in your DataFrame

# [True, False]
print(df[mask])
#                 0
# 0  I need avocado

I am showing you both ways because while the df.apply() method is handy for a one liner, it is really slow compared to a standard list comprehension. So if you have a small enough set, feel free to use df.apply(). Otherwise, I'd suggest the python comprehension over the pandas method.

Share:
11,319
BestR
Author by

BestR

Updated on July 19, 2022

Comments

  • BestR
    BestR almost 2 years

    I have dataframe that has values like those:

      A                    B
    ["I need avocado"   "something"]
    ["something"      "I eat margarina"]
    

    And I want to find rows that:

    In any column of the row, the column's value is contained in a list. for example, for the list:

    ["apple","avocado","bannana"]
    

    And only this line should match: ["I need avocado" "something"]

    This line doesnt work:

    dataFiltered[dataFiltered[col].str.contains(*includeKeywords)]
    

    Returns:

    {TypeError}unsupported operand type(s) for &: 'str' and 'int'
    

    What should I do?

  • BestR
    BestR almost 5 years
    Thank you. Your first solution didnt work for me. The second one only works for one column. I want to get all rows that contains the values of the list in any column
  • BestR
    BestR almost 5 years
    I can "join" masks to one mask, but is there a way with pandas?
  • Cohan
    Cohan almost 5 years
    Well, there are several ways you could do it. You could put another comprehension layer, or make a series of masks in a list of lists than zip that up and do [any(i1, i2, i3,...) for i1, i2, i3 in zip(m1, m2, m3)]
  • BestR
    BestR almost 5 years
    Wow! short and perfect