How to test if a string contains one of the substrings in a list, in pandas?


Solution 1

One option is to use the regex | (OR) character to match any of the substrings against the words in your Series s (still using str.contains).

You can construct the regex by joining the words in searchfor with |:

>>> searchfor = ['og', 'at']
>>> s[s.str.contains('|'.join(searchfor))]
0    cat
1    hat
2    dog
3    fog
dtype: object
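For a self-contained run of the approach above, here is a minimal sketch. It assumes s is the Series from the question, pd.Series(['cat', 'hat', 'dog', 'fog', 'pet']); the names pattern, mask and result are introduced here for illustration only.

```python
import pandas as pd

# Assumed input: the example Series from the question.
s = pd.Series(['cat', 'hat', 'dog', 'fog', 'pet'])
searchfor = ['og', 'at']

# Join the substrings with | so the regex matches any one of them.
pattern = '|'.join(searchfor)   # 'og|at'

# str.contains returns a boolean Series; use it as a mask.
mask = s.str.contains(pattern)
result = s[mask]

print(result)
```

Keeping the mask in its own variable is handy if you also need the complement, for example s[~mask] for the rows that match none of the substrings.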

As @AndyHayden noted in the comments below, take care if your substrings have special characters such as $ and ^ which you want to match literally. These characters have specific meanings in the context of regular expressions and will affect the matching.

You can make your list of substrings safer by escaping non-alphanumeric characters with re.escape:

>>> import re
>>> matches = ['$money', 'x^y']
>>> safe_matches = [re.escape(m) for m in matches]
>>> safe_matches
['\\$money', 'x\\^y']

The strings in this new list will match each character literally when used with str.contains.
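To see the effect of escaping when the list is actually passed to str.contains, here is a small sketch. The Series t and its contents are made up for illustration; only re.escape and str.contains come from the answer above.

```python
import re
import pandas as pd

# Hypothetical data: two strings containing regex metacharacters, two without.
t = pd.Series(['$money', 'x^y', 'money', 'xy'])
matches = ['$money', 'x^y']

# Escape each substring, then join with | as before.
safe_pattern = '|'.join(re.escape(m) for m in matches)
literal_hits = t[t.str.contains(safe_pattern)]

# Without escaping, $ and ^ act as end/start-of-string anchors,
# so the pattern no longer means "contains these exact characters".
unescaped_hits = t[t.str.contains('|'.join(matches))]

print(literal_hits.tolist())
print(unescaped_hits.tolist())
```

With the escaped pattern only the two strings that literally contain '$money' or 'x^y' are selected, while the unescaped pattern selects nothing here.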

Solution 2

You can use str.contains alone with a regex pattern using OR (|):

s[s.str.contains('og|at')]

Or you could add the series to a dataframe then use str.contains:

df = pd.DataFrame(s)
df[s.str.contains('og|at')] 

Output:

0 cat
1 hat
2 dog
3 fog 

Solution 3

Here is a one-line lambda that also works:

df["TrueFalse"] = df['col1'].apply(lambda x: 1 if any(i in x for i in searchfor) else 0)

Input:

searchfor = ['og', 'at']

df = pd.DataFrame([('cat', 1000.0), ('hat', 2000000.0), ('dog', 1000.0), ('fog', 330000.0),('pet', 330000.0)], columns=['col1', 'col2'])

  col1       col2
0  cat     1000.0
1  hat  2000000.0
2  dog     1000.0
3  fog   330000.0
4  pet   330000.0

Apply Lambda:

df["TrueFalse"] = df['col1'].apply(lambda x: 1 if any(i in x for i in searchfor) else 0)

Output:

    col1    col2        TrueFalse
0   cat     1000.0      1
1   hat     2000000.0   1
2   dog     1000.0      1
3   fog     330000.0    1
4   pet     330000.0    0
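Note that the lambda uses Python's plain in operator, so the substrings are matched literally and need no escaping. As a rough sketch of how this relates to Solution 1, the same 0/1 column can be built with str.contains on an escaped pattern. The column name ViaContains is hypothetical and only used here for comparison.

```python
import re
import pandas as pd

searchfor = ['og', 'at']
df = pd.DataFrame(
    [('cat', 1000.0), ('hat', 2000000.0), ('dog', 1000.0),
     ('fog', 330000.0), ('pet', 330000.0)],
    columns=['col1', 'col2'],
)

# Row-wise literal substring check (the lambda from this solution).
df['TrueFalse'] = df['col1'].apply(
    lambda x: 1 if any(i in x for i in searchfor) else 0
)

# Vectorised equivalent: escaped regex alternation, then bool -> int.
pattern = '|'.join(re.escape(w) for w in searchfor)
df['ViaContains'] = df['col1'].str.contains(pattern).astype(int)

print(df)
```

Both columns should agree row by row; which variant is faster depends on the data, so it is worth timing on your own DataFrame.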

Author by

ari

Updated on November 06, 2021

Comments

  • ari
    ari over 2 years

    Is there any function that would be the equivalent of a combination of df.isin() and df[col].str.contains()?

    For example, say I have the series s = pd.Series(['cat','hat','dog','fog','pet']) and I want to find all places where s contains any of ['og', 'at']. I would want to get everything but 'pet'.

    I have a solution, but it's rather inelegant:

    searchfor = ['og', 'at']
    found = [s.str.contains(x) for x in searchfor]
    result = pd.DataFrame(found)
    result.any()
    

    Is there a better way to do this?

    • jpp
      jpp about 6 years
      Note: There is a solution described by @unutbu which is more efficient than using pd.Series.str.contains. If performance is an issue, then this may be worth investigating.
    • cs95
      cs95 about 5 years
      Highly recommend checking out this answer for partial string search using multiple keywords/regexes (scroll down to the "Multiple Substring Search" subheading).
  • goofd
    goofd over 9 years
    maybe good to add this link pandas.pydata.org/pandas-docs/stable/… too. Starting from pandas 0.15, the string operations are even easier
  • Andy Hayden
    Andy Hayden over 9 years
    one thing you have to take care with is if a string in searchfor has special regex characters (you can map with re.escape).
  • Alex Riley
    Alex Riley over 9 years
    @AndyHayden Thank you, I've improved my answer to take this complication into account.
  • Doo Hyun Shin
    Doo Hyun Shin about 5 years
    I don't know why your method doesn't work with "str.startswith('|'.join(searchfor))"
  • JacoSolari
    JacoSolari about 4 years
    how to do it for AND?
  • James
    James about 4 years
    @JacoSolari check out this answer stackoverflow.com/questions/37011734/…
  • JacoSolari
    JacoSolari about 4 years
    @James yes, thanks. For completeness, here is the most upvoted one-liner in that answer: df.col.str.contains(r'(?=.*apple)(?=.*banana)',regex=True)
  • emremrah
    emremrah over 3 years
    I did it as df.loc[df.col1.apply(lambda x: True if any(i in x for i in searchfor) else False)] and it went well, thanks.
  • The Dan
    The Dan over 3 years
    in this case I understand we use "|" for OR, how could we use AND??