Finding occurrences of a word in a string in Python 3

Solution 1

If you're going for efficiency:

import re
count = sum(1 for _ in re.finditer(r'\b%s\b' % re.escape(word), input_string))

This doesn't need to create any intermediate lists (unlike split()) and thus will work efficiently for large input_string values.

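It also has the benefit of working correctly with punctuation: it properly returns 1 when counting "dog" in the phrase "Mike saw a dog." (whereas an argumentless split() produces the token 'dog.' and so would not). It uses \b, which is a regex word-boundary assertion rather than a flag; it matches at transitions between \w characters and anything else. In Python 3, \w on a str pattern covers Unicode letters, digits and the underscore; it is only [a-zA-Z0-9_] when the re.ASCII flag is set.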
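As a minimal sketch of the behaviour described above, the one-liner can be wrapped in a small function. The name count_word and the sample strings are illustrative and not part of the original answer:

```python
import re

def count_word(word, text):
    """Count whole-word occurrences of `word` in `text` (hypothetical helper)."""
    # re.escape() guards against regex metacharacters inside `word`
    pattern = r'\b%s\b' % re.escape(word)
    # finditer() yields matches lazily, so no intermediate list is built
    return sum(1 for _ in re.finditer(pattern, text))

print(count_word("dog", "Mike saw a dog."))    # 1
print(count_word("dog", "the dogs barked"))    # 0
# For comparison: a plain split() keeps the trailing full stop on 'dog.'
print("Mike saw a dog.".split().count("dog"))  # 0
```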

If you need to worry about languages beyond the ASCII character set, note that in Python 3 str patterns are Unicode-aware by default, so \w and \b already treat letters such as é as word characters and no extra flag is needed. For many applications that suffices; in other cases you may still need to adjust the regex to properly match word and non-word characters in those languages.
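To see what difference the ASCII/Unicode distinction makes, here is a small sketch. The helper count_word and its flags parameter are assumptions made for illustration; re.ASCII is the standard flag that restricts \w and \b to ASCII:

```python
import re

def count_word(word, text, flags=0):
    """Whole-word count; `flags` is passed through to re.finditer (hypothetical helper)."""
    pattern = r'\b%s\b' % re.escape(word)
    return sum(1 for _ in re.finditer(pattern, text, flags))

# Default (Unicode-aware) matching treats 'é' as a word character
print(count_word("café", "one café."))            # 1
print(count_word("café", "two cafés"))            # 0

# With re.ASCII, 'é' is a non-word character, so the boundaries move
print(count_word("café", "one café.", re.ASCII))  # 0
print(count_word("café", "two cafés", re.ASCII))  # 1
```

With re.ASCII the results are wrong in both directions, which is why the Python 3 default is usually what you want for non-English text.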

Solution 2

You can use str.split() to convert the sentence to a list of words:

a = 'the dogs barked'.split()

This will create the list:

['the', 'dogs', 'barked']

You can then count the number of exact occurrences using list.count():

a.count('dog')  # 0
a.count('dogs') # 1

If it needs to work with punctuation, you can use regular expressions. For example:

import re
a = re.split(r'\W', 'the dogs barked.')
a.count('dogs') # 1
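One detail of the split above: splitting on a single \W leaves empty strings in the list wherever punctuation sits at the end of the text or next to a space. That does not affect count(), but if you want a clean word list, a possible alternative (a sketch, not part of the original answer) is to collect runs of word characters with re.findall():

```python
import re

sentence = 'the dogs barked.'

# Splitting on each non-word character leaves a trailing empty string
tokens_split = re.split(r'\W', sentence)
print(tokens_split)               # ['the', 'dogs', 'barked', '']

# Collecting runs of word characters avoids the empty strings
tokens_find = re.findall(r'\w+', sentence)
print(tokens_find)                # ['the', 'dogs', 'barked']

print(tokens_find.count('dogs'))  # 1
print(tokens_find.count('dog'))   # 0
```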

Solution 3

Use a generator expression:

>>> word = "dog"
>>> str1 = "the dogs barked"
>>> sum(w == word for w in str1.split())
0

>>> word = 'dog'
>>> str1 = 'the dog barked'
>>> sum(w == word for w in str1.split())
1

split() returns a list of all the whitespace-separated words in a sentence. The generator expression then compares each word with the target, and sum() adds up the results, since True counts as 1 and False as 0.
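As the comments on this answer point out, summing the comparisons gives the same number as the built-in list.count(). A short sketch showing the two side by side (the sample sentence is made up for illustration):

```python
word = "dog"
str1 = "the dog saw another dog"
words = str1.split()

# Summing booleans: each True contributes 1 to the total
by_sum = sum(w == word for w in words)

# The list method does the same comparison internally
by_count = words.count(word)

print(by_sum, by_count)  # 2 2
```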

Solution 4

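import re

word = "dog"
text = "the dogs barked"  # renamed from str to avoid shadowing the built-in
# Note: this counts substring matches, so the "dog" in "dogs" is included
print(len(re.findall(re.escape(word), text)))  # 1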

Solution 5

You need to split the sentence into words. For your example you can do that with just

words = str1.split()

But for real-world usage you need something more advanced that also handles punctuation. For most Western languages you can get away with replacing all punctuation with spaces before doing str1.split().

This will work for English as well in simple cases, but note that "I'm" will be split into two words, "I" and "m", when it should in fact be split into "I" and "am". Handling that may be overkill for this application.

For other cases, such as Asian languages or actual real-world usage of English, you might want to use a library that does the word splitting for you.

Then you have a list of words, and you can do

count = words.count(word)
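
The punctuation-to-spaces approach described in this answer could be sketched as follows. The helper name split_words is hypothetical, and string.punctuation only covers ASCII punctuation, so treat this as a rough illustration rather than a complete tokenizer:

```python
import string

def split_words(text):
    """Replace ASCII punctuation with spaces, then split on whitespace (hypothetical helper)."""
    # Map every punctuation character to a single space
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table).split()

words = split_words("The dog barked. I'm sure it was a dog!")
print(words)
# ['The', 'dog', 'barked', 'I', 'm', 'sure', 'it', 'was', 'a', 'dog']
print(words.count("dog"))  # 2
```

Note how "I'm" becomes "I" and "m", which is exactly the limitation mentioned above.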

Author: lost9123193

Updated on December 05, 2021

Comments

  • lost9123193
    lost9123193 over 2 years

    I'm trying to find the number of occurrences of a word in a string.

    word = "dog"
    str1 = "the dogs barked"
    

    I used the following to count the occurrences:

    count = str1.count(word)
    

    The issue is I want an exact match. So the count for this sentence would be 0. Is that possible?

  • Amber
    Amber almost 11 years
    To whomever downvoted this: if you're going to downvote, it's usually a good idea to at least leave a comment explaining why.
  • Amber
    Amber almost 11 years
    This is probably the simplest method, but do note that it will fail for strings that include punctuation next to the counted word.
  • TerryA
    TerryA almost 11 years
    @LennartRegebro Does not mean you should downvote the answer. The answer is correct
  • Amber
    Amber almost 11 years
    @LennartRegebro That's not a useful statement. People who post answers on StackOverflow often want to learn just as much as people who post questions do; useful and actionable feedback is an important part of that.
  • TerryA
    TerryA almost 11 years
    "It's not a good answer" please tell me how I could improve :)
  • Lennart Regebro
    Lennart Regebro almost 11 years
    A "\W" regexp will fail for any foreign words such as café, which is a drawback.
  • Amber
    Amber almost 11 years
    @LennartRegebro I am calm; you seem to think that I'm worked up because I'm disapproving of the manner in which you've been responding, but that's not the case. I simply would like to see more constructive interaction. My original comment simply asked for such constructive commenting; you chose to interpret that as impatient when it was nothing of the sort. Either way, this is the last I'll comment in this particular area; I have no desire to draw this out. Feel free to get the last word if you would like.
  • lost9123193
    lost9123193 almost 11 years
    worked like a charm! Not sure why there's a downvote. Could you explain what exactly's going on or where I could look for this? I've never seen a for loop with an underscore. Thanks!
  • TerryA
    TerryA almost 11 years
    @lost9123193 _ is often used as a placeholder in for loops :). I'm sure Amber could explain it better :p
  • TerryA
    TerryA almost 11 years
    Both of you calm down. Now, could you please explain why you downvoted (in regards to how "It's not a good answer")
  • Lennart Regebro
    Lennart Regebro almost 11 years
    Your sum() implementation is just an inefficient reimplementation of the count() method that already exists on lists. Use .count(word) instead.
  • Amber
    Amber almost 11 years
    @lost9123193 - A _ is simply a dummy variable, a way of saying "I don't actually care about the value here." In this case, I'm using it because we're always summing up 1s for the count; we don't actually care about the match objects returned from re.finditer().
  • Lennart Regebro
    Lennart Regebro almost 11 years
    @Haidro: This answer is not correct, for a useful definition of correct. This is not a maths test where you get points for having the right number in the end.
  • Amber
    Amber almost 11 years
    Also if you're wondering what the re bit is - docs.python.org/2/library/re.html
  • Lennart Regebro
    Lennart Regebro almost 11 years
    But I do apologize for not noticing earlier that the impatient one and the one who posted the answer were different people. If I had realized this, I would have given my explanation immediately. Sorry.
  • Lennart Regebro
    Lennart Regebro almost 11 years
    The downvote was for an earlier incarnation of the answer that was incorrect.
  • Lennart Regebro
    Lennart Regebro almost 11 years
    Oh, hey, the Unicode flag is default in Python 3. So yes. But I found another potential issue, "I'm" will be two words, "I" and "m".
  • grc
    grc almost 11 years
    @LennartRegebro and there's also an issue with hyphenated words.
  • Lennart Regebro
    Lennart Regebro almost 11 years
    @Haidro: As a final statement on this: You might want to hover your mouse over the up and down arrows and notice what they say. But otherwise, by all means, go on correcting people who have been members for ten times as long as you on how Stackoverflow works. :-)
  • Lennart Regebro
    Lennart Regebro almost 11 years
    @grc, If you want to count them as one word, yes. That's a matter of taste, I guess. :-)
  • Lennart Regebro
    Lennart Regebro almost 11 years
    Haha, now this got downvoted for no reason. I suspect childishness. ;-) But I already have over 20k, so I don't mind, downvote on.
  • Lennart Regebro
    Lennart Regebro almost 11 years
    OK, I'm glad to hear that.
  • jamylak
    jamylak almost 11 years
    I like this but you should actually just simplify it to sum(w == word for w in str1.split()). That would be the most pythonic way to do it
  • Lennart Regebro
    Lennart Regebro almost 11 years
    @jamylak: That relies on int(True) being 1, which may be shorter, but is harder to understand than the original. And is still slower than simply calling .count().
  • jamylak
    jamylak almost 11 years
    @LennartRegebro .count is better, I agree, "That relies on int(True) being 1" Did you even read the huge highlighted link or not?
  • Lennart Regebro
    Lennart Regebro almost 11 years
    @jamylak: Yes. So? It still means you have to know this and consider it when reading the code. It makes it harder to understand than the original. Claiming it's the most pythonic way to do it is patent nonsense.
  • RetroCode
    RetroCode over 7 years
    what's <br> for ?
  • Alok Prasad
    Alok Prasad over 3 years
    The only problem with this is that dogs and dog are two different words; your solution gives 1 as output, when ideally it should give 0.
  • Leland Hepworth
    Leland Hepworth almost 3 years
    For information about the difference between re.finditer and re.findall, check out this link: medium.com/geoblinktech/…