python: count word tokens in sentence


Solution 1

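The problem with en.split(' ') is that your string contains runs of consecutive spaces, so splitting on a single space yields empty strings that get counted as words. You can fix this by calling en.split() instead: with no argument it splits on any run of whitespace and ignores leading and trailing whitespace.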
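As a small illustration of the difference, here is a sketch using the punctuation-stripped string from the question (the variable names tokens_single_space and tokens_whitespace are just for this example):

```python
# The string from the question after the punctuation has been removed.
en = "i want you  to know  my name  "

# Splitting on a single space keeps an empty string for every extra space.
tokens_single_space = en.split(' ')

# Splitting with no argument collapses runs of whitespace.
tokens_whitespace = en.split()

print(tokens_single_space)
print(len(tokens_single_space))  # 11, because of the empty strings
print(len(tokens_whitespace))    # 7, the actual number of words
```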

Alternatively, you could use a regular expression, which also removes the need to strip the punctuation first:

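import re
line = "i want you , to know , my name . "
# \w+ matches each run of word characters, so punctuation is skipped.
print(len(re.findall(r'\w+', line)))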

See it working online: ideone

Solution 2

Instead of using the regex \w+, it can be faster to count word boundaries with \b (worth measuring on your own data), like so:

import re
_re_word_boundaries = re.compile(r'\b')

def num_words(line):
    return len(_re_word_boundaries.findall(line)) >> 1

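Note that we have to halve the count (>> 1 is an integer division by 2) because \b matches at both the beginning and the end of a word. Unlike egrep, Python's re module has no dedicated anchors (\< and \>) for only the beginning or only the end of a word, although a lookahead such as \b(?=\w) matches at word starts only.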
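Python's re has no dedicated start-of-word anchor, but a lookahead can emulate one, which avoids the halving step. A minimal sketch, where _re_word_start and num_words_lookahead are names made up for this example:

```python
import re

# \b followed by a lookahead for a word character matches only where a
# word begins, never where it ends.
_re_word_start = re.compile(r'\b(?=\w)')

def num_words_lookahead(line):
    # Each match is an empty string at a word start, so no halving is needed.
    return sum(1 for _ in _re_word_start.finditer(line))

print(num_words_lookahead("i want you , to know , my name . "))
```

This counts the same runs of word characters as \w+ does, so it should agree with the other regex-based versions.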

If you have very long lines and are concerned about memory, using an iterator may be a better solution:

def num_words(line):
    return sum(1 for word in _re_word_boundaries.finditer(line)) >> 1
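Whether \b is actually faster than \w+ probably depends on the Python version and the input, so it may be worth timing both. The following is a rough sketch using timeit; the test string, the repeat count, and the helper names count_with_words and count_with_boundaries are arbitrary choices for illustration:

```python
import re
import timeit

# Arbitrary test input: the question's example line repeated 100 times.
line = "i ccc bcc the a of the abc ccc dd on aaa  28 abc 19 " * 100

_re_words = re.compile(r'\w+')
_re_word_boundaries = re.compile(r'\b')

def count_with_words(s):
    # One match per run of word characters.
    return len(_re_words.findall(s))

def count_with_boundaries(s):
    # Two boundary matches per word, hence the halving.
    return len(_re_word_boundaries.findall(s)) >> 1

# Both approaches should agree before their speed is compared.
assert count_with_words(line) == count_with_boundaries(line)

for fn in (count_with_words, count_with_boundaries):
    seconds = timeit.timeit(lambda: fn(line), number=200)
    print(fn.__name__, seconds)
```

No timings are shown here because they vary from machine to machine.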

Solution 3

You can use NLTK:

import nltk
en = "i ccc bcc the a of the abc ccc dd on aaa 28 abc 19 "
print(len(nltk.word_tokenize(en)))

Output:

15

Solution 4

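def main():
    # Ask the user for a sentence and report how many words it contains.
    print("This program tells you how many words are in your sentence.")
    message = input("Enter message: ")  # use raw_input() on Python 2

    # split() with no argument splits on any run of whitespace,
    # so every item it returns is one word.
    wrdcount = 0
    for word in message.split():
        wrdcount += 1
    print(wrdcount)


main()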

Solution 5

The len function returns the length of its argument, which in this case is the string itself: 30 characters. To count words, you need to split the string on whitespace and then count the number of items that are returned, e.g. len(en.split()).
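Putting this together with the punctuation stripping from the question, a minimal sketch might look like the following. It assumes Python 3, where str.maketrans takes the characters to delete as its third argument; the question's string.maketrans call is the Python 2 spelling.

```python
line = "i want you , to know , my name . "

# Assumption: Python 3. The third argument of str.maketrans lists the
# characters to delete.
en = line.translate(str.maketrans('', '', '!,.?'))

print(len(en))          # number of characters in the string: 30
print(len(en.split()))  # number of whitespace-separated words: 7
```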

Author by

Duke

Updated on August 12, 2020

Comments

  • Duke
    Duke over 3 years

    I'm trying to count the number of words in a string. However, I first have to strip some punctuation, e.g.

    line = "i want you , to know , my name . "
    

    running

    en = line.translate(string.maketrans('', ''), '!,.?')
    

    produces

    en = "i want you  to know  my name  "
    

    After this, I want to count the number of words in the line, but when I do len(en) I get 30 instead of 7.

    Using split on en to tokenize and then taking the length doesn't always work. For example, consider this string:

    "i ccc bcc the a of the abc ccc dd on aaa , 28 abc 19 ."
    

    en then becomes:

    "i ccc bcc the a of the abc ccc dd on aaa  28 abc 19 "
    

    but len(en.split(' ')) returns 17 and not 15.

    Can you please help? Thanks.