Efficiently calculate word frequency in a string

Solution 1

Use collections.Counter:

>>> from collections import Counter
>>> test = 'abc def abc def zzz zzz'
>>> Counter(test.split()).most_common()
[('abc', 2), ('def', 2), ('zzz', 2)]
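
If only the words themselves are needed, ordered from most to least frequent, the (word, count) pairs returned by most_common() can be unpacked. A minimal sketch; the helper name words_by_frequency is made up for illustration and is not part of the standard library:

```python
from collections import Counter

def words_by_frequency(text):
    """Return the distinct words of `text`, most frequent first.

    Hypothetical helper for illustration only.
    """
    counts = Counter(text.split())
    # most_common() yields (word, count) pairs sorted by descending count
    return [word for word, _ in counts.most_common()]

print(words_by_frequency("the a a abc a the"))  # ['a', 'the', 'abc']
```

Words with equal counts come back in the order they were first seen on Python 3.7 and later.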

Solution 2

>>> test = """abc def-ghi jkl abc
... abc"""
>>> from collections import Counter
>>> words = Counter()
>>> words.update(test.split()) # Update counter with words
>>> words.most_common()        # Print list with most common to least common
[('abc', 3), ('def-ghi', 1), ('jkl', 1)]
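
Note that split() only breaks on whitespace, so def-ghi is counted as a single word and case is preserved. If punctuation should separate words, one option is to tokenize with a regular expression before counting. A sketch, assuming a word is a run of letters, digits, or underscores and that case should be ignored; the function name count_words is illustrative:

```python
import re
from collections import Counter

def count_words(text):
    # Assumption: a "word" is a maximal run of \w characters,
    # compared case-insensitively
    return Counter(re.findall(r"\w+", text.lower()))

test = """abc def-ghi jkl abc
abc"""
print(count_words(test).most_common(1))  # [('abc', 3)]
```

Whether def-ghi should count as one word or two depends on the text being analysed, so the pattern may need adjusting.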

Solution 3

You can also use NLTK (Natural Language Toolkit). It provides very nice libraries for studying and processing text. For this example you can use:

from nltk import FreqDist

text = "aa bb cc aa bb"
# FreqDist needs a list of tokens; a raw string would be counted per character
fdist1 = FreqDist(text.split())

# show the 10 most frequent words in the text
print(fdist1.most_common(10))

The result will be:

[('aa', 2), ('bb', 2), ('cc', 1)]
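
FreqDist counts the items of whatever iterable it receives, so a raw string is tallied character by character, while a list of tokens is tallied word by word. The sketch below uses collections.Counter as a stand-in (in NLTK 3, FreqDist is a Counter subclass) so that it runs without NLTK installed:

```python
from collections import Counter

text = "aa bb cc aa bb"

by_char = Counter(text)          # iterating a str yields single characters
by_word = Counter(text.split())  # iterating a list yields whole words

print(by_char.most_common(2))    # counts of individual characters, spaces included
print(by_word.most_common(10))   # [('aa', 2), ('bb', 2), ('cc', 1)]
```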
Author by

sazr

Updated on March 25, 2021

Comments

  • sazr about 3 years

    I am parsing a long string of text in Python and counting the number of times each word occurs. I have a function that works, but I am looking for advice on whether it can be made more efficient (in terms of speed), and whether there are Python library functions that could do this for me so I'm not reinventing the wheel.

    Can you suggest a more efficient way to calculate the most common words in a long string (usually over 1000 words)?

    Also, what's the best way to sort the dictionary into a list where the 1st element is the most common word, the 2nd element is the 2nd most common word, and so on?

    test = """abc def-ghi jkl abc
    abc"""
    
    def calculate_word_frequency(s):
        # Post: return a list of words ordered from the most
        # frequent to the least frequent
    
        words = s.split()
        freq  = {}
        for word in words:
            if word in freq:
                freq[word] += 1
            else:
                freq[word] = 1
        return sort(freq)
    
    def sort(d):
        # Post: sort dictionary d into list of words ordered
        # from highest freq to lowest freq
        # eg: For {"the": 3, "a": 9, "abc": 2} should be
        # sorted into the following list ["a","the","abc"]
    
        #I have never used lambda's so I'm not sure this is correct
        return sorted(d, key=lambda word: d[word], reverse=True)
    
    print(calculate_word_frequency(test))