Efficiently count word frequencies in python

python nlp scikit-learn word-count frequency-distribution

35,768

Solution 1

The most succinct approach is to use the tools Python gives you.

from future_builtins import map  # Only on Python 2

from collections import Counter
from itertools import chain

def countInFile(filename):
    with open(filename) as f:
        return Counter(chain.from_iterable(map(str.split, f)))

That's it. map(str.split, f) is making a generator that returns lists of words from each line. Wrapping in chain.from_iterable converts that to a single generator that produces a word at a time. Counter takes an input iterable and counts all unique values in it. At the end, you return a dict-like object (a Counter) that stores all unique words and their counts, and during creation, you only store a line of data at a time and the total counts, not the whole file at once.

In theory, on Python 2.7 and 3.1, you might do slightly better looping over the chained results yourself and using a dict or collections.defaultdict(int) to count (because Counter is implemented in Python, which can make it slower in some cases), but letting Counter do the work is simpler and more self-documenting (I mean, the whole goal is counting, so use a Counter). Beyond that, on CPython (the reference interpreter) 3.2 and higher Counter has a C level accelerator for counting iterable inputs that will run faster than anything you could write in pure Python.

Update: You seem to want punctuation stripped and case-insensitivity, so here's a variant of my earlier code that does that:

from string import punctuation

def countInFile(filename):
    with open(filename) as f:
        linewords = (line.translate(None, punctuation).lower().split() for line in f)
        return Counter(chain.from_iterable(linewords))

Your code runs much more slowly because it's creating and destroying many small Counter and set objects, rather than .update-ing a single Counter once per line (which, while slightly slower than what I gave in the updated code block, would be at least algorithmically similar in scaling factor).

Solution 2

A memory efficient and accurate way is to make use of

CountVectorizer in scikit (for ngram extraction)
NLTK for word_tokenize
numpy matrix sum to collect the counts
collections.Counter for collecting the counts and vocabulary

An example:

import urllib.request
from collections import Counter

import numpy as np 

from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

# Our sample textfile.
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')


# Note that `ngram_range=(1, 1)` means we want to extract Unigrams, i.e. tokens.
ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
# X matrix where the row represents sentences and column is our one-hot vector for each token in our vocabulary
X = ngram_vectorizer.fit_transform(data.split('\n'))

# Vocabulary
vocab = list(ngram_vectorizer.get_feature_names())

# Column-wise sum of the X matrix.
# It's some crazy numpy syntax that looks horribly unpythonic
# For details, see http://stackoverflow.com/questions/3337301/numpy-matrix-to-array
# and http://stackoverflow.com/questions/13567345/how-to-calculate-the-sum-of-all-columns-of-a-2d-numpy-array-efficiently
counts = X.sum(axis=0).A1

freq_distribution = Counter(dict(zip(vocab, counts)))
print (freq_distribution.most_common(10))

[out]:

[(',', 32000),
 ('.', 17783),
 ('de', 11225),
 ('a', 7197),
 ('que', 5710),
 ('la', 4732),
 ('je', 4304),
 ('se', 4013),
 ('на', 3978),
 ('na', 3834)]

Essentially, you can also do this:

from collections import Counter
import numpy as np 
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

def freq_dist(data):
    """
    :param data: A string with sentences separated by '\n'
    :type data: str
    """
    ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
    X = ngram_vectorizer.fit_transform(data.split('\n'))
    vocab = list(ngram_vectorizer.get_feature_names())
    counts = X.sum(axis=0).A1
    return Counter(dict(zip(vocab, counts)))

Let's timeit:

import time

start = time.time()
word_distribution = freq_dist(data)
print (time.time() - start)

[out]:

5.257147789001465

Note that CountVectorizer can also take a file instead of a string and there's no need to read the whole file into memory. In code:

import io
from collections import Counter

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

infile = '/path/to/input.txt'

ngram_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1), min_df=1)

with io.open(infile, 'r', encoding='utf8') as fin:
    X = ngram_vectorizer.fit_transform(fin)
    vocab = ngram_vectorizer.get_feature_names()
    counts = X.sum(axis=0).A1
    freq_distribution = Counter(dict(zip(vocab, counts)))
    print (freq_distribution.most_common(10))

Solution 3

Here's some benchmark. It'll look strange but the crudest code wins.

[code]:

from collections import Counter, defaultdict
import io, time

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

infile = '/path/to/file'

def extract_dictionary_sklearn(file_path):
    with io.open(file_path, 'r', encoding='utf8') as fin:
        ngram_vectorizer = CountVectorizer(analyzer='word')
        X = ngram_vectorizer.fit_transform(fin)
        vocab = ngram_vectorizer.get_feature_names()
        counts = X.sum(axis=0).A1
    return Counter(dict(zip(vocab, counts)))

def extract_dictionary_native(file_path):
    dictionary = Counter()
    with io.open(file_path, 'r', encoding='utf8') as fin:
        for line in fin:
            dictionary.update(line.split())
    return dictionary

def extract_dictionary_paddle(file_path):
    dictionary = defaultdict(int)
    with io.open(file_path, 'r', encoding='utf8') as fin:
        for line in fin:
            for words in line.split():
                dictionary[word] +=1
    return dictionary

start = time.time()
extract_dictionary_sklearn(infile)
print time.time() - start

start = time.time()
extract_dictionary_native(infile)
print time.time() - start

start = time.time()
extract_dictionary_paddle(infile)
print time.time() - start

[out]:

38.306814909
24.8241138458
12.1182529926

Data size (154MB) used in the benchmark above:

$ wc -c /path/to/file
161680851

$ wc -l /path/to/file
2176141

Some things to note:

With the sklearn version, there's an overhead of vectorizer creation + numpy manipulation and conversion into a Counter object
Then native Counter update version, it seems like Counter.update() is an expensive operation

Solution 4

This should suffice.

def countinfile(filename):
    d = {}
    with open(filename, "r") as fin:
        for line in fin:
            words = line.strip().split()
            for word in words:
                try:
                    d[word] += 1
                except KeyError:
                    d[word] = 1
    return d

Solution 5

Instead of decoding the whole bytes read from the url, I process the binary data. Because bytes.translate expects its second argument to be a byte string, I utf-8 encode punctuation. After removing punctuations, I utf-8 decode the byte string.

The function freq_dist expects an iterable. That's why I've passed data.splitlines().

from urllib2 import urlopen
from collections import Counter
from string import punctuation
from time import time
import sys
from pprint import pprint

url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'

data = urlopen(url).read()

def freq_dist(data):
    """
    :param data: file-like object opened in binary mode or
                 sequence of byte strings separated by '\n'
    :type data: an iterable sequence
    """
    #For readability   
    #return Counter(word for line in data
    #    for word in line.translate(
    #    None,bytes(punctuation.encode('utf-8'))).decode('utf-8').split())

    punc = punctuation.encode('utf-8')
    words = (word for line in data for word in line.translate(None, punc).decode('utf-8').split())
    return Counter(words)


start = time()
word_dist = freq_dist(data.splitlines())
print('elapsed: {}'.format(time() - start))
pprint(word_dist.most_common(10))

Output;

elapsed: 0.806480884552

[(u'de', 11106),
 (u'a', 6742),
 (u'que', 5701),
 (u'la', 4319),
 (u'je', 4260),
 (u'se', 3938),
 (u'\u043d\u0430', 3929),
 (u'na', 3623),
 (u'da', 3534),
 (u'i', 3487)]

It seems dict is more efficient than Counter object.

def freq_dist(data):
    """
    :param data: A string with sentences separated by '\n'
    :type data: str
    """
    d = {}
    punc = punctuation.encode('utf-8')
    words = (word for line in data for word in line.translate(None, punc).decode('utf-8').split())
    for word in words:
        d[word] = d.get(word, 0) + 1
    return d

start = time()
word_dist = freq_dist(data.splitlines())
print('elapsed: {}'.format(time() - start))
pprint(sorted(word_dist.items(), key=lambda x: (x[1], x[0]), reverse=True)[:10])

Output;

elapsed: 0.642680168152

[(u'de', 11106),
 (u'a', 6742),
 (u'que', 5701),
 (u'la', 4319),
 (u'je', 4260),
 (u'se', 3938),
 (u'\u043d\u0430', 3929),
 (u'na', 3623),
 (u'da', 3534),
 (u'i', 3487)]

To be more memory efficient when opening huge file, you have to pass just the opened url. But the timing will include file download time too.

data = urlopen(url)
word_dist = freq_dist(data)

View more solutions

35,768

Author by

Light Yagmi

I am a researcher in the interdisciplinary area of physics and computer science, although my previous interests were focused on machine learning and Bayesian inference. I've recently been promoted to associate professor, but I don't like to do menial tasks, so I'll probably have to move to some other position to have enough time to my research :P

Updated on January 26, 2020

Comments

Light Yagmi over 4 years
I'd like to count frequencies of all words in a text file.
```
>>> countInFile('test.txt')
```
should return {'aaa':1, 'bbb': 2, 'ccc':1} if the target text file is like:
```
# test.txt
aaa bbb ccc
bbb
```
I've implemented it with pure python following some posts. However, I've found out pure-python ways are insufficient due to huge file size (> 1GB).

I think borrowing sklearn's power is a candidate.

If you let CountVectorizer count frequencies for each line, I guess you will get word frequencies by summing up each column. But, it sounds a bit indirect way.

What is the most efficient and straightforward way to count words in a file with python?

Update

My (very slow) code is here:
```
from collections import Counter

def get_term_frequency_in_file(source_file_path):
    wordcount = {}
    with open(source_file_path) as f:
        for line in f:
            line = line.lower().translate(None, string.punctuation)
            this_wordcount = Counter(line.split())
            wordcount = add_merge_two_dict(wordcount, this_wordcount)
    return wordcount

def add_merge_two_dict(x, y):
    return { k: x.get(k, 0) + y.get(k, 0) for k in set(x) | set(y) }
```
ShadowRanger about 8 years

FYI, no need to strip() before split-ing when the split isn't given arguments; no arg split already ignores leading and trailing whitespace.
mhawke about 8 years

I've found that (in C-Python) defaultdict(int) is faster than Counter in Python 2, but the other way around in Python 3. BTW, this is an excellent answer. Whatever happened to up-voting on this site?
Light Yagmi about 8 years

Thank you @ShadowRanger. Your code works perfectly! But, please see my previous code in updated question. I've also used Counter. What's wrong in my code?
ShadowRanger about 8 years

@rkjt50r983: Well, among other things, creating many Counters and combining them is a lot more costly than creating one; if you don't like the overly concise code I provided, I'd still suggest creating a single Counter and calling .update on it with the words from each line, which increases the counts for the single Counter in place, rather than creating whole new Counters and combined dicts at each step.
ShadowRanger about 8 years

@mattsap: str.split with no arguments splits on runs of whitespace, and does not return empty groups at the ends when the string begins or ends with whitespace, making it effectively a strip followed by split on runs of whitespace. Also, side-note, the ordering on Windows is \r\n, not \n\r, though str.rstrip is order insensitive, so either order works when you're trying to strip newlines (but no other whitespace) from an input line.
ShadowRanger over 6 years

@mhawke: Late update: I went and checked; as of Python 3.2, Counter has a C-accelerated helper function for updating itself by counting an input iterable, which is almost certainly responsible for the speedup relative to defaultdict(int). I hadn't noticed it because at the time I wrote this, I was looking at 2.7 code (the OP's code uses a 2.x version of str.translate). Nice to know you don't sacrifice any speed for Counter's convenience anymore.