Efficiently count word frequencies in python
Solution 1
The most succinct approach is to use the tools Python gives you.
from future_builtins import map  # Only on Python 2
from collections import Counter
from itertools import chain

def countInFile(filename):
    with open(filename) as f:
        return Counter(chain.from_iterable(map(str.split, f)))
That's it. map(str.split, f) is making a generator that returns lists of words from each line. Wrapping that in chain.from_iterable converts it to a single generator that produces a word at a time. Counter takes an input iterable and counts all unique values in it. At the end, you return a dict-like object (a Counter) that stores all unique words and their counts, and during creation, you only store a line of data at a time and the running totals, not the whole file at once.
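To see the pipeline in isolation, here's a small sketch (Python 3) where a plain list of strings stands in for the open file object, since iterating a file also yields one line at a time:
from collections import Counter
from itertools import chain

lines = ["aaa bbb ccc\n", "bbb\n"]  # stands in for iterating over an open file

per_line_words = map(str.split, lines)           # lazily yields ['aaa', 'bbb', 'ccc'], then ['bbb']
all_words = chain.from_iterable(per_line_words)  # yields 'aaa', 'bbb', 'ccc', 'bbb' one word at a time
print(Counter(all_words))                        # Counter({'bbb': 2, 'aaa': 1, 'ccc': 1})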
In theory, on Python 2.7 and 3.1, you might do slightly better looping over the chained results yourself and using a dict or collections.defaultdict(int) to count (because Counter is implemented in Python, which can make it slower in some cases), but letting Counter do the work is simpler and more self-documenting (I mean, the whole goal is counting, so use a Counter). Beyond that, on CPython (the reference interpreter) 3.2 and higher, Counter has a C-level accelerator for counting iterable inputs that will run faster than anything you could write in pure Python.
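For comparison, the manual defaultdict(int) loop mentioned above would look roughly like this (a sketch; the function name is just illustrative):
from collections import defaultdict
from itertools import chain

def count_in_file_defaultdict(filename):
    counts = defaultdict(int)
    with open(filename) as f:
        # Iterate word by word, adding 1 per occurrence.
        for word in chain.from_iterable(map(str.split, f)):
            counts[word] += 1
    return counts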
Update: You seem to want punctuation stripped and case-insensitivity, so here's a variant of my earlier code that does that:
from string import punctuation

def countInFile(filename):
    with open(filename) as f:
        # Note: the two-argument str.translate(None, deletechars) form only exists on Python 2;
        # on Python 3 you'd use line.translate(str.maketrans('', '', punctuation)) instead.
        linewords = (line.translate(None, punctuation).lower().split() for line in f)
        return Counter(chain.from_iterable(linewords))
Your code runs much more slowly because it's creating and destroying many small Counter and set objects, rather than .update-ing a single Counter once per line (which, while slightly slower than what I gave in the updated code block, would be at least algorithmically similar in scaling factor).
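If you want that middle-ground version spelled out, here is a sketch of the single-Counter .update approach (the function name is illustrative):
from collections import Counter

def count_in_file_updating(filename):
    counts = Counter()
    with open(filename) as f:
        for line in f:
            # One .update call per line, reusing the same Counter throughout.
            counts.update(line.split())
    return counts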
Solution 2
A memory efficient and accurate way is to make use of
- CountVectorizer in scikit-learn (for ngram extraction)
- NLTK for word_tokenize
- numpy matrix sum to collect the counts
- collections.Counter for collecting the counts and vocabulary
An example:
import urllib.request
from collections import Counter
import numpy as np
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
# Our sample textfile.
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')
# Note that `ngram_range=(1, 1)` means we want to extract Unigrams, i.e. tokens.
ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
# X matrix where the row represents sentences and column is our one-hot vector for each token in our vocabulary
X = ngram_vectorizer.fit_transform(data.split('\n'))
# Vocabulary
vocab = list(ngram_vectorizer.get_feature_names())
# Column-wise sum of the X matrix.
# It's some crazy numpy syntax that looks horribly unpythonic
# For details, see http://stackoverflow.com/questions/3337301/numpy-matrix-to-array
# and http://stackoverflow.com/questions/13567345/how-to-calculate-the-sum-of-all-columns-of-a-2d-numpy-array-efficiently
counts = X.sum(axis=0).A1
freq_distribution = Counter(dict(zip(vocab, counts)))
print (freq_distribution.most_common(10))
[out]:
[(',', 32000),
('.', 17783),
('de', 11225),
('a', 7197),
('que', 5710),
('la', 4732),
('je', 4304),
('se', 4013),
('на', 3978),
('na', 3834)]
Essentially, you can also do this:
from collections import Counter
import numpy as np
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
def freq_dist(data):
    """
    :param data: A string with sentences separated by '\n'
    :type data: str
    """
    ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
    X = ngram_vectorizer.fit_transform(data.split('\n'))
    vocab = list(ngram_vectorizer.get_feature_names())
    counts = X.sum(axis=0).A1
    return Counter(dict(zip(vocab, counts)))
Let's time it:
import time
start = time.time()
word_distribution = freq_dist(data)
print (time.time() - start)
[out]:
5.257147789001465
Note that CountVectorizer can also take a file object instead of a string, and then there's no need to read the whole file into memory. In code:
import io
from collections import Counter
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

infile = '/path/to/input.txt'

ngram_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1), min_df=1)
with io.open(infile, 'r', encoding='utf8') as fin:
    X = ngram_vectorizer.fit_transform(fin)

vocab = ngram_vectorizer.get_feature_names()
counts = X.sum(axis=0).A1
freq_distribution = Counter(dict(zip(vocab, counts)))
print(freq_distribution.most_common(10))
Solution 3
Here's a benchmark. It may look strange, but the crudest code wins.
[code]:
from collections import Counter, defaultdict
import io, time
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

infile = '/path/to/file'

def extract_dictionary_sklearn(file_path):
    with io.open(file_path, 'r', encoding='utf8') as fin:
        ngram_vectorizer = CountVectorizer(analyzer='word')
        X = ngram_vectorizer.fit_transform(fin)
        vocab = ngram_vectorizer.get_feature_names()
        counts = X.sum(axis=0).A1
    return Counter(dict(zip(vocab, counts)))

def extract_dictionary_native(file_path):
    dictionary = Counter()
    with io.open(file_path, 'r', encoding='utf8') as fin:
        for line in fin:
            dictionary.update(line.split())
    return dictionary

def extract_dictionary_paddle(file_path):
    dictionary = defaultdict(int)
    with io.open(file_path, 'r', encoding='utf8') as fin:
        for line in fin:
            for word in line.split():
                dictionary[word] += 1
    return dictionary

start = time.time()
extract_dictionary_sklearn(infile)
print(time.time() - start)

start = time.time()
extract_dictionary_native(infile)
print(time.time() - start)

start = time.time()
extract_dictionary_paddle(infile)
print(time.time() - start)
[out]:
38.306814909
24.8241138458
12.1182529926
Data size (154MB) used in the benchmark above:
$ wc -c /path/to/file
161680851
$ wc -l /path/to/file
2176141
Some things to note:
- With the sklearn version, there is the overhead of vectorizer creation plus numpy manipulation and conversion into a Counter object
- With the native Counter update version, it seems like Counter.update() is an expensive operation
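If you want to verify that Counter.update() cost on your own data, a rough micro-benchmark sketch along these lines can be used (the file path and function names are just placeholders):
import io, time
from collections import Counter

def count_with_update(file_path):
    counts = Counter()
    with io.open(file_path, 'r', encoding='utf8') as fin:
        for line in fin:
            counts.update(line.split())
    return counts

def count_with_plain_dict(file_path):
    counts = {}
    with io.open(file_path, 'r', encoding='utf8') as fin:
        for line in fin:
            for word in line.split():
                counts[word] = counts.get(word, 0) + 1
    return counts

for func in (count_with_update, count_with_plain_dict):
    start = time.time()
    func('/path/to/file')
    print(func.__name__, time.time() - start)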
Solution 4
This should suffice.
def countinfile(filename):
    d = {}
    with open(filename, "r") as fin:
        for line in fin:
            words = line.strip().split()
            for word in words:
                try:
                    d[word] += 1
                except KeyError:
                    d[word] = 1
    return d
Solution 5
Instead of decoding the whole byte string read from the url, I process the binary data directly. Because bytes.translate expects its second argument to be a byte string, I utf-8 encode punctuation. After removing punctuation, I utf-8 decode the byte string. The function freq_dist expects an iterable, which is why I've passed data.splitlines().
from urllib2 import urlopen
from collections import Counter
from string import punctuation
from time import time
import sys
from pprint import pprint

url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
data = urlopen(url).read()

def freq_dist(data):
    """
    :param data: file-like object opened in binary mode or
                 sequence of byte strings separated by '\n'
    :type data: an iterable sequence
    """
    # For readability:
    # return Counter(word for line in data
    #                for word in line.translate(
    #                    None, bytes(punctuation.encode('utf-8'))).decode('utf-8').split())
    punc = punctuation.encode('utf-8')
    words = (word for line in data for word in line.translate(None, punc).decode('utf-8').split())
    return Counter(words)

start = time()
word_dist = freq_dist(data.splitlines())
print('elapsed: {}'.format(time() - start))
pprint(word_dist.most_common(10))
Output:
elapsed: 0.806480884552
[(u'de', 11106),
(u'a', 6742),
(u'que', 5701),
(u'la', 4319),
(u'je', 4260),
(u'se', 3938),
(u'\u043d\u0430', 3929),
(u'na', 3623),
(u'da', 3534),
(u'i', 3487)]
It seems a plain dict is more efficient than a Counter object here.
def freq_dist(data):
    """
    :param data: file-like object opened in binary mode or
                 sequence of byte strings separated by '\n'
    :type data: an iterable sequence
    """
    d = {}
    punc = punctuation.encode('utf-8')
    words = (word for line in data for word in line.translate(None, punc).decode('utf-8').split())
    for word in words:
        d[word] = d.get(word, 0) + 1
    return d
start = time()
word_dist = freq_dist(data.splitlines())
print('elapsed: {}'.format(time() - start))
pprint(sorted(word_dist.items(), key=lambda x: (x[1], x[0]), reverse=True)[:10])
Output:
elapsed: 0.642680168152
[(u'de', 11106),
(u'a', 6742),
(u'que', 5701),
(u'la', 4319),
(u'je', 4260),
(u'se', 3938),
(u'\u043d\u0430', 3929),
(u'na', 3623),
(u'da', 3534),
(u'i', 3487)]
To be more memory efficient when processing a huge file, you can pass just the opened url, but then the timing will include the file download time too.
data = urlopen(url)
word_dist = freq_dist(data)
Updated on January 26, 2020

Comments
- Light Yagmi, over 4 years: I'd like to count frequencies of all words in a text file.
  >>> countInFile('test.txt')
  should return {'aaa': 1, 'bbb': 2, 'ccc': 1} if the target text file is like:
  # test.txt
  aaa bbb ccc bbb
  I've implemented it in pure Python following some posts. However, I've found that pure-Python approaches are insufficient due to the huge file size (> 1GB). I think borrowing sklearn's power is a candidate. If you let CountVectorizer count frequencies for each line, I guess you will get word frequencies by summing up each column, but that sounds like a somewhat indirect way. What is the most efficient and straightforward way to count words in a file with Python?
  Update: my (very slow) code is here:
  from collections import Counter
  import string

  def get_term_frequency_in_file(source_file_path):
      wordcount = {}
      with open(source_file_path) as f:
          for line in f:
              line = line.lower().translate(None, string.punctuation)
              this_wordcount = Counter(line.split())
              wordcount = add_merge_two_dict(wordcount, this_wordcount)
      return wordcount

  def add_merge_two_dict(x, y):
      return {k: x.get(k, 0) + y.get(k, 0) for k in set(x) | set(y)}
- ShadowRanger, about 8 years: FYI, no need to strip() before split-ing when the split isn't given arguments; no-arg split already ignores leading and trailing whitespace.
- mhawke, about 8 years: I've found that (in C-Python) defaultdict(int) is faster than Counter in Python 2, but the other way around in Python 3. BTW, this is an excellent answer. Whatever happened to up-voting on this site?
- Light Yagmi, about 8 years: Thank you @ShadowRanger. Your code works perfectly! But please see my previous code in the updated question. I've also used Counter. What's wrong in my code?
- ShadowRanger, about 8 years: @rkjt50r983: Well, among other things, creating many Counters and combining them is a lot more costly than creating one; if you don't like the overly concise code I provided, I'd still suggest creating a single Counter and calling .update on it with the words from each line, which increases the counts for the single Counter in place, rather than creating whole new Counters and combined dicts at each step.
- ShadowRanger, about 8 years: @mattsap: str.split with no arguments splits on runs of whitespace, and does not return empty groups at the ends when the string begins or ends with whitespace, making it effectively a strip followed by a split on runs of whitespace. Also, as a side note, the ordering on Windows is \r\n, not \n\r, though str.rstrip is order insensitive, so either order works when you're trying to strip newlines (but no other whitespace) from an input line.
- ShadowRanger, over 6 years: @mhawke: Late update: I went and checked; as of Python 3.2, Counter has a C-accelerated helper function for updating itself by counting an input iterable, which is almost certainly responsible for the speedup relative to defaultdict(int). I hadn't noticed it because at the time I wrote this, I was looking at 2.7 code (the OP's code uses a 2.x version of str.translate). Nice to know you don't sacrifice any speed for Counter's convenience anymore.