counting the unique items in a numpy array: why is scipy.stats.itemfreq so slow?
Solution 1
If you can use numpy 1.9, you can use numpy.unique
with the argument return_counts=True
. I.e.
unique_items, counts = np.unique(x, return_counts=True)
In fact, itemfreq
was updated to use np.unique
, but scipy currently supports numpy versions back to 1.5, so it doesn't use the return_counts
argument.
Here's the complete implementation of itemfreq
in scipy 0.14:
def itemfreq(a):
items, inv = np.unique(a, return_inverse=True)
freq = np.bincount(inv)
return np.array([items, freq]).T
Solution 2
First, time.time
is the wrong function to use when timing, as it measures wall-clock time, not cpu time (see this question). Ideally you would use the timeit
module, but time.clock
is also better.
Also, it seems that you might be using an outdated version of scipy. I'm using Python 3.4 and scipy 0.14.0 and these are my timings:
x = np.tile([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 20000)
for i in [44, 22, 300, 403, 777, 1009, 800]:
x[i] = 11
%timeit getCounts(x)
# 10 loops, best of 3: 55.6 ms per loop
%timeit scipy.stats.itemfreq(x)
# 10 loops, best of 3: 20.8 ms per loop
%timeit collections.Counter(x)
# 10 loops, best of 3: 39.9 ms per loop
%timeit np.unique(x, return_counts=True)
# 100 loops, best of 3: 4.13 ms per loop
Solution 3
For the lazy:
import pandas as pd
pd.Series( my_list_or_array ).nunique()
Jason S
Updated on July 27, 2022Comments
-
Jason S almost 2 years
I'm trying to count the unique values in a numpy array.
import numpy as np from collections import defaultdict import scipy.stats import time x = np.tile([1,2,3,4,5,6,7,8,9,10],20000) for i in [44,22,300,403,777,1009,800]: x[i] = 11 def getCounts(x): counts = defaultdict(int) for item in x: counts[item] += 1 return counts flist = [getCounts, scipy.stats.itemfreq] for f in flist: print f t1 = time.time() y = f(x) t2 = time.time() print y print '%.5f sec' % (t2-t1)
I couldn't find a builtin function at first to do this, so I wrote
getCounts()
; then I foundscipy.stats.itemfreq
so thought I would use that instead. But it's slow! Here's what I get on my PC. Why is it so slow compared to such a simple handwritten function?<function getCounts at 0x0000000013C78438> defaultdict(<type 'int'>, {1: 19998, 2: 20000, 3: 19999, 4: 19999, 5: 19999, 6: 20000, 7: 20000, 8: 19999, 9: 20000, 10: 19999, 11: 7}) 0.04700 sec <function itemfreq at 0x0000000013C5D208> [[ 1.00000000e+00 1.99980000e+04] [ 2.00000000e+00 2.00000000e+04] [ 3.00000000e+00 1.99990000e+04] [ 4.00000000e+00 1.99990000e+04] [ 5.00000000e+00 1.99990000e+04] [ 6.00000000e+00 2.00000000e+04] [ 7.00000000e+00 2.00000000e+04] [ 8.00000000e+00 1.99990000e+04] [ 9.00000000e+00 2.00000000e+04] [ 1.00000000e+01 1.99990000e+04] [ 1.10000000e+01 7.00000000e+00]] 2.04100 sec
-
Roger Fan over 9 yearsI never realized the speedup the numpy function has. Makes you wonder why it isn't used in the scipy function.
-
Jason S over 9 yearsActually I can just use
np.bincount
in my application since I have a large array of small integers.