Most efficient way to find mode in numpy array
Solution 1
Check scipy.stats.mode()
(inspired by @tom10's comment):
import numpy as np
from scipy import stats
a = np.array([[1, 3, 4, 2, 2, 7],
[5, 2, 2, 1, 4, 1],
[3, 3, 2, 2, 1, 1]])
m = stats.mode(a)
print(m)
Output:
ModeResult(mode=array([[1, 3, 2, 2, 1, 1]]), count=array([[1, 2, 2, 2, 1, 2]]))
As you can see, it returns both the mode as well as the counts. You can select the modes directly via m[0]
:
print(m[0])
Output:
[[1 3 2 2 1 1]]
Solution 2
Update
The scipy.stats.mode
function has been significantly optimized since this post, and would be the recommended method
Old answer
This is a tricky problem, since there is not much out there to calculate mode along an axis. The solution is straight forward for 1-D arrays, where numpy.bincount
is handy, along with numpy.unique
with the return_counts
arg as True
. The most common n-dimensional function I see is scipy.stats.mode, although it is prohibitively slow- especially for large arrays with many unique values. As a solution, I've developed this function, and use it heavily:
import numpy
def mode(ndarray, axis=0):
# Check inputs
ndarray = numpy.asarray(ndarray)
ndim = ndarray.ndim
if ndarray.size == 1:
return (ndarray[0], 1)
elif ndarray.size == 0:
raise Exception('Cannot compute mode on empty array')
try:
axis = range(ndarray.ndim)[axis]
except:
raise Exception('Axis "{}" incompatible with the {}-dimension array'.format(axis, ndim))
# If array is 1-D and numpy version is > 1.9 numpy.unique will suffice
if all([ndim == 1,
int(numpy.__version__.split('.')[0]) >= 1,
int(numpy.__version__.split('.')[1]) >= 9]):
modals, counts = numpy.unique(ndarray, return_counts=True)
index = numpy.argmax(counts)
return modals[index], counts[index]
# Sort array
sort = numpy.sort(ndarray, axis=axis)
# Create array to transpose along the axis and get padding shape
transpose = numpy.roll(numpy.arange(ndim)[::-1], axis)
shape = list(sort.shape)
shape[axis] = 1
# Create a boolean array along strides of unique values
strides = numpy.concatenate([numpy.zeros(shape=shape, dtype='bool'),
numpy.diff(sort, axis=axis) == 0,
numpy.zeros(shape=shape, dtype='bool')],
axis=axis).transpose(transpose).ravel()
# Count the stride lengths
counts = numpy.cumsum(strides)
counts[~strides] = numpy.concatenate([[0], numpy.diff(counts[~strides])])
counts[strides] = 0
# Get shape of padded counts and slice to return to the original shape
shape = numpy.array(sort.shape)
shape[axis] += 1
shape = shape[transpose]
slices = [slice(None)] * ndim
slices[axis] = slice(1, None)
# Reshape and compute final counts
counts = counts.reshape(shape).transpose(transpose)[slices] + 1
# Find maximum counts and return modals/counts
slices = [slice(None, i) for i in sort.shape]
del slices[axis]
index = numpy.ogrid[slices]
index.insert(axis, numpy.argmax(counts, axis=axis))
return sort[index], counts[index]
Result:
In [2]: a = numpy.array([[1, 3, 4, 2, 2, 7],
[5, 2, 2, 1, 4, 1],
[3, 3, 2, 2, 1, 1]])
In [3]: mode(a)
Out[3]: (array([1, 3, 2, 2, 1, 1]), array([1, 2, 2, 2, 1, 2]))
Some benchmarks:
In [4]: import scipy.stats
In [5]: a = numpy.random.randint(1,10,(1000,1000))
In [6]: %timeit scipy.stats.mode(a)
10 loops, best of 3: 41.6 ms per loop
In [7]: %timeit mode(a)
10 loops, best of 3: 46.7 ms per loop
In [8]: a = numpy.random.randint(1,500,(1000,1000))
In [9]: %timeit scipy.stats.mode(a)
1 loops, best of 3: 1.01 s per loop
In [10]: %timeit mode(a)
10 loops, best of 3: 80 ms per loop
In [11]: a = numpy.random.random((200,200))
In [12]: %timeit scipy.stats.mode(a)
1 loops, best of 3: 3.26 s per loop
In [13]: %timeit mode(a)
1000 loops, best of 3: 1.75 ms per loop
EDIT: Provided more of a background and modified the approach to be more memory-efficient
Solution 3
If you want to use numpy only:
x = [-1, 2, 1, 3, 3]
vals,counts = np.unique(x, return_counts=True)
gives
(array([-1, 1, 2, 3]), array([1, 1, 1, 2]))
And extract it:
index = np.argmax(counts)
return vals[index]
Solution 4
A neat solution that only uses numpy
(not scipy
nor the Counter
class):
A = np.array([[1,3,4,2,2,7], [5,2,2,1,4,1], [3,3,2,2,1,1]])
np.apply_along_axis(lambda x: np.bincount(x).argmax(), axis=0, arr=A)
array([1, 3, 2, 2, 1, 1])
Solution 5
Expanding on this method, applied to finding the mode of the data where you may need the index of the actual array to see how far away the value is from the center of the distribution.
(_, idx, counts) = np.unique(a, return_index=True, return_counts=True)
index = idx[np.argmax(counts)]
mode = a[index]
Remember to discard the mode when len(np.argmax(counts)) > 1, also to validate if it is actually representative of the central distribution of your data you may check whether it falls inside your standard deviation interval.
Related videos on Youtube
Nik
I am a grad student studying Computer Science. My interests are in data-driven applications, web technologies and machine learning. I love coding and am a big fan of Python even though I started using Python only a year ago after being a long standing fan of C/C++ (and Matlab, for research).
Updated on July 13, 2022Comments
-
Nik almost 2 years
I have a 2D array containing integers (both positive or negative). Each row represents the values over time for a particular spatial site, whereas each column represents values for various spatial sites for a given time.
So if the array is like:
1 3 4 2 2 7 5 2 2 1 4 1 3 3 2 2 1 1
The result should be
1 3 2 2 2 1
Note that when there are multiple values for mode, any one (selected randomly) may be set as mode.
I can iterate over the columns finding mode one at a time but I was hoping numpy might have some in-built function to do that. Or if there is a trick to find that efficiently without looping.
-
tom10 about 11 yearsThere is docs.scipy.org/doc/scipy/reference/generated/… and the answer here: stackoverflow.com/questions/6252280/…
-
fgb about 11 years@tom10: You mean scipy.stats.mode(), right? The other one seems to output a masked array.
-
tom10 about 11 years@fgb: right, thanks for the correction (and +1 for your answer).
-
-
Nik about 11 yearsSo numpy by itself does not support any such functionality?
-
fgb about 11 yearsApparently not, but scipy's implementation relies only on numpy, so you could just copy that code into your own function.
-
ffledgling almost 11 yearsJust a note, for people who look at this in the future: you need to
import scipy.stats
explicitly, it is not included when you simply do animport scipy
. -
Rahul over 6 yearsCan you please explain how exactly it is displaying the mode values and count ? I couldn't relate the output with the input provided.
-
fgb over 6 years@Osgux: what doesn't work? I just re-run the code in the answer against python 2.7.14, numpy 1.13.3, and scipy 1.0.0 without a problem.
-
fgb over 6 years@Rahul: I'm not sure I follow. The output is a ModeResult object with mode and count arrays displayed above.
-
marti_ over 6 years@fgb sorry It was a problem on my laptop because I have 2 version of python =/
-
fgb over 6 years@Osgux: no worries; glad you could resolve the issue!
-
Rahul over 6 years@fgb: we actually think mode as the frequency of a value in a given set of values? I am not getting how output is being displayed. Let's say for example, if it is 1D-array [1,2,2,4,5] then the mode will be 2. But in 2D-array as in above example its printing "mode=array([[1, 3, 2, 2, 1, 1]])". Why these many "1s or 2s" ? How it derives the output ?
-
fgb over 6 years@Rahul: you have to consider the default second argument of
axis=0
. The above code is reporting the mode per column of the input. The count is telling us how many times it has seen the reported mode in each of the columns. If you wanted the overall mode, you need to specifyaxis=None
. For further info, please refer to docs.scipy.org/doc/scipy/reference/generated/… -
Rahul over 6 years@fgb Thanks for the explanation. Understood it clearly now.
-
Achal Dave over 5 yearsUntil github.com/scipy/scipy/pull/8294,
scipy.stats.mode
was very slow for some cases, to the point that the more generalfind_repeats
can be faster: github.com/scipy/scipy/issues/3035 -
ARF over 5 yearsPlease do contribute it to scipy's stat module so others also could benefit from it.
-
Zeliha Bektas almost 5 yearsSince the question was asked 6 years ago, it is normal that he did not receive much reputation.
-
loganjones16 over 4 yearsWhen does np.argmax ever return something with length greater than 1 if you don't specify an axis?
-
scottlittle over 4 yearsNice and concise, but should be used with caution if the original arrays contain a very large number because bincount will create bin arrays with len( max(A[i]) ) for each original array A[i].
-
CheshireCat about 4 yearsFor higher dimensional problems with big int ndarrays, your solution seems to be still much faster than scipy.stats.mode. I had to compute the mode along the first axis of a 4x250x250x500 ndarray, and your function took 10s, while scipy.stats.mode took almost 600s.
-
Christopher almost 4 yearsThis is an awesome solution. There is actually a drawback in
scipy.stats.mode
. When there are multiple values having the most occurrence (multiple modes), it will throw an expectation. But this method will automatically take the "first mode". -
Christopher almost 4 yearsLike this method because it supports not only integers, but also float and even strings!
-
William Abma over 3 yearsI concur with the comment above. Your function is still faster than scipy's implementation for larger matrices (though the performance I get from scipy is way better than 600s for me).
-
Chris over 2 yearsfor those who want to avoid the debug cycle triggered by the over-OOP'd return type,
scipy.stats.mode(arr).mode[0]
is the answer. That is: the mode is found with the callmode(arr).mode[0]
, but you might have to catchValueError
s for zero length arrays and wrap in your ownmode
function, so you'll have to alias the scipy mode or import stats and call it from there. -
John Henckel about 2 yearsThank you. why is this not the TOP answer?