Python: how to make a histogram with equally *sized* bins
Solution 1
Using your example case (bins of 2 points, 6 total data points):
import numpy as np
from scipy import stats
data = np.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
bin_edges = stats.mstats.mquantiles(data, [0, 2./6, 4./6, 1])
>> array([1. , 1.24666667, 2.05333333, 2.12])
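To generalise this beyond the 2-of-6 example, the quantile levels can be generated from the number of bins. The following is a minimal sketch, not part of the original answer: it uses np.quantile instead of stats.mstats.mquantiles (so the interior edges differ slightly, because the interpolation rule is different), and the helper name equal_count_edges is made up for illustration.

```python
import numpy as np

def equal_count_edges(data, nbins):
    """Return nbins + 1 edges located at equally spaced quantile levels."""
    probs = np.linspace(0.0, 1.0, nbins + 1)  # e.g. [0, 1/3, 2/3, 1] for nbins=3
    return np.quantile(data, probs)

data = np.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
edges = equal_count_edges(data, nbins=3)

# np.histogram counts how many points land between consecutive edges
counts, _ = np.histogram(data, bins=edges)
print(edges)
print(counts)
```

For this data every bin should end up with two points. With many tied values that is no longer guaranteed.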
Solution 2
I would also like to mention the existence of pandas.qcut, which does equi-populated binning in quite an efficient way. In your case it would work something like
import numpy as np
import pandas as pd
data = np.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
# parameter q specifies the number of bins
qc = pd.qcut(data, q=3, precision=1)
# bin definition
bins = qc.categories
print(bins)
>> Index(['[1, 1.3]', '(1.3, 2.03]', '(2.03, 2.1]'], dtype='object')
# bin corresponding to each point in data
codes = qc.codes
print(codes)
>> array([0, 0, 1, 1, 2, 2], dtype=int8)
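If you need the numeric edges rather than the interval labels, pd.qcut also accepts retbins=True and then returns the edges as a second value. A small sketch with the same example data (the variable names are mine):

```python
import numpy as np
import pandas as pd

data = np.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])

# retbins=True makes qcut return (binned data, array of bin edges)
qc, edges = pd.qcut(data, q=3, retbins=True)

codes = qc.codes                          # bin index of each point
counts = np.bincount(codes, minlength=3)  # number of points per bin
print(edges)
print(counts)
```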
Solution 3
Update for skewed distributions:
I came across the same problem as @astabada, wanting to create bins that each contain an equal number of samples. When applying the solution proposed by @aganders3, I found that it didn't work particularly well for skewed distributions. In the case of skewed data (for example, something with a whole lot of zeros), stats.mstats.mquantiles for a predefined number of quantiles will not guarantee an equal number of samples in each bin. You will get bin edges that look like this:
[0. 0. 4. 9.]
In which case the first bin will be empty.
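The effect is easy to reproduce. The sketch below is my own illustration with made-up data, and it uses np.quantile rather than mquantiles, but the repeated edges show up in the same way when a large share of the values are identical.

```python
import numpy as np

# made-up skewed sample: seven zeros and two larger values
data = np.array([0, 0, 0, 0, 0, 0, 0, 4, 9], dtype=float)

edges = np.quantile(data, np.linspace(0.0, 1.0, 4))  # edges for 3 bins
counts, _ = np.histogram(data, bins=edges)

# repeated edges produce zero-width bins that cannot hold any points
has_duplicates = np.unique(edges).size < edges.size
print(edges, counts, has_duplicates)
```

Here all the points pile up in the last bin, because the first two bins have zero width.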
In order to deal with skewed cases, I created a function that calls stats.mstats.mquantiles and then dynamically modifies the number of bins if the samples are not equal within a certain tolerance (30% of the smallest sample size in the example code). If the samples are not equal between bins, the code reduces the number of equally spaced quantiles by 1 and calls stats.mstats.mquantiles again until the sample sizes are equal or only one bin exists.
I hard-coded the tolerance in the example, but this could be changed to a keyword argument if desired.
I also prefer giving the number of equally spaced quantiles as an argument to my function instead of giving user-defined quantiles to stats.mstats.mquantiles, in order to reduce accidental errors (e.g. something like [0., 0.25, 0.7, 1.]).
Here's the code:
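import numpy as np
from scipy import stats

def equibins(dat, binnum, **kwargs):
    # alpha/beta are forwarded to mquantiles as alphap/betap (0.4 is its default for both)
    alpha = kwargs.get('alpha', 0.4)
    beta = kwargs.get('beta', 0.4)
    numin = binnum
    while numin > 1:
        # endpoint=False: the edges are the 0, 1/numin, ..., (numin-1)/numin quantiles,
        # so values above the last edge fall outside the bins
        qtls = np.linspace(0., 1.0, num=numin, endpoint=False)
        ebins = stats.mstats.mquantiles(dat, qtls, alphap=alpha, betap=beta)
        allhist, allbin = np.histogram(dat, bins=ebins)
        # retry with one quantile fewer if edges repeat or the counts are too uneven
        if (np.unique(ebins).shape != ebins.shape or not tolerence(allhist, 0.3)) and numin > 2:
            numin = numin - 1
        else:
            break
    return ebins

def tolerence(narray, percent):
    # accept the tolerance either as a fraction (0.3) or as a percentage (30)
    if percent > 1.0:
        per = percent / 100.
    else:
        per = percent
    lev_tol = per * narray.min()
    # compare every bin count with the first one; abs() so that smaller bins are caught too
    tolerate = np.all(np.abs(narray[1:] - narray[0]) < lev_tol)
    return tolerate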
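A different way to get equal counts on heavily tied data, which is not what the function above does, is to bin by rank instead of by value. This is only a sketch (the name rank_bins is hypothetical), and it has a cost: identical values can end up in different bins.

```python
import numpy as np

def rank_bins(data, nbins):
    """Label each point with a bin index based on its position in sorted order."""
    data = np.asarray(data)
    order = np.argsort(data, kind="stable")  # indices that sort the data
    ranks = np.empty(data.size, dtype=np.int64)
    ranks[order] = np.arange(data.size)      # rank of each original point
    return ranks * nbins // data.size        # map ranks 0..n-1 onto bins 0..nbins-1

data = np.array([0, 0, 0, 0, 0, 0, 0, 4, 9], dtype=float)
labels = rank_bins(data, nbins=3)
print(np.bincount(labels))
```

Whether splitting ties like this is acceptable depends on what the bins are used for afterwards.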
Solution 4
Just sort the data, and divide it into fixed bins by length! Obviously you can never divide into exactly equally populated bins, if the number of samples does not divide exactly by the number of bins.
import math
import numpy as np
data = np.array([2,3,5,6,8,5,5,6,3,2,3,7,8,9,8,6,6,8,9,9,0,7,5,3,3,4,5,6,7])
data_sorted = np.sort(data)
nbins = 3
# chunk length, rounded up so that the last bin holds the remainder
step = math.ceil(len(data_sorted) / nbins)
binned_data = []
for i in range(0, len(data_sorted), step):
    binned_data.append(data_sorted[i:i+step])
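As a variation on the same idea, np.array_split does the chunking for you and always returns exactly the requested number of pieces, with sizes that differ by at most one. A short sketch using the same data:

```python
import numpy as np

data = np.array([2,3,5,6,8,5,5,6,3,2,3,7,8,9,8,6,6,8,9,9,0,7,5,3,3,4,5,6,7])
nbins = 3

# split the sorted values into nbins nearly equal chunks
binned_data = np.array_split(np.sort(data), nbins)
sizes = [len(chunk) for chunk in binned_data]
print(sizes)
```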
astabada
Updated on June 06, 2022
Comments
-
astabada about 2 years
I have a set of data and want to make a histogram of it. I need the bins to have the same size, by which I mean that they must contain the same number of objects, rather than the more common (numpy.histogram) case of equally spaced bins. This will naturally come at the expense of the bin widths, which can, and in general will, be different.
I will specify the number of desired bins and the data set, obtaining the bin edges in return.
Example:
data = numpy.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
bins_edges = somefunc(data, nbins=3)
print(bins_edges)
>> [1.,1.3,2.1,2.12]
So the bins all contain 2 points, but their widths (0.3, 0.8, 0.02) are different.
There are two limitations:
- if a group of data is identical, the bin containing them could be bigger.
- if there are N data and M bins are requested, there will be N/M bins plus one if N%M is not 0.
This piece of code is some cruft I've written, which worked nicely for small data sets. What if I have 10**9+ points and want to speed up the process?
import numpy as np

def def_equbin(in_distr, binsize=None, bin_num=None):
    distr_size = len(in_distr)

    # integer division: bin_size is used as an array shape and in slices
    bin_size = distr_size // bin_num
    odd_bin_size = distr_size % bin_num

    args = in_distr.argsort()

    hist = np.zeros((bin_num, bin_size))

    for i in range(bin_num):
        hist[i, :] = in_distr[args[i * bin_size: (i + 1) * bin_size]]

    if odd_bin_size == 0:
        odd_bin = None
        bins_limits = np.arange(bin_num) * bin_size
        bins_limits = args[bins_limits]
        bins_limits = np.concatenate((in_distr[bins_limits],
                                      [in_distr[args[-1]]]))
    else:
        odd_bin = in_distr[args[bin_num * bin_size:]]
        bins_limits = np.arange(bin_num + 1) * bin_size
        bins_limits = args[bins_limits]
        bins_limits = in_distr[bins_limits]
        bins_limits = np.concatenate((bins_limits, [in_distr[args[-1]]]))

    return (hist, odd_bin, bins_limits)
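On the speed question: a full argsort is not strictly needed if only the edges are wanted. One possibility, offered as a sketch rather than a measured answer, is np.partition, which places only the requested order statistics in their sorted positions. The function name partition_edges is hypothetical, and whether this is actually faster for 10**9 points would have to be benchmarked.

```python
import numpy as np

def partition_edges(data, nbins):
    """Bin edges taken from order statistics, without fully sorting the data."""
    data = np.asarray(data)
    n = data.size
    # positions (in sorted order) where each interior edge sits
    kth = np.arange(1, nbins) * n // nbins
    part = np.partition(data, kth)  # only the kth positions are guaranteed sorted
    return np.concatenate(([data.min()], part[kth], [data.max()]))

data = np.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
edges = partition_edges(data, nbins=3)
print(edges)
```

For the example in the question this gives the edges 1.0, 1.3, 2.1 and 2.12.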
-
aganders3 over 11 years
I may not be understanding this correctly, but it sounds like you will end up with a very boring (e.g. completely flat) histogram this way. Are you just looking to find some quantiles of the data?
-
astabada over 11 years
Hi, you understood it correctly. Because each value is the magnitude of a galaxy, I will then be able to look at how other properties behave in each separate bin!
-
ezod over 11 years
This sounds more like quantiles than a histogram.
-
-
astabada over 11 years
Ah ah, brilliant! I did not know they were called quantiles, so I spent a lot of time googling "equally spaced bins" and similar... Thanks a lot!
-
mrchampe over 11 years
Isn't it just dandy when you learn a new term, then all of a sudden it seems google starts to work again? Happens to me all the time.
-
jimh over 7 years
@mrchampe the dandiest.
-
Darina over 3 years
In recent Pandas versions, instead of qc.categories and qc.codes you need to use qc.cat.categories and qc.cat.codes.
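As far as I can tell, which spelling works depends on what was passed to pd.qcut rather than only on the pandas version: an array input gives back a Categorical, which has .categories and .codes directly, while a Series input gives back a Series, where they sit behind the .cat accessor. A small sketch to check this on your own installation:

```python
import numpy as np
import pandas as pd

data = np.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])

qc_array = pd.qcut(data, q=3)              # array in -> pandas.Categorical out
qc_series = pd.qcut(pd.Series(data), q=3)  # Series in -> categorical Series out

codes_from_array = qc_array.codes                   # attribute on the Categorical
codes_from_series = qc_series.cat.codes.to_numpy()  # same codes via the .cat accessor
print(codes_from_array, codes_from_series)
```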