Python: how to make a histogram with equally sized bins

python histogram spacing binning

14,616

Solution 1

Using your example case (bins of 2 points, 6 total data points):

import numpy as np
from scipy import stats
data = np.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
bin_edges = stats.mstats.mquantiles(data, [0, 2./6, 4./6, 1])
>> array([1. , 1.24666667, 2.05333333, 2.12])
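
For an arbitrary number of bins, the same idea can be sketched with NumPy alone. This is a minimal sketch, not part of the original answer: the helper name `equal_count_edges` is hypothetical, and `np.quantile` interpolates linearly by default, so the interior edges differ slightly from the mquantiles output above while still splitting this data set into equally populated bins.

```python
import numpy as np

def equal_count_edges(data, nbins):
    """Return nbins + 1 edges so each bin holds roughly the same number of points.

    Hypothetical helper for illustration; uses np.quantile rather than mquantiles.
    """
    levels = np.linspace(0.0, 1.0, nbins + 1)  # 0, 1/nbins, ..., 1
    return np.quantile(np.asarray(data, dtype=float), levels)

data = np.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
edges = equal_count_edges(data, nbins=3)

# Count the points that fall between consecutive edges.
counts, _ = np.histogram(data, bins=edges)
print(edges)
print(counts)
```

Passing the number of bins rather than the quantile levels keeps the levels equally spaced by construction.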

Solution 2

It is also worth mentioning pandas.qcut, which does equal-population binning quite efficiently. In your case it would work something like this:

import numpy as np
import pandas as pd

data = np.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
# parameter q specifies the number of bins
qc = pd.qcut(data, q=3, precision=1)

# bin definition (newer pandas versions print these as an IntervalIndex)
bins  = qc.categories
print(bins)
>> Index(['[1, 1.3]', '(1.3, 2.03]', '(2.03, 2.1]'], dtype='object')

# bin corresponding to each point in data
codes = qc.codes
print(codes)
>> array([0, 0, 1, 1, 2, 2], dtype=int8)
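
If numeric bin edges are wanted rather than interval labels, `pd.qcut` can return them directly. The following is a minimal sketch, not part of the original answer: it assumes the `labels=False` and `retbins=True` options of `pd.qcut`, which return integer bin codes and the array of edges.

```python
import numpy as np
import pandas as pd

data = np.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])

# labels=False -> integer bin index per point; retbins=True -> also return the edges
codes, edges = pd.qcut(data, q=3, labels=False, retbins=True)

print(edges)   # q + 1 edges, from the minimum to the maximum of the data
print(codes)   # bin index of each point in data
```

The returned edges can be passed to `np.histogram` or `np.digitize` like any other edge array.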

Solution 3

Update for skewed distributions:

I came across the same problem as @astabada: I wanted bins that each contain an equal number of samples. When I applied the solution proposed by @aganders3, I found that it didn't work particularly well for skewed distributions. For skewed data (for example, data with a large number of zeros), calling stats.mstats.mquantiles with a predefined number of quantiles does not guarantee an equal number of samples in each bin. You can get bin edges that look like this:

[0. 0. 4. 9.]

Here the first two edges coincide, so the first bin will be empty.
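
The effect is easy to reproduce. The sketch below uses a made-up sample and `np.quantile` as a stand-in for stats.mstats.mquantiles (both are assumptions for illustration); the interpolation rules differ, but repeated values collapse the edges in the same way.

```python
import numpy as np

# Hypothetical skewed sample: eight zeros and two positive values.
skewed = np.array([0, 0, 0, 0, 0, 0, 0, 0, 4, 9], dtype=float)

# Four equally spaced quantile levels -> three bins requested.
edges = np.quantile(skewed, np.linspace(0.0, 1.0, 4))
counts, _ = np.histogram(skewed, bins=edges)

# Repeated edges are the warning sign: a bin between equal edges is empty.
has_repeated_edges = np.unique(edges).size < edges.size
print(edges, counts, has_repeated_edges)
```

Checking for repeated edges, as done here with `np.unique`, is the same test the function below relies on.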

To deal with skewed cases, I wrote a function that calls stats.mstats.mquantiles and then reduces the number of bins whenever the per-bin sample counts differ by more than a given tolerance (30% of the smallest bin count in the example code). When the counts are too uneven, the code reduces the number of equally spaced quantiles by 1 and calls stats.mstats.mquantiles again, until the counts are even enough or only one bin remains.

I hard-coded the tolerance in the example, but it could be made a keyword argument if desired.

I also prefer passing the number of equally spaced quantiles to my function, rather than passing user-defined quantiles to stats.mstats.mquantiles, in order to reduce accidental errors (e.g. something like [0., 0.25, 0.7, 1.]).

Here's the code:

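import numpy as np
from scipy import stats

def equibins(dat, binnum, **kwargs):
    # binnum is the number of equally spaced quantile levels (bin edges),
    # so the result has at most binnum - 1 bins
    numin = binnum
    while numin > 1:
        # include the endpoint so the last edge is the data maximum
        qtls = np.linspace(0., 1., num=numin)
        ebins = stats.mstats.mquantiles(dat, qtls,
                                        alphap=kwargs.get('alpha', 0.4),
                                        betap=kwargs.get('beta', 0.4))
        allhist, allbin = np.histogram(dat, bins=ebins)
        # retry with one quantile fewer if edges repeat or counts are too uneven
        if (np.unique(ebins).shape != ebins.shape or not tolerence(allhist, 0.3)) and numin > 2:
            numin -= 1
        else:
            break
    return ebins

def tolerence(narray, percent):
    # accept percent as a fraction (0.3) or as a percentage (30)
    per = percent / 100. if percent > 1.0 else percent
    lev_tol = per * narray.min()
    # every bin count must lie within lev_tol of the first bin's count
    return np.all(np.abs(narray[1:] - narray[0]) < lev_tol)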

Solution 4

Just sort the data and split it into chunks of fixed length! You can never get exactly equally populated bins if the number of samples is not divisible by the number of bins, so the last bin may be smaller.

import math
import numpy as np
data = np.array([2,3,5,6,8,5,5,6,3,2,3,7,8,9,8,6,6,8,9,9,0,7,5,3,3,4,5,6,7])
data_sorted = np.sort(data)
nbins = 3
# points per bin, rounded up so that at most nbins chunks are produced
step = math.ceil(len(data_sorted) / nbins)
binned_data = []
for i in range(0, len(data_sorted), step):
    binned_data.append(data_sorted[i:i+step])
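
The loop above can also be written with `np.array_split`, which always returns exactly the requested number of chunks and spreads any remainder over the first ones. This is a minimal sketch of that alternative, not the original author's code.

```python
import numpy as np

data = np.array([2,3,5,6,8,5,5,6,3,2,3,7,8,9,8,6,6,8,9,9,0,7,5,3,3,4,5,6,7])
nbins = 3

# Sort once, then cut the sorted array into nbins nearly equal pieces.
binned_data = np.array_split(np.sort(data), nbins)

sizes = [len(chunk) for chunk in binned_data]
print(sizes)  # 29 points in 3 bins -> [10, 10, 9]
```

The bin edges, if needed, are the first element of each chunk plus the last element of the final chunk.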


Author by

astabada

My main occupation is to debug cruft I write in my free time. That is, after struggling to stay alive.

Updated on June 06, 2022

Comments

astabada about 2 years
I have a set of data and want to make a histogram of it. I need the bins to have the same size, by which I mean that they must contain the same number of objects, rather than the more common (numpy.histogram) case of equally spaced bins. This will naturally come at the expense of the bin widths, which can (and in general will) be different.

I will specify the number of desired bins and the data set, obtaining the bin edges in return.
Example:
```
data = numpy.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
bins_edges = somefunc(data, nbins=3)
print(bins_edges)
>> [1.,1.3,2.1,2.12]
```
So the bins all contain 2 points, but their widths (0.3, 0.8, 0.02) are different.

There are two limitations:
- if a group of data points is identical, the bin containing them could be bigger;
- if there are N data points and M bins are requested, there will be M bins of N//M points each, plus one extra bin for the remainder if N%M is not 0.

This piece of code is some cruft I've written, which worked nicely for small data sets. What if I have 10**9+ points and want to speed up the process?
```
import numpy as np

def def_equbin(in_distr, binsize=None, bin_num=None):
    # binsize is unused; it is kept only to preserve the original signature
    distr_size = len(in_distr)

    bin_size = distr_size // bin_num      # points per bin (integer division)
    odd_bin_size = distr_size % bin_num   # leftover points that fill no full bin

    # indices that sort the data; each bin is a contiguous slice of them
    args = in_distr.argsort()

    hist = np.zeros((bin_num, bin_size))

    for i in range(bin_num):
        hist[i, :] = in_distr[args[i * bin_size: (i + 1) * bin_size]]

    if odd_bin_size == 0:
        odd_bin = None
        bins_limits = np.arange(bin_num) * bin_size
        bins_limits = args[bins_limits]
        bins_limits = np.concatenate((in_distr[bins_limits],
                                      [in_distr[args[-1]]]))
    else:
        # the leftover points form one extra, smaller bin
        odd_bin = in_distr[args[bin_num * bin_size:]]
        bins_limits = np.arange(bin_num + 1) * bin_size
        bins_limits = args[bins_limits]
        bins_limits = in_distr[bins_limits]
        bins_limits = np.concatenate((bins_limits, [in_distr[args[-1]]]))

    return (hist, odd_bin, bins_limits)
```
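
If only the bin edges are needed, a full sort is not strictly required. The sketch below uses `np.partition`, which places just the requested order statistics in their sorted positions. It is an illustration under stated assumptions rather than a benchmarked answer: the helper name `edges_by_partition` is hypothetical, leftover points are absorbed by the last bin instead of forming a separate odd bin, and whether this helps at 10**9 points is not measured here.

```python
import numpy as np

def edges_by_partition(data, nbins):
    """Edges of nbins equal-count bins, found without fully sorting data.

    Hypothetical helper: if len(data) is not divisible by nbins,
    the last bin absorbs the leftover points.
    """
    data = np.asarray(data)
    n = data.size
    if nbins < 1 or nbins > n:
        raise ValueError("nbins must be between 1 and len(data)")
    size = n // nbins
    # Sorted positions of the first element of each bin, plus the maximum.
    idx = np.append(np.arange(nbins) * size, n - 1)
    part = np.partition(data, idx)  # only positions in idx are guaranteed sorted
    return part[idx]

data = np.array([2.1, 1., 2.12, 1.3, 1.2, 2.0])  # unsorted on purpose
edges = edges_by_partition(data, nbins=3)
print(edges)  # same edges as the example in the question
```

The result can be passed straight to `np.histogram(data, bins=edges)`.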
- aganders3 over 11 years
  
  I may not be understanding this correctly, but it sounds like you will end up with a very boring (e.g. completely flat) histogram this way. Are you just looking to find some quantiles of the data?
- astabada over 11 years
  
  Hi, you understood it correctly. Because each value is the magnitude of a galaxy, I will be able then to look at how other properties behave in each separate bin!
- ezod over 11 years
  
  This sounds more like quantiles than a histogram.
- astabada over 11 years
  
  Ah ah, brilliant! I did not know they were called quantiles, so I spent a lot of time googling "equally spaced bins" and similar... Thanks a lot!
- mrchampe over 11 years
  
  Isn't it just dandy when you learn a new term, then all of a sudden it seems google starts to work again? Happens to me all the time.
- jimh over 7 years
  
  @mrchampe the dandiest.
- Darina over 3 years
  
  In recent Pandas versions, instead of qc.categories and qc.codes you need to use qc.cat.categories and qc.cat.codes.