How to get data in a histogram bin

38,519

Solution 1

digitize, from core NumPy, will give you the index of the bin to which each value in your histogram belongs:

import numpy as NP
A = NP.random.randint(0, 10, 100)

bins = NP.array([0., 20., 40., 60., 80., 100.])

# d is an index array holding the bin id for each point in A
d = NP.digitize(A, bins)     

Solution 2

how about something like:

data = numpy.array([0, 0.5, 1.5, 1.5, 1.5, 2.5, 2.5, 2.5, 3])

hist, edges = numpy.histogram(data, bins=3)

for l, r in zip(edges[:-1], edges[1:]):
   print(data[(data > l) & (data < r)]) 

Out:

[ 0.5]
[ 1.5  1.5  1.5]
[ 2.5  2.5  2.5]

with a bit of code to handle the edge cases.

Share:
38,519
Ben
Author by

Ben

I work at iRobot where I am developing the next generation of consumer robots. However, my postings on this site are my own and don't necessarily represent iRobot's positions, strategies, or opinions.

Updated on December 08, 2020

Comments

  • Ben
    Ben over 3 years

    I want to get a list of the data contained in a histogram bin. I am using numpy, and Matplotlib. I know how to traverse the data and check the bin edges. However, I want to do this for a 2D histogram and the code to do this is rather ugly. Does numpy have any constructs to make this easier?

    For the 1D case, I can use searchsorted(). But the logic is not that much better, and I don’t really want to do a binary search on each data point when I don’t have to.

    Most of the nasty logic is due to the bin boundary regions. All regions have boundaries like this: [left edge, right edge). Except the last bin, which has a region like this: [left edge, right edge].

    Here is some sample code for the 1D case:

    import numpy as np
    
    data = [0, 0.5, 1.5, 1.5, 1.5, 2.5, 2.5, 2.5, 3]
    
    hist, edges = np.histogram(data, bins=3)
    
    print 'data =', data
    print 'histogram =', hist
    print 'edges =', edges
    
    getbin = 2  #0, 1, or 2
    
    print '---'
    print 'alg 1:'
    
    #for i in range(len(data)):
    for d in data:
        if d >= edges[getbin]:
            if (getbin == len(edges)-2) or d < edges[getbin+1]:
                print 'found:', d
            #end if
        #end if
    #end for
    
    print '---'
    print 'alg 2:'
    
    for d in data:
        val = np.searchsorted(edges, d, side='right')-1
        if val == getbin or val == len(edges)-1:
            print 'found:', d
        #end if
    #end for
    

    Here is some sample code for the 2D case:

    import numpy as np
    
    xdata = [0, 1.5, 1.5, 2.5, 2.5, 2.5, \
             0.5, 0.5, 0.5, 0.5, 1.5, 1.5, 1.5, 1.5, 1.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, \
             0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 3]
    ydata = [0, 5,5, 5, 5, 5, \
             15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, \
             25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 30]
    
    xbins = 3
    ybins = 3
    hist2d, xedges, yedges = np.histogram2d(xdata, ydata, bins=(xbins, ybins))
    
    print 'data2d =', zip(xdata, ydata)
    print 'hist2d ='
    print hist2d
    print 'xedges =', xedges
    print 'yedges =', yedges
    
    getbin2d = 5  #0 through 8
    
    print 'find data in bin #', getbin2d
    
    xedge_i = getbin2d % xbins
    yedge_i = int(getbin2d / xbins) #IMPORTANT: this is xbins
    
    for x, y in zip(xdata, ydata):
        # x and y left edges
        if x >= xedges[xedge_i] and y >= yedges[yedge_i]:
            #x right edge
            if xedge_i == xbins-1 or x < xedges[xedge_i + 1]:
                #y right edge
                if yedge_i == ybins-1 or y < yedges[yedge_i + 1]:
                    print 'found:', x, y
                #end if
            #end if
        #end if
    #end for
    

    Is there a cleaner / more efficient way to do this? It seems like numpy would have something for this.

  • Ben
    Ben about 14 years
    This is almost perfect! If there are any numpy devs here, this function should really go in the "see also" part of the histogram docs. It is too bad that the digitize() bin logic doesn't exactly match the histogram() bin logic though. So this leads to just as awkward code as the rest of my examples above.
  • Bi Rico
    Bi Rico about 12 years
    isn't this exactly the same as bins.searchsorted(A, 'right')?