Probability density function from histogram in Python to fit another histogram

12,532

You can use the cumulative distribution function (CDF) to generate random numbers from an arbitrary distribution (inverse transform sampling), as described here.
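As a minimal sketch of that idea, assume an exponential distribution, whose CDF can be inverted in closed form. The rate `lam` and the helper name `sample_exponential` are illustrative choices and not part of the original answer:

```python
import numpy as np

def sample_exponential(lam, size, rng):
    """Draw samples from Exp(lam) by inverting its CDF F(x) = 1 - exp(-lam * x)."""
    u = rng.uniform(0.0, 1.0, size)   # uniform draws on [0, 1)
    return -np.log1p(-u) / lam        # F^{-1}(u) = -ln(1 - u) / lam

rng = np.random.default_rng(0)        # seeded only for reproducibility
samples = sample_exponential(lam=2.0, size=100_000, rng=rng)
print(samples.mean())                 # should be close to 1 / lam = 0.5
```

For a histogram there is no closed-form inverse, so the inverse CDF has to be built numerically.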

Turning a histogram into a smooth cumulative distribution function is not entirely trivial. You can interpolate, for example with scipy.interpolate.interp1d(), for values between the centers of your bins, and that works fine for a histogram with a reasonably large number of bins and items. However, you have to decide on the form of the tails of the distribution, i.e. for values less than the smallest bin or greater than the largest bin. You could give your distribution Gaussian tails (based on, for example, fitting a Gaussian to your histogram), use any other form of tail appropriate to your problem, or simply truncate the distribution.

Example:

import numpy
import scipy.interpolate
import matplotlib.pyplot as pyplot

# create some normally distributed values and make a normalized histogram
a = numpy.random.normal(size=10000)
counts, bins = numpy.histogram(a, bins=100, density=True)
bin_widths = bins[1:] - bins[:-1]

# empirical CDF evaluated at the right-hand edge of each bin
x = numpy.cumsum(counts * bin_widths)
y = bins[1:]

# interpolate the inverse CDF (maps a probability back to a value)
inverse_cdf = scipy.interpolate.interp1d(x, y)

# generate more values with the same distribution
u = numpy.random.uniform(x[0], x[-1], size=10000)
b = inverse_cdf(u)

# plot both
pyplot.hist(a, 100)
pyplot.hist(b, 100)
pyplot.show()

This doesn't handle tails, and it could handle bin edges better, but it would get you started on using a histogram to generate more values with the same distribution.
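One possible way to treat the bin edges more carefully, sketched here as an assumption rather than as part of the original answer, is to pick a bin with probability equal to its mass and then draw uniformly inside that bin. This is equivalent to interpolating the CDF linearly between bin edges, it covers the first bin, and it never places samples in empty bins. It still truncates the distribution at the outermost edges, so the tails remain unhandled:

```python
import numpy as np

rng = np.random.default_rng(0)                 # seeded only for reproducibility
a = rng.normal(size=10_000)
counts, bins = np.histogram(a, bins=100, density=True)

# probability mass of each bin (density times width), renormalized
# to guard against floating-point rounding
probs = counts * np.diff(bins)
probs /= probs.sum()

# choose a bin index for every new sample, weighted by bin mass
idx = rng.choice(len(probs), size=10_000, p=probs)

# draw uniformly between the left and right edge of the chosen bin
b = rng.uniform(bins[idx], bins[idx + 1])
```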

P.S. You could also try to fit a specific known distribution described by a few parameters (which I think is what you mentioned in the question), but the non-parametric approach above is more general-purpose.
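A minimal sketch of that parametric route, assuming a normal distribution with two free parameters. The variable names, the sample data, and the choice of `scipy.stats.norm.fit` and `scipy.optimize.curve_fit` are illustrative assumptions and not taken from the question:

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)                  # seeded only for reproducibility
data = rng.normal(10, 5, 1000)                  # stand-in for the original data points

# maximum-likelihood estimate of the two free parameters from the raw data
mu, sigma = stats.norm.fit(data)

# produce new random numbers from the fitted distribution
new_samples = rng.normal(mu, sigma, 1000)

# fit the same two-parameter pdf to another normalized histogram
other = rng.normal(12, 4, 1000)                 # stand-in for the second data set
H, edges = np.histogram(other, bins=30, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

def gaussian_pdf(x, mu, sigma):
    """Normal pdf with location mu and scale sigma."""
    return stats.norm.pdf(x, loc=mu, scale=sigma)

# least-squares fit of the pdf to the bin heights, starting from the first fit
popt, pcov = optimize.curve_fit(gaussian_pdf, centers, H, p0=[mu, sigma])
print(popt)                                     # fitted (mu, sigma) of the second histogram
```

When the raw data behind the second histogram are available, fitting them directly with `stats.norm.fit` avoids the dependence on the binning.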

Author by

madzone

Updated on June 23, 2022

Comments

  • madzone
    madzone over 1 year

    I have a question concerning fitting and getting random numbers.

    The situation is as follows:

    Firstly I have a histogram from data points.

    import numpy as np

    # create random data points
    mu = 10
    sigma = 5
    n = 1000

    datapoints = np.random.normal(mu, sigma, n)

    # create a normalized histogram of the data
    bins = np.linspace(0, 20, 21)
    H, bins = np.histogram(datapoints, bins, density=True)
    
    

    I would like to interpret this histogram as a probability density function (with, e.g., 2 free parameters) so that I can use it to produce random numbers, and I would also like to use that function to fit another histogram.

    Thanks for your help

  • madzone
    madzone almost 11 years
    Thank you for the quick reply. Interpolation was also on my mind, but as you said, it can't take care of the outliers, and it is not really a density function but more a copy of the initial histogram.
  • madzone
    madzone over 10 years
    this is my final version, it works smoothly, thanks again.

    bins = np.linspace(0, .5, num=800)
    counts18, bins = np.histogram(Z_DATA[InData18], bins=bins)
    x = np.cumsum(counts18)*1./np.sum(counts18)*1.
    y = bins[range(len(x)+1)]
    y = y[1:]
    fit = scipy.interpolate.interp1d(x, y)
    plt.hist(fit(np.random.uniform(x[0], x[-1], len(data))), bins=y)
    plt.hist(data, alpha=0.3, bins=y)
    plt.show()