Plotting probability density function by sample with matplotlib


Solution 1

If you know the distribution you want to plot, define it as a function and plot it directly:

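import numpy as np
from matplotlib import pyplot as plt

def my_dist(x):
    # unnormalized Gaussian, used here as an example of a known distribution
    return np.exp(-x ** 2)

# sample finely over the region where the function is non-negligible
x = np.linspace(-5, 5, 500)
p = my_dist(x)
plt.plot(x, p)
plt.show()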

If you don't have the distribution as an analytical function, you can generate a large sample, take a histogram, and smooth the resulting bin counts:

import numpy as np
from scipy.interpolate import UnivariateSpline
from matplotlib import pyplot as plt

N = 1000
n = N//10
s = np.random.normal(size=N)   # generate your data sample with N elements
p, x = np.histogram(s, bins=n) # bin it into n = N//10 bins
x = x[:-1] + (x[1] - x[0])/2   # convert bin edges to centers
f = UnivariateSpline(x, p, s=n)
plt.plot(x, f(x))
plt.show()

You can increase or decrease s (the smoothing factor) in the UnivariateSpline call to increase or decrease smoothing. For example, the two snippets above produce: [figure: dist to func]
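To make the effect of the smoothing factor concrete, here is a minimal sketch that fits the same histogram with two values of s and compares how wiggly the resulting curves are. The seed, the value s=0.3, the use of density=True, and the roughness helper are illustrative assumptions, not part of the original answer.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)          # fixed seed so the run is repeatable
N = 1000
n = N // 10
sample = rng.normal(size=N)

# density=True rescales the bin heights so the histogram integrates to 1
p, edges = np.histogram(sample, bins=n, density=True)
centers = (edges[:-1] + edges[1:]) / 2  # bin edges -> bin centers

# s=0 forces the spline through every (noisy) bin height;
# a larger s lets the spline deviate from the bins and gives a smoother curve
rough = UnivariateSpline(centers, p, s=0)
smooth = UnivariateSpline(centers, p, s=0.3)  # arbitrary illustrative value

def roughness(f):
    """Sum of squared second differences of f evaluated at the bin centers."""
    return float(np.sum(np.diff(f(centers), n=2) ** 2))

print("roughness with s=0  :", roughness(rough))
print("roughness with s=0.3:", roughness(smooth))
```

Plotting `rough(centers)` and `smooth(centers)` against `centers` with `plt.plot` shows the same difference visually. A suitable s depends on the scale of the bin heights, so a value that works for raw counts will generally not suit a normalized histogram.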

Solution 2

What you need is gaussian_kde from scipy.stats.

Given your data, you can do something like this:

from scipy.stats import gaussian_kde
import numpy as np
from matplotlib import pyplot as plt
# create fake data
data = np.random.randn(1000)
# this creates the kernel; given an array, it estimates the probability density over those values
kde = gaussian_kde(data)
# these are the values over which the kernel will be evaluated
dist_space = np.linspace(min(data), max(data), 100)
# plot the results
plt.plot(dist_space, kde(dist_space))
plt.show()

The kernel density estimate can be configured at will and handles N-dimensional data with ease. It also avoids the spline distortion that you can see in the plot given by askewchan.
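As a rough illustration of what "configured at will" can mean, the sketch below changes the bandwidth through the bw_method argument and then builds an estimate on 2-D data. The seed, the specific bandwidth factors, and the sample sizes are assumptions chosen only for demonstration.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = rng.normal(size=1000)

# default bandwidth (Scott's rule) versus two explicit scalar factors;
# a scalar bw_method is used directly as the kernel's bandwidth factor
kde_default = gaussian_kde(data)
kde_narrow = gaussian_kde(data, bw_method=0.05)  # follows the sample noise closely
kde_wide = gaussian_kde(data, bw_method=1.0)     # heavily smoothed

grid = np.linspace(data.min(), data.max(), 200)
curve_default = kde_default(grid)
curve_wide = kde_wide(grid)

# the estimate is a proper density: it integrates to 1 over the real line
total = kde_default.integrate_box_1d(-np.inf, np.inf)

# N-dimensional data is passed with shape (n_dimensions, n_points)
xy = rng.normal(size=(2, 500))
kde2d = gaussian_kde(xy)
density_at_origin = kde2d([[0.0], [0.0]])  # evaluate at the point (0, 0)
```

A smaller factor tracks the sample more closely, while a larger one flattens and widens the curve, so the choice is a trade-off between noise and bias.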


Author: Cupitor

Updated on February 20, 2020

Comments

  • Cupitor
    Cupitor over 4 years

    I want to plot an approximation of probability density function based on a sample that I have; The curve that mimics the histogram behaviour. I can have samples as big as I want.

    • askewchan
      askewchan over 11 years
      What is your sample? Is it a distribution, or actual data?
    • Cupitor
      Cupitor over 11 years
      I don't understand how could somebody vote down this question?! I mean based on what???
    • askewchan
      askewchan over 11 years
      usually on Stack Overflow people will upvote questions that are immediately clear and also show some attempt by the asker to answer their own question. "What have you tried?" Usually downvotes are accompanied by comments though, so I'm not sure why that didn't happen in this case.
    • Cupitor
      Cupitor over 11 years
      I see. Thanks for explanation... Sometimes these things make me think democracy sucks!
    • askewchan
      askewchan over 11 years
      heh, yeah. the faq are pretty useful for outlining what people expect to be (and not to be) in a question. And aside from 'reputation' more upvotes will make your questions get more visibility and attention.
    • Cupitor
      Cupitor over 11 years
      thanks. I will try to read it :) That is also true! I will try to be more clear the next time!
    • Toma
      Toma almost 3 years
      solved my problem
  • Cupitor
    Cupitor over 11 years
    that doesn't help in my case. I already wrote my sampling function and it is not exact for samples of size one lets say!
  • askewchan
    askewchan over 11 years
    Then I think you should edit your question to be more clear. This answers your question assuming you "have the distribution".
  • Cupitor
    Cupitor over 11 years
    Thank you. But I get the following error: raise ValueError("x and y arrays must be equal in length along " ValueError: x and y arrays must be equal in length along interpolation axis.
  • askewchan
    askewchan over 11 years
    @Naji Sorry about that, it should work now, with a working example of a normal distribution.
  • Cupitor
    Cupitor over 11 years
    I still get the following error: f = UnivariateSpline(x, 0.5, s=200) File "/Library/Python/2.7/site-packages/scipy/interpolate/fitpack2.py", line 143, in __init__ xb=bbox[0],xe=bbox[1],s=s) dfitpack.error: failed in converting 2nd argument `y' of dfitpack.fpcurf0 to C/Fortran array
  • askewchan
    askewchan over 11 years
    UnivariateSpline takes two lists or arrays, x and y which must have the same shape. You've given it x and 0.5, so they're not the same shape. I've used p and x where p is the probability of finding x (plus or minus dx). p is basically your histogram height, or probability distribution, which you said you could generate.
  • Sitz Blogz
    Sitz Blogz over 8 years
    I am looking for a similar solution. I already have a data set, but I do not know what distribution it follows, so I am trying to plot a probability distribution function using Python and I don't know how to plot that. Any help is appreciated.
  • Alessandro Jacopson
    Alessandro Jacopson almost 8 years
    @SitzBlogz Let's say your data-set is called data, then just remove the line data = randn(1000) in @EnricoGiampieri answer and you're done!
  • Ajay Ohri
    Ajay Ohri over 6 years
    you should use n =int( N/10) to avoid error from float type
  • askewchan
    askewchan over 6 years
    Good point @Ajay, I should update this! When I wrote this five years ago, n was an int because I was using python 2, and most of the audience probably was too.