How can I check the distribution of a variable in Python?

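You can use the Kolmogorov-Smirnov test. It is derived for continuous distributions; when it is applied to a discrete distribution, as in the example below, the P value it reports is only approximate (see the comments). The test is available as scipy.stats.kstest: http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html#scipy.stats.kstest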

In [12]:

import scipy.stats as ss
import numpy as np
In [14]:

A=np.random.randint(0,10,100)
In [16]:

ss.kstest(A, ss.randint.cdf, args=(0,10))
#args is a tuple containing the extra parameters required by ss.randint.cdf, in this case the lower bound and the upper bound
Out[16]:
(0.12, 0.10331653831438881)
#This is a tuple of two values: the KS test statistic (D, D+ or D-) and the p-value

Here the resulting P value is 0.1033, so we conclude that the array A is not significantly different from a uniform distribution. The way to think about the P value is that it measures the probability of getting a test statistic at least as extreme as the one observed (here, the first number in the tuple), assuming the null hypothesis is true. In the KS test, the null hypothesis is that A is not different from a uniform distribution. A P value of 0.1033 is usually not considered extreme enough to reject the null hypothesis; typically the P value has to be less than 0.05 or 0.01 to reject it. If the P value in this example were less than 0.05, we would say that A is significantly different from a uniform distribution.
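
As a minimal sketch of that decision rule, the snippet below runs the KS test on continuous data, where the test applies without caveats. The fixed seed, the sample size of 1000, the 0.05 threshold and the Beta(5, 1) comparison sample are illustrative assumptions, not part of the original session.

```python
import numpy as np
import scipy.stats as ss

ALPHA = 0.05  # assumed significance level; choose it before looking at the data

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Continuous sample drawn from U(0, 1): the null hypothesis is true here.
uniform_sample = rng.uniform(0.0, 1.0, 1000)
d_uniform, p_uniform = ss.kstest(uniform_sample, ss.uniform.cdf)

# Sample drawn from Beta(5, 1), which piles up near 1: the null is false here.
skewed_sample = rng.beta(5.0, 1.0, 1000)
d_skewed, p_skewed = ss.kstest(skewed_sample, ss.uniform.cdf)

for name, d, p in [("uniform", d_uniform, p_uniform),
                   ("skewed", d_skewed, p_skewed)]:
    verdict = "reject" if p < ALPHA else "fail to reject"
    print(f"{name}: D={d:.3f}, p={p:.3g} -> {verdict} the null of uniformity")
```

Failing to reject the null is not proof that the data are uniform; it only means the sample gives no strong evidence against uniformity.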

The alternative method of using scipy.stats.chisquare():

In [17]:

import scipy.stats as ss
import numpy as np
In [18]:

A=np.random.randint(0, 10, 100)
In [19]:

FRQ=(A==np.arange(10)[...,np.newaxis]).sum(axis=1)*1./A.size #observed frequency of each value 0..9, expressed as proportions
In [20]:

ss.chisquare(FRQ) #If not specified, the default expected frequency is uniform across categories.
Out[20]:
(0.084000000000000019, 0.99999998822800984)

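The first value is the chi-square statistic and the second value is the P value. Note that scipy.stats.chisquare expects observed counts, while FRQ above holds proportions, so the statistic shown is 100 times smaller than the count-based value (8.4) and the P value is overstated. Dropping the *1./A.size factor gives the counts.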
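
A count-based variant of this check can be sketched as follows. The helper name chisquare_uniform, its parameters and the use of np.bincount are assumptions made for illustration, and the helper assumes every value lies in the half-open integer range [low, high).

```python
import numpy as np
import scipy.stats as ss

def chisquare_uniform(values, low, high):
    """Chi-square test of integer `values` against a uniform law on low..high-1.

    Hypothetical helper: returns (statistic, p_value) computed from raw counts.
    Assumes every value lies in [low, high).
    """
    values = np.asarray(values)
    # Count how often each integer in [low, high) occurs, keeping zero counts.
    counts = np.bincount(values - low, minlength=high - low)
    # With no f_exp given, chisquare assumes equal expected counts.
    statistic, p_value = ss.chisquare(counts)
    return statistic, p_value

A = np.random.randint(0, 10, 100)
statistic, p_value = chisquare_uniform(A, 0, 10)
print(statistic, p_value)
```

In a unit test, a fixed seed together with a small threshold (for example, failing only when p_value < 0.001) keeps the test from failing at random; with a threshold of 0.05, truly uniform data would still fail about one run in twenty.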

Author: eduardo.sufan

Updated on June 05, 2022

Comments

  • eduardo.sufan almost 2 years

    In a unit test I need to check that the values of an array are uniformly distributed. For example:

    In array = [1, 0, 1, 0, 1, 1, 0, 0] the values are uniformly distributed, since there are four 1s and four 0s.

    For larger array lengths, the distribution is more "uniform".

    How do I prove that the array under test has a uniform distribution?

    Note: the array is created with random.randint(min,max,len), from numpy.random

  • behzad.nouri about 10 years
    Even the scipy page that you linked to says: "The KS test is only valid for continuous distributions."
  • CT Zhu about 10 years
    @behzad.nouri, if we dig deeper, I think it is fair to say that if one applies the KS test to a discrete distribution, one can't estimate P the way it is done for a continuous distribution (from the distribution of the D statistic). You can still do it by simulation. See: cran.r-project.org/web/packages/dgof (and this was actually proposed way back: oai.dtic.mil/oai/…). I do have to check the source code of scipy.stats.kstest to see whether scipy does the latter when the supplied cdf is a discrete one.
  • ForeverLearner almost 7 years
    Hi @CTZhu, can you please explain what this line means?
  • redsk over 5 years
    @CTZhu FRQ=(A==np.arange(10)[...,np.newaxis]).sum(axis=1) , you need the frequencies
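
CT Zhu's comment mentions estimating the KS P value for discrete data by simulation. A minimal sketch of that idea follows; it is not what scipy.stats.kstest or the dgof package does internally, and the function names, the n_sim default and the seed are assumptions made for illustration. It assumes the sample holds integers in [low, high), so the largest gap between the two step-function CDFs is reached at one of the support points.

```python
import numpy as np
import scipy.stats as ss

def ks_statistic_discrete(sample, low, high):
    """Largest gap between the empirical CDF and the randint(low, high) CDF."""
    sample = np.sort(np.asarray(sample))
    support = np.arange(low, high)
    # Fraction of observations <= each support point.
    ecdf = np.searchsorted(sample, support, side="right") / sample.size
    return np.max(np.abs(ecdf - ss.randint.cdf(support, low, high)))

def simulated_ks_pvalue(sample, low, high, n_sim=2000, seed=0):
    """Monte Carlo P value: share of simulated statistics >= the observed one."""
    rng = np.random.default_rng(seed)
    n = np.asarray(sample).size
    observed = ks_statistic_discrete(sample, low, high)
    simulated = np.array([
        ks_statistic_discrete(rng.integers(low, high, n), low, high)
        for _ in range(n_sim)
    ])
    # The +1 terms keep the estimate away from an impossible P value of zero.
    p_value = (np.sum(simulated >= observed) + 1) / (n_sim + 1)
    return observed, p_value

A = np.random.randint(0, 10, 100)
print(simulated_ks_pvalue(A, 0, 10))
```

The resolution of the estimated P value is limited by n_sim: with 2000 simulations the smallest value it can report is 1/2001.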