How can I check the distribution of a variable in Python?
You can use the Kolmogorov-Smirnov test for continuous and discrete distributions. It is provided by scipy.stats.kstest:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html#scipy.stats.kstest
In [12]:
import scipy.stats as ss
import numpy as np
In [14]:
A=np.random.randint(0,10,100)
In [16]:
ss.kstest(A, ss.randint.cdf, args=(0,10))
#args is a tuple containing the extra parameter required by ss.randint.cdf, in this case, lower bound and upper bound
Out[16]:
(0.12, 0.10331653831438881)
#This is a tuple of two values: the KS test statistic (D, D+ or D-) and the p-value
Here the resulting p-value is 0.1033, so we conclude that the array A is not significantly different from a uniform distribution. The p-value measures the probability of getting a test statistic at least as extreme as the one observed (here, the first number in the tuple), assuming the null hypothesis is true. In the KS test, the null hypothesis is that A is not different from a uniform distribution. A p-value of 0.1033 is usually not considered extreme enough to reject the null hypothesis; typically the p-value has to be less than 0.05 or 0.01 to reject it. If the p-value in this example were less than 0.05, we would say that A is significantly different from a uniform distribution.
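As a minimal sketch of the decision rule described above, the snippet below wraps kstest in a small helper. The helper name `looks_uniform`, the seed and the significance level alpha=0.05 are assumptions chosen for illustration. It uses continuous uniform samples, since the SciPy documentation describes the KS test as valid for continuous distributions.

```python
import numpy as np
import scipy.stats as ss

def looks_uniform(sample, low, high, alpha=0.05):
    """Return (fail_to_reject, statistic, p_value) for a KS test against U(low, high).

    Hypothetical helper for illustration; not part of SciPy.
    """
    # ss.uniform takes loc (lower bound) and scale (width), not (low, high).
    statistic, p_value = ss.kstest(sample, ss.uniform.cdf, args=(low, high - low))
    return p_value >= alpha, statistic, p_value

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable
uniform_sample = rng.uniform(0, 10, size=500)
squeezed_sample = rng.uniform(0, 1, size=500)  # clearly not U(0, 10)

print(looks_uniform(uniform_sample, 0, 10))
print(looks_uniform(squeezed_sample, 0, 10))  # null rejected: p-value far below alpha
```

Failing to reject the null is not proof of uniformity; it only means the sample gives no strong evidence against it at the chosen alpha.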
An alternative method uses scipy.stats.chisquare():
In [17]:
import scipy.stats as ss
import numpy as np
In [18]:
A=np.random.randint(0, 10, 100)
In [19]:
FRQ=(A==np.arange(10)[...,np.newaxis]).sum(axis=1)*1./A.size #generate the observed relative-frequency table (proportions, not counts).
In [20]:
ss.chisquare(FRQ) #If not specified, the default expected frequency is uniform across categories.
Out[20]:
(0.084000000000000019, 0.99999998822800984)
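The first value is the chi-square statistic and the second value is the p-value. Note that chisquare expects observed counts, not proportions: because FRQ was divided by A.size, the statistic above is too small by a factor of A.size (100 here) and the p-value is inflated. Dropping the *1./A.size factor gives a valid test.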
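A hedged sketch of the same chi-square check run on raw counts rather than proportions, as the comment from redsk further down suggests. The use of np.bincount, the Generator API and the seed are choices made for this illustration, not part of the original answer.

```python
import numpy as np
import scipy.stats as ss

rng = np.random.default_rng(0)      # fixed seed, chosen arbitrarily
A = rng.integers(0, 10, size=100)   # integers 0..9, like np.random.randint(0, 10, 100)

# Observed counts per category; minlength keeps categories that never occur.
counts = np.bincount(A, minlength=10)

# With no f_exp given, chisquare assumes equal expected counts in every category.
statistic, p_value = ss.chisquare(counts)

# Passing proportions instead shrinks the statistic by a factor of A.size.
scaled_statistic, _ = ss.chisquare(counts / A.size)

print(statistic, p_value)
print(scaled_statistic * A.size)  # matches `statistic` up to rounding
```

The last line shows why the proportion-based call reports a p-value close to 1 regardless of the data: its statistic is the count-based one divided by the sample size.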
eduardo.sufan
Updated on June 05, 2022
Comments
-
eduardo.sufan almost 2 years
In a unit test I need to check that the distribution of the values in an array is uniform. For example, in the array
[1, 0, 1, 0, 1, 1, 0, 0]
there is a uniform distribution of values, since there are four "1"s and four "0"s. For larger array lengths, the distribution is more "uniform".
How do I prove that the array under test has a uniform distribution?
Note: the array is created with random.randint(min, max, len), from numpy.random.
-
behzad.nouri about 10 years
Even on the scipy page that you have linked to, it is written that: "The KS test is only valid for continuous distributions."
-
CT Zhu about 10 years
@behzad.nouri, if we dig deeper into this, I think it is fair to say that if one applies the KS test to a discrete distribution, one can't estimate P the way it is done for a continuous distribution (from the distribution of the D statistic). You can still do it by simulation. See: cran.r-project.org/web/packages/dgof (and this was actually proposed way back: oai.dtic.mil/oai/…). I do have to check the source code of scipy.stats.kstest to see if scipy does the latter when the supplied cdf is a discrete one.
-
ForeverLearner almost 7 years
Hi @CTZhu, can you please explain what this line means?
-
redsk over 5 years
@CTZhu FRQ=(A==np.arange(10)[...,np.newaxis]).sum(axis=1), you need the frequencies