How to calculate the statistics "t-test" with numpy

62,190

Solution 1

The scipy.stats package provides several ttest_... functions. See the example from here:

>>> print('t-statistic = %6.3f pvalue = %6.4f' % stats.ttest_1samp(x, m))
t-statistic =  0.391 pvalue = 0.6955
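
The snippet above leaves x and m undefined. As a minimal self-contained sketch, the following runs the same one-sample test on the dataset from the question. The reference mean of 50 is an assumption made purely for illustration; it does not come from the question.

```python
import numpy as np
from scipy import stats

# dataset taken from the question
x = np.array([55.0, 55.0, 47.0, 47.0, 55.0, 55.0, 55.0, 63.0])
# hypothetical population mean to test against (an assumption, not from the question)
m = 50.0

# one-sample t-test: is the mean of x different from m?
t_stat, p_value = stats.ttest_1samp(x, m)
print('t-statistic = %6.3f pvalue = %6.4f' % (t_stat, p_value))
```

For two paired datasets, scipy.stats.ttest_rel takes both arrays in the same way, and scipy.stats.ttest_ind covers independent samples.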

Solution 2

van's answer using scipy is exactly right and using the scipy.stats.ttest_* functions is very convenient.

But I came to this page looking for a pure numpy solution, as stated in the heading, to avoid the scipy dependency. To this end, let me point out the example given here: https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.standard_t.html

The main problem is that numpy does not have cumulative distribution functions, so my conclusion is that you should really use scipy. Still, it is possible using only numpy:

From the original question, I am guessing that you want to compare your datasets and use a t-test to judge whether they differ significantly, and further that the samples are paired (see https://en.wikipedia.org/wiki/Student%27s_t-test#Unpaired_and_paired_two-sample_t-tests ). In that case, you can calculate the t-value and p-value like so:

import numpy as np
sample1 = np.array([55.0, 55.0, 47.0, 47.0, 55.0, 55.0, 55.0, 63.0])
sample2 = np.array([54.0, 56.0, 48.0, 46.0, 56.0, 56.0, 55.0, 62.0])
# paired samples -> under the null hypothesis the differences have mean 0
difference = sample1 - sample2
n = len(difference)
# the t-value is easily computed with numpy
t = np.mean(difference) / (difference.std(ddof=1) / np.sqrt(n))
# unfortunately, numpy does not have a built-in CDF;
# here is a ridiculous workaround: estimate it by sampling the t-distribution
# with n - 1 degrees of freedom (paired test on n pairs)
s = np.random.standard_t(n - 1, size=100000)
p = np.sum(s < t) / float(len(s))
# using a two-sided test
print("Two-sided p-value: {:.1f} %".format(2 * min(p, 1 - p) * 100))

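This will print a value of roughly 73 % (the exact digits vary from run to run, because the CDF is estimated by random sampling). This number is the two-sided p-value: the probability of seeing a difference at least this large if the true means were equal, not the probability that the means are equal. Since it is far above any common significance level (say 5%), you cannot reject the hypothesis of equal means in this concrete case.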
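
If the run-to-run noise of the sampling estimate is a concern, one alternative is to integrate the density of the t-distribution numerically. This is a sketch that is not part of the answer above: the helper names t_pdf and two_sided_p are made up for illustration, and the step count is an arbitrary choice.

```python
import math
import numpy as np

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x**2 / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=20001):
    """Two-sided p-value: 1 minus the probability mass inside [-|t|, |t|]."""
    x = np.linspace(0.0, abs(t), steps)
    y = t_pdf(x, df)
    # trapezoidal rule written out by hand, so only basic numpy is needed
    half_mass = np.sum((y[1:] + y[:-1]) / 2 * np.diff(x))
    # the density is symmetric around 0, so the mass in [-|t|, |t|] is twice half_mass
    return 1.0 - 2.0 * half_mass

sample1 = np.array([55.0, 55.0, 47.0, 47.0, 55.0, 55.0, 55.0, 63.0])
sample2 = np.array([54.0, 56.0, 48.0, 46.0, 56.0, 56.0, 55.0, 62.0])
difference = sample1 - sample2
n = len(difference)

# paired t-statistic, with n - 1 degrees of freedom
t = difference.mean() / (difference.std(ddof=1) / np.sqrt(n))
p = two_sided_p(t, n - 1)
print("t = {:.3f}, two-sided p = {:.3f}".format(t, p))
```

This gives the same answer on every run and agrees with the sampling estimate to within its noise, but scipy remains the more convenient choice when the dependency is acceptable.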

Author: Mark

Updated on June 27, 2020

Comments

  • Mark
    Mark almost 4 years

    I'm looking to generate some statistics about a model I created in python. I'd like to generate the t-test on it, but was wondering if there was an easy way to do this with numpy/scipy. Are there any good explanations around?

    For example, I have three related datasets that look like this:

    [55.0, 55.0, 47.0, 47.0, 55.0, 55.0, 55.0, 63.0]
    

    Now, I would like to do the student's t-test on them.