Difference(s) between scipy.stats.linregress, numpy.polynomial.polynomial.polyfit and statsmodels.api.OLS
Solution 1
The three are very different but overlap in the parameter estimation for the very simple example with only one explanatory variable.
By increasing generality:
scipy.stats.linregress
only handles the case of a single explanatory variable with specialized code and calculates a few extra statistics.
numpy.polynomial.polynomial.polyfit
estimates the regression for a polynomial of a single variable, but doesn't return much in terms of extra statisics.
statsmodels
OLS
is a generic linear model (OLS) estimation class. It doesn't prespecify what the explanatory variables are and can handle any multivariate array of explanatory variables, or formulas and pandas DataFrames. It not only returns the estimated parameters, but also a large set of results staistics and methods for statistical inference and prediction.
For completeness of options for estimating linear models in Python (outside of Bayesian analysis), we should also consider scikit-learn
LinearRegression
and similar linear models, which are useful for selecting among a large number of explanatory variables but does not have the large number of results that statsmodels provides.
Solution 2
Scipy seems quite a bit faster -- this is actually the opposite of what I would have expected by the way!
x = np.random.random(100000)
y = np.random.random(100000)
%timeit numpy.polynomial.polynomial.polyfit(x, y, 1)
100 loops, best of 3: 8.89 ms per loop
%timeit scipy.stats.linregress(x,y)
100 loops, best of 3: 1.67 ms per loop
Related videos on Youtube
MLister
Updated on September 16, 2022Comments
-
MLister almost 2 years
It seems all three functions can do simple linear regression, e.g.
scipy.stats.linregress(x, y) numpy.polynomial.polynomial.polyfit(x, y, 1) x = statsmodels.api.add_constant(x) statsmodels.api.OLS(y, x)
I wonder if there is any real difference between the three methods? I know that
statsmodels
are built on top ofscipy
, andscipy
is kinda dependent onnumpy
for many things, so I expect that they should not differ vastly, but devil is always in the details.More specifically, if we use the
numpy
method above, how do we get thep-value
of the slope which is given by default by the other two methods?I am using them in Python 3, if that makes any difference.
-
ali_m almost 9 yearsIt's not surprising at all when you consider how much more general
np.polyfit
is - it is not really designed for linear regression, but can instead fit a polynomial of arbitrary order to the relationship between x & y (whereaslinregress
can only fit a line). -
Luke Davis almost 7 yearsInterestingly,
polyfit
is actually quite a lot faster than bothscipy.stats.linregress
andstatsmodels.api.OLS
for small vectors. Try it again with series of length 100-1000 or so. You can also request some extra parameters, likecov
(the covariance matrix for the estimators; the diagonal gives the square standard errors), without much time penalty. So for use with smaller series, it seemsnumpy
is superior.