OLS Regression: Scikit vs. Statsmodels?

32,581

Solution 1

It sounds like you are not feeding the same matrix of regressors X to both procedures (but see below). Here's an example to show you which options you need to use for sklearn and statsmodels to produce identical results.

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Generate artificial data (2 regressors + constant)
nobs = 100 
X = np.random.random((nobs, 2)) 
X = sm.add_constant(X)
beta = [1, .1, .5] 
e = np.random.random(nobs)
y = np.dot(X, beta) + e 

# Fit regression model
sm.OLS(y, X).fit().params
>> array([ 1.4507724 ,  0.08612654,  0.60129898])

LinearRegression(fit_intercept=False).fit(X, y).coef_
>> array([ 1.4507724 ,  0.08612654,  0.60129898])

As a commenter suggested, even if you are giving both programs the same X, X may not have full column rank, and they sm/sk could be taking (different) actions under-the-hood to make the OLS computation go through (i.e. dropping different columns).

I recommend you use pandas and patsy to take care of this:

import pandas as pd
from patsy import dmatrices

dat = pd.read_csv('wow.csv')
y, X = dmatrices('levels ~ week + character + guild', data=dat)

Or, alternatively, the statsmodels formula interface:

import statsmodels.formula.api as smf
dat = pd.read_csv('wow.csv')
mod = smf.ols('levels ~ week + character + guild', data=dat).fit()

Edit: This example might be useful: http://statsmodels.sourceforge.net/devel/example_formulas.html

Solution 2

If you use statsmodels, I would highly recommend using the statsmodels formula interface instead. You will get the same old result from OLS using the statsmodels formula interface as you would from sklearn.linear_model.LinearRegression, or R, or SAS, or Excel.

smod = smf.ols(formula ='y~ x', data=df)
result = smod.fit()
print(result.summary())

When in doubt, please

  1. try reading the source code
  2. try a different language for benchmark, or
  3. try OLS from scratch, which is basic linear algebra.
Share:
32,581
Nat Poor
Author by

Nat Poor

Updated on July 09, 2022

Comments

  • Nat Poor
    Nat Poor almost 2 years

    Short version: I was using the scikit LinearRegression on some data, but I'm used to p-values so put the data into the statsmodels OLS, and although the R^2 is about the same the variable coefficients are all different by large amounts. This concerns me since the most likely problem is that I've made an error somewhere and now I don't feel confident in either output (since likely I have made one model incorrectly but don't know which one).

    Longer version: Because I don't know where the issue is, I don't know exactly which details to include, and including everything is probably too much. I am also not sure about including code or data.

    I am under the impression that scikit's LR and statsmodels OLS should both be doing OLS, and as far as I know OLS is OLS so the results should be the same.

    For scikit's LR, the results are (statistically) the same whether or not I set normalize=True or =False, which I find somewhat strange.

    For statsmodels OLS, I normalize the data using StandardScaler from sklearn. I add a column of ones so it includes an intercept (since scikit's output includes an intercept). More on that here: http://statsmodels.sourceforge.net/devel/examples/generated/example_ols.html (Adding this column did not change the variable coefficients to any notable degree and the intercept was very close to zero.) StandardScaler didn't like that my ints weren't floats, so I tried this: https://github.com/scikit-learn/scikit-learn/issues/1709 That makes the warning go away but the results are exactly the same.

    Granted I'm using 5-folds cv for the sklearn approach (R^2 are consistent for both test and training data each time), and for statsmodels I just throw it all the data.

    R^2 is about 0.41 for both sklearn and statsmodels (this is good for social science). This could be a good sign or just a coincidence.

    The data is observations of avatars in WoW (from http://mmnet.iis.sinica.edu.tw/dl/wowah/) which I munged about to make it weekly with some different features. Originally this was a class project for a data science class.

    Independent variables include number of observations in a week (int), character level (int), if in a guild (Boolean), when seen (Booleans on weekday day, weekday eve, weekday late, and the same three for weekend), a dummy for character class (at the time for the data collection, there were only 8 classes in WoW, so there are 7 dummy vars and the original string categorical variable is dropped), and others.

    The dependent variable is how many levels each character gained during that week (int).

    Interestingly, some of the relative order within like variables is maintained across statsmodels and sklearn. So, rank order of "when seen" is the same although the loadings are very different, and rank order for the character class dummies is the same although again the loadings are very different.

    I think this question is similar to this one: Difference in Python statsmodels OLS and R's lm

    I am good enough at Python and stats to make a go of it, but then not good enough to figure something like this out. I tried reading the sklearn docs and the statsmodels docs, but if the answer was there staring me in the face I did not understand it.

    I would love to know:

    1. Which output might be accurate? (Granted they might both be if I missed a kwarg.)
    2. If I made a mistake, what is it and how to fix it?
    3. Could I have figured this out without asking here, and if so how?

    I know this question has some rather vague bits (no code, no data, no output), but I am thinking it is more about the general processes of the two packages. Sure, one seems to be more stats and one seems to be more machine learning, but they're both OLS so I don't understand why the outputs aren't the same.

    (I even tried some other OLS calls to triangulate, one gave a much lower R^2, one looped for five minutes and I killed it, and one crashed.)

    Thanks!

  • Nat Poor
    Nat Poor about 10 years
    Awesome thanks. Let me... well I'll post the functions I built, and then come back and try to apply these ideas. I do understand what "not feeding the same matrix" means, #win... I would hope I didn't mess up at that level, but of course it is possible.
  • Nat Poor
    Nat Poor about 10 years
    Actually I will try the code sample here before pasting in 75 lines of code (my two different function calls). I don't want to waste people's time by having them read over code if the answer is here already. (Granted, I might end up posting it if this code works and then I can't quite figure out where I went wrong, but one step at a time.) Should be able to get to it sometime today (maybe later). Thanks everyone!
  • Nat Poor
    Nat Poor about 10 years
    Ok! That code did indeed get me the same results across the two libraries for the same data! Nice! However, the numbers are totally different from the previous two I have -- good thing I asked here! I will work on figuring that out, now that I have a good starting point and some numbers that I think I can trust. (I'm a little disappointed that I managed to make two regressions and yet they went totally awry... Maybe I should stick to SPSS and R.... no way!)
  • Nat Poor
    Nat Poor about 10 years
    Summary: Ok I got SM with normalizing (StandardScaler) and also SK with CV (and with SS) to work with roughly the same results. The problem seems to be that I had to convert the integers to numpy floats (at this point I cannot recall why), and that worked for both the SM and SK (no CV) versions (worked meaning, they gave the same results and I am confident those results are accurate). When I added CV to the working SK function (with numpy floats), the R^2 went to like -5000. So, something(? perhaps obvious?) isn't working between the CV and the np floats. I take np floats out and it is ok!
  • Palu
    Palu almost 6 years
    Hi, i just wanted to add here, that in terms of sklearn, it does not use OLS method for linear regression under the hood. Since sklearn comes from the data-mining/machine-learning realm, they like to use Steepest Descent Gradient algorithm. This is a numerical method that is sensitive to initial conditions etc, while the OLS is an analytical closed form approach, so one should expect differences. So statsmodels comes from classical statistics field hence they would use OLS technique. So there are differences between the two linear regressions from the 2 different libraries.
  • Sarah
    Sarah over 4 years
    This answer is wrong. LinearRegression from sklearn uses OLS. Just look at the souce code: github.com/scikit-learn/scikit-learn/blob/1495f6924/sklearn/‌​…
  • Palu
    Palu over 4 years
    Hi, back where i answered this, i contacted the guys at sklearn, and they informed me that they did not have OLS implementation only SDG algorithm. But i did not try to look into the git code base. So thanks for finding this out Sarah. SO either the person that replied back to me was not aware, or they more recently implemented OLS. Either way, thanks for pointing this out Sarah, really appreciate it.
  • Sarah
    Sarah over 4 years
    Thank you Palu for responding with your nice comments :)
  • WestCoastProjects
    WestCoastProjects about 4 years
    statsmodels is way more friendly than scikit-learn. i'm about done with (mostly failing) to decipher the incomprehensible input and output array/matrix formats required to the latter