Run an OLS regression with Pandas Data Frame

Solution 1

I think you can do almost exactly what you thought would be ideal, using the statsmodels package, which was one of pandas' optional dependencies before pandas version 0.20.0 (it was used for a few things in pandas.stats).

>>> import pandas as pd
>>> import statsmodels.formula.api as sm
>>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
>>> result = sm.ols(formula="A ~ B + C", data=df).fit()
>>> print(result.params)
Intercept    14.952480
B             0.401182
C             0.000352
dtype: float64
>>> print(result.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      A   R-squared:                       0.579
Model:                            OLS   Adj. R-squared:                  0.158
Method:                 Least Squares   F-statistic:                     1.375
Date:                Thu, 14 Nov 2013   Prob (F-statistic):              0.421
Time:                        20:04:30   Log-Likelihood:                -18.178
No. Observations:                   5   AIC:                             42.36
Df Residuals:                       2   BIC:                             41.19
Df Model:                           2                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     14.9525     17.764      0.842      0.489       -61.481    91.386
B              0.4012      0.650      0.617      0.600        -2.394     3.197
C              0.0004      0.001      0.650      0.583        -0.002     0.003
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.061
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.498
Skew:                          -0.123   Prob(JB):                        0.780
Kurtosis:                       1.474   Cond. No.                     5.21e+04
==============================================================================

Warnings:
[1] The condition number is large, 5.21e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
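Since the question is about predicting A from B and C, here is a minimal sketch of using the fitted formula model for prediction. It assumes statsmodels is installed; the `smf` alias and the values in `new_data` are made up for illustration and are not part of the original answer.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

# Fit the same model as above using the formula interface.
result = smf.ols(formula="A ~ B + C", data=df).fit()

# Hypothetical new observations; only the predictor columns are needed.
new_data = pd.DataFrame({"B": [25, 45], "C": [100, 1000]})

# predict() applies the formula to the new frame and returns one value per row.
predicted = result.predict(new_data)
print(predicted)
```

Each prediction is just Intercept + B * coef_B + C * coef_C, evaluated with the fitted parameters in result.params.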

Solution 2

Note: pandas.stats has been removed with 0.20.0


It's possible to do this with pandas.stats.ols:

>>> from pandas.stats.api import ols
>>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
>>> res = ols(y=df['A'], x=df[['B','C']])
>>> res
-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <B> + <C> + <intercept>

Number of Observations:         5
Number of Degrees of Freedom:   3

R-squared:         0.5789
Adj R-squared:     0.1577

Rmse:             14.5108

F-stat (2, 2):     1.3746, p-value:     0.4211

Degrees of Freedom: model 2, resid 2

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             B     0.4012     0.6497       0.62     0.5999    -0.8723     1.6746
             C     0.0004     0.0005       0.65     0.5826    -0.0007     0.0014
     intercept    14.9525    17.7643       0.84     0.4886   -19.8655    49.7705
---------------------------------End of Summary---------------------------------

Note that you need to have the statsmodels package installed; it is used internally by the pandas.stats.ols function.

Solution 3

I don't know if this is new in sklearn or pandas, but I'm able to pass the data frame directly to sklearn without converting the data frame to a numpy array or any other data types.

from sklearn import linear_model

reg = linear_model.LinearRegression()
reg.fit(df[['B', 'C']], df['A'])

>>> reg.coef_
array([  4.01182386e-01,   3.51587361e-04])
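A slightly fuller sketch of the same approach, assuming scikit-learn and pandas are installed. It also reads the intercept, the R-squared on the training data, and a prediction for one made-up row (the values in `new` are hypothetical).

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

X = df[["B", "C"]]  # predictors stay a DataFrame
y = df["A"]         # target stays a Series

reg = LinearRegression().fit(X, y)

print(reg.intercept_)    # fitted intercept
print(reg.coef_)         # coefficients in the column order of X
print(reg.score(X, y))   # R-squared on the data used for fitting

# Hypothetical new row with the same column names as X.
new = pd.DataFrame({"B": [25], "C": [100]})
print(reg.predict(new))
```

Unlike statsmodels, scikit-learn does not report standard errors or p-values for the coefficients.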

Solution 4

This would require me to reformat the data into lists inside lists, which seems to defeat the purpose of using pandas in the first place.

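No, it doesn't; just convert to a NumPy array:

>>> import numpy as np
>>> data = np.asarray(df)

When all columns share one dtype, as they do here, this typically takes constant time because it just creates a view on your data; with mixed dtypes pandas has to copy. Then feed it to scikit-learn: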

>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression()
>>> X, y = data[:, 1:], data[:, 0]
>>> lr.fit(X, y)
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
>>> lr.coef_
array([  4.01182386e-01,   3.51587361e-04])
>>> lr.intercept_
14.952479503953672
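Whether np.asarray(df) is a view or a copy depends on the column dtypes, and the details can vary between pandas versions. The following sketch shows one way to check; the helper name `converts_without_copy` and the two small frames are made up for illustration.

```python
import numpy as np
import pandas as pd

# All columns are integers, so pandas can store them in a single block.
same_dtype = pd.DataFrame({"A": [10, 20, 30], "B": [20, 30, 10]})

# Integer and float columns must be combined into a new array on conversion.
mixed_dtype = pd.DataFrame({"A": [10, 20, 30], "B": [0.5, 1.5, 2.5]})


def converts_without_copy(frame):
    """Return True if two separate conversions share the same memory,
    which indicates that np.asarray returned a view rather than a copy."""
    return np.shares_memory(np.asarray(frame), np.asarray(frame))


print(converts_without_copy(same_dtype))
print(converts_without_copy(mixed_dtype))
```

For the mixed-dtype frame every conversion allocates a fresh array, so the cost grows with the size of the data.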

Solution 5

Statsmodels can build an OLS model with column references directly to a pandas dataframe.

Short and sweet:

model = sm.OLS(df[y], df[x]).fit()


Code details and regression summary:

# imports
import pandas as pd
import statsmodels.api as sm
import numpy as np

# data
np.random.seed(123)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=list('ABC'))

# assign dependent and independent / explanatory variables
variables = list(df.columns)
y = 'A'
x = [var for var in variables if var != y]

# Ordinary least squares regression
model_Simple = sm.OLS(df[y], df[x]).fit()

# Add a constant term like so:
model = sm.OLS(df[y], sm.add_constant(df[x])).fit()

model.summary()

Output:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      A   R-squared:                       0.019
Model:                            OLS   Adj. R-squared:                 -0.001
Method:                 Least Squares   F-statistic:                    0.9409
Date:                Thu, 14 Feb 2019   Prob (F-statistic):              0.394
Time:                        08:35:04   Log-Likelihood:                -484.49
No. Observations:                 100   AIC:                             975.0
Df Residuals:                      97   BIC:                             982.8
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         43.4801      8.809      4.936      0.000      25.996      60.964
B              0.1241      0.105      1.188      0.238      -0.083       0.332
C             -0.0752      0.110     -0.681      0.497      -0.294       0.144
==============================================================================
Omnibus:                       50.990   Durbin-Watson:                   2.013
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                6.905
Skew:                           0.032   Prob(JB):                       0.0317
Kurtosis:                       1.714   Cond. No.                         231.
==============================================================================

How to directly get R-squared, Coefficients and p-value:

# commands:
model.params
model.pvalues
model.rsquared

# demo:
In[1]: 
model.params
Out[1]:
const    43.480106
B         0.124130
C        -0.075156
dtype: float64

In[2]: 
model.pvalues
Out[2]: 
const    0.000003
B        0.237924
C        0.497400
dtype: float64

In[3]: 
model.rsquared
Out[3]:
0.0190
Author: Michael

Updated on February 14, 2021

Comments

  • Michael
    Michael over 3 years

    I have a pandas data frame and I would like to be able to predict the values of column A from the values in columns B and C. Here is a toy example:

    import pandas as pd
    df = pd.DataFrame({"A": [10,20,30,40,50], 
                       "B": [20, 30, 10, 40, 50], 
                       "C": [32, 234, 23, 23, 42523]})
    

    Ideally, I would have something like ols(A ~ B + C, data = df) but when I look at the examples from algorithm libraries like scikit-learn it appears to feed the data to the model with a list of rows instead of columns. This would require me to reformat the data into lists inside lists, which seems to defeat the purpose of using pandas in the first place. What is the most pythonic way to run an OLS regression (or any machine learning algorithm more generally) on data in a pandas data frame?

  • cjohnson318
    cjohnson318 over 10 years
    I had to do np.matrix( np.asarray( df ) ), because sklearn expected a vertical vector, whereas numpy arrays, once you slice them off an array, act like horizontal vectors, which is great most of the time.
  • MichaelChirico
    MichaelChirico over 9 years
    no simple way to do tests of the coefficients with this route, however
  • Femto Trader
    Femto Trader about 9 years
    Isn't there a way to directly feed Scikit-Learn with Pandas DataFrame ?
  • szeitlin
    szeitlin about 9 years
    for other sklearn modules (decision tree, etc), I've used df['colname'].values, but that didn't work for this.
  • denfromufa
    denfromufa about 8 years
    Note that this is going to be deprecated in a future version of pandas!
  • denfromufa
    denfromufa over 7 years
    Note that correct keyword is formula, I accidentally typed formulas instead and got weird error: TypeError: from_formula() takes at least 3 arguments (2 given)
  • FaCoffee
    FaCoffee over 7 years
    Why are they doing it? I sincerely hope this function survives! It is REALLY useful and quick!
  • 3novak
    3novak over 7 years
    You could also use the .values attribute. I.e., reg.fit(df[['B', 'C']].values, df['A'].values).
  • WestCoastProjects
    WestCoastProjects over 7 years
    The pandas.stats.ols module is deprecated and will be removed in a future version. We refer to external packages like statsmodels, see some examples here: http://www.statsmodels.org/stable/regression.html
  • a.powell
    a.powell about 7 years
    @DSM Very new to python. Tried running your same code and got errors on both print messages: print result.summary() ^ SyntaxError: invalid syntax >>> print result.parmas File "<stdin>", line 1 print result.parmas ^ SyntaxError: Missing parentheses in call to 'print'...Maybe I loaded packages wrong?? It appears to work when I don't put "print". Thanks.
  • Party Time
    Party Time about 7 years
    @a.powell The OP's code is for Python 2. The only change I think you need to make is to put parentheses round the arguments to print: print(result.params) and print(result.summary())
  • WestCoastProjects
    WestCoastProjects almost 7 years
    @DestaHaileselassieHagos This may be due to an issue with missing intercepts. The designer of the equivalent R package adjusts by removing the adjustment for the mean: stats.stackexchange.com/a/36068/64552. Other suggestions: you can use sm.add_constant to add an intercept to the exog array and use a dict: reg = ols("y ~ x", data=dict(y=y,x=x)).fit()
  • Desta Haileselassie Hagos
    Desta Haileselassie Hagos almost 7 years
    I would appreciate if you could have a look at this and thank you: stackoverflow.com/questions/44923808/…
  • S3DEV
    S3DEV over 6 years
    Small diversion from the OP - but I found this particular answer very helpful, after appending .values.reshape(-1, 1) to the dataframe columns. For example: x_data = df['x_data'].values.reshape(-1, 1) and passing the x_data (and a similarly created y_data) np arrays into the .fit() method.
  • 3kstc
    3kstc about 6 years
    It was a sad day when they removed the pandas.stats 💔
  • CPBL
    CPBL about 6 years
    @RomanPekar : ... Sorry, but do standards allow making your removal comment at the top of your answer larger and bolder? :) Or moving it below the grey line. My eyes kept going to the "It's possible" part...
  • 3pitt
    3pitt almost 6 years
    attempting to use this formula() approach throws the type error TypeError: __init__() missing 1 required positional argument: 'endog', so i guess it's deprecated. also, ols is now OLS
  • Lucas H
    Lucas H about 5 years
    As others mention, sm.ols has been deprecated in favor of sm.OLS. The default behavior is also different. To run a regression from formula as done here, you need to do: result = sm.OLS.from_formula(formula="A ~ B + C", data=df).fit()
  • Bill
    Bill about 5 years
    As far as I can tell, as of Mar 2019, this is the only working example of doing a regression from a pandas DataFrame on the entire Internet.
  • Golden Lion
    Golden Lion over 3 years
    there is a strange data item for C 42523. It is an outlier. It should probably be removed or imputed to the average less 425323