Run an OLS regression with Pandas Data Frame
Solution 1
I think you can do almost exactly what you thought would be ideal by using the statsmodels package, which was one of pandas' optional dependencies before pandas version 0.20.0 (it was used for a few things in pandas.stats).
>>> import pandas as pd
>>> import statsmodels.formula.api as sm
>>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
>>> result = sm.ols(formula="A ~ B + C", data=df).fit()
>>> print(result.params)
Intercept 14.952480
B 0.401182
C 0.000352
dtype: float64
>>> print(result.summary())
OLS Regression Results
==============================================================================
Dep. Variable: A R-squared: 0.579
Model: OLS Adj. R-squared: 0.158
Method: Least Squares F-statistic: 1.375
Date: Thu, 14 Nov 2013 Prob (F-statistic): 0.421
Time: 20:04:30 Log-Likelihood: -18.178
No. Observations: 5 AIC: 42.36
Df Residuals: 2 BIC: 41.19
Df Model: 2
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 14.9525 17.764 0.842 0.489 -61.481 91.386
B 0.4012 0.650 0.617 0.600 -2.394 3.197
C 0.0004 0.001 0.650 0.583 -0.002 0.003
==============================================================================
Omnibus: nan Durbin-Watson: 1.061
Prob(Omnibus): nan Jarque-Bera (JB): 0.498
Skew: -0.123 Prob(JB): 0.780
Kurtosis: 1.474 Cond. No. 5.21e+04
==============================================================================
Warnings:
[1] The condition number is large, 5.21e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Solution 2
Note: pandas.stats was removed in pandas 0.20.0, so the code below only runs on older versions.
On those older versions, it's possible to do this with pandas.stats.ols:
>>> import pandas as pd
>>> from pandas.stats.api import ols
>>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
>>> res = ols(y=df['A'], x=df[['B','C']])
>>> res
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <B> + <C> + <intercept>
Number of Observations: 5
Number of Degrees of Freedom: 3
R-squared: 0.5789
Adj R-squared: 0.1577
Rmse: 14.5108
F-stat (2, 2): 1.3746, p-value: 0.4211
Degrees of Freedom: model 2, resid 2
-----------------------Summary of Estimated Coefficients------------------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
B 0.4012 0.6497 0.62 0.5999 -0.8723 1.6746
C 0.0004 0.0005 0.65 0.5826 -0.0007 0.0014
intercept 14.9525 17.7643 0.84 0.4886 -19.8655 49.7705
---------------------------------End of Summary---------------------------------
Note that you need to have the statsmodels package installed; it is used internally by the pandas.stats.ols function.
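Because pandas.stats.ols is not available in current pandas, the coefficients above can be cross-checked without it. The following is a minimal sketch, assuming only NumPy and pandas are installed; the intercept column is built by hand, and the names X, y, beta, and coefs are illustrative choices rather than part of any library API.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

# Design matrix: a column of ones for the intercept, then B and C.
X = np.column_stack([np.ones(len(df)), df[["B", "C"]].to_numpy(dtype=float)])
y = df["A"].to_numpy(dtype=float)

# Least-squares solution of X @ beta ~= y.
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
coefs = pd.Series(beta, index=["intercept", "B", "C"])

# R-squared = 1 - SS_res / SS_tot.
residuals = y - X @ beta
centered = y - y.mean()
r_squared = 1.0 - (residuals @ residuals) / (centered @ centered)

print(coefs)
print(r_squared)
```

This only reproduces the point estimates and R-squared; standard errors, t-statistics, and p-values still need a statistics package such as statsmodels.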
Solution 3
I don't know if this is new in sklearn or pandas, but I'm able to pass the data frame directly to sklearn without converting it to a NumPy array or any other data type.
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(df[['B', 'C']], df['A'])
>>> reg.coef_
array([ 4.01182386e-01, 3.51587361e-04])
Solution 4
This would require me to reformat the data into lists inside lists, which seems to defeat the purpose of using pandas in the first place.
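No, it doesn't; just convert to a NumPy array:
>>> import numpy as np
>>> data = np.asarray(df)
When all columns share one dtype, as they do here, this is usually cheap because pandas can often return a view rather than a copy; mixed dtypes force a copy to a common dtype. Then feed it to scikit-learn: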
>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression()
>>> X, y = data[:, 1:], data[:, 0]
>>> lr.fit(X, y)
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
>>> lr.coef_
array([ 4.01182386e-01, 3.51587361e-04])
>>> lr.intercept_
14.952479503953672
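Slicing by position (data[:, 1:] and data[:, 0]) silently depends on the column order of the frame. A hedged alternative sketch selects the columns by name with DataFrame.to_numpy(), which is available in recent pandas versions. The np.shares_memory call only reports whether the converted array shares a buffer with the frame; no result is stated here because it depends on the dtypes and the pandas version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

# Select by column name so the result does not depend on column order.
X = df[["B", "C"]].to_numpy()
y = df["A"].to_numpy()
print(X.shape, y.shape)

# Report whether converting the whole frame shares memory with a column.
# The answer varies with dtypes and pandas version, so it is only printed.
whole = np.asarray(df)
print(np.shares_memory(whole, df["A"].to_numpy()))
```

X and y can then be passed to lr.fit(X, y) exactly as in the snippet above.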
Solution 5
Statsmodels can build an OLS model with column references directly to a pandas DataFrame.
Short and sweet:
model = sm.OLS(df[y], df[x]).fit()
Code details and regression summary:
# imports
import pandas as pd
import statsmodels.api as sm
import numpy as np
# data
np.random.seed(123)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=list('ABC'))
# assign dependent and independent / explanatory variables
variables = list(df.columns)
y = 'A'
x = [var for var in variables if var != y]  # every column except the dependent variable
# Ordinary least squares regression (no intercept term)
model_Simple = sm.OLS(df[y], df[x]).fit()
# Add a constant term like so:
model = sm.OLS(df[y], sm.add_constant(df[x])).fit()
model.summary()
Output:
OLS Regression Results
==============================================================================
Dep. Variable: A R-squared: 0.019
Model: OLS Adj. R-squared: -0.001
Method: Least Squares F-statistic: 0.9409
Date: Thu, 14 Feb 2019 Prob (F-statistic): 0.394
Time: 08:35:04 Log-Likelihood: -484.49
No. Observations: 100 AIC: 975.0
Df Residuals: 97 BIC: 982.8
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 43.4801 8.809 4.936 0.000 25.996 60.964
B 0.1241 0.105 1.188 0.238 -0.083 0.332
C -0.0752 0.110 -0.681 0.497 -0.294 0.144
==============================================================================
Omnibus: 50.990 Durbin-Watson: 2.013
Prob(Omnibus): 0.000 Jarque-Bera (JB): 6.905
Skew: 0.032 Prob(JB): 0.0317
Kurtosis: 1.714 Cond. No. 231.
==============================================================================
How to directly get R-squared, Coefficients and p-value:
# commands:
model.params
model.pvalues
model.rsquared
# demo:
In[1]:
model.params
Out[1]:
const 43.480106
B 0.124130
C -0.075156
dtype: float64
In[2]:
model.pvalues
Out[2]:
const 0.000003
B 0.237924
C 0.497400
dtype: float64
In[3]:
model.rsquared
Out[3]:
0.0190
Michael
Updated on February 14, 2021
Comments
-
Michael over 3 years
I have a pandas data frame and I would like to be able to predict the values of column A from the values in columns B and C. Here is a toy example:
import pandas as pd
df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
Ideally, I would have something like ols(A ~ B + C, data = df), but when I look at the examples from algorithm libraries like scikit-learn, it appears to feed the data to the model with a list of rows instead of columns. This would require me to reformat the data into lists inside lists, which seems to defeat the purpose of using pandas in the first place. What is the most pythonic way to run an OLS regression (or any machine learning algorithm more generally) on data in a pandas data frame?
-
cjohnson318 over 10 years
I had to do np.matrix( np.asarray( df ) ), because sklearn expected a vertical vector, whereas numpy arrays, once you slice them off an array, act like horizontal vectors, which is great most of the time.
-
MichaelChirico over 9 years
No simple way to do tests of the coefficients with this route, however.
-
Femto Trader about 9 years
Isn't there a way to directly feed scikit-learn with a pandas DataFrame?
-
szeitlin about 9 years
For other sklearn modules (decision tree, etc.), I've used df['colname'].values, but that didn't work for this.
-
denfromufa about 8 years
Note that this is going to be deprecated in a future version of pandas!
-
denfromufa over 7 years
Note that the correct keyword is formula; I accidentally typed formulas instead and got a weird error: TypeError: from_formula() takes at least 3 arguments (2 given)
-
FaCoffee over 7 years
Why are they doing it? I vividly hope this function survives! It is REALLY useful and quick!
-
3novak over 7 years
You could also use the .values attribute, i.e., reg.fit(df[['B', 'C']].values, df['A'].values).
-
WestCoastProjects over 7 years
The pandas.stats.ols module is deprecated and will be removed in a future version. We refer to external packages like statsmodels, see some examples here: http://www.statsmodels.org/stable/regression.html
-
a.powell about 7 years
@DSM Very new to python. Tried running your same code and got errors on both print messages: print result.summary() ^ SyntaxError: invalid syntax >>> print result.parmas File "<stdin>", line 1 print result.parmas ^ SyntaxError: Missing parentheses in call to 'print'... Maybe I loaded packages wrong?? It appears to work when I don't put "print". Thanks.
-
Party Time about 7 years
@a.powell The OP's code is for Python 2. The only change I think you need to make is to put parentheses round the arguments to print: print(result.params) and print(result.summary())
-
WestCoastProjects almost 7 years
@DestaHaileselassieHagos This may be due to an issue with missing intercepts. The designer of the equivalent R package adjusts by removing the adjustment for the mean: stats.stackexchange.com/a/36068/64552. Other suggestions: you can use sm.add_constant to add an intercept to the exog array, and use a dict: reg = ols("y ~ x", data=dict(y=y,x=x)).fit()
-
Desta Haileselassie Hagos almost 7 years
I would appreciate it if you could have a look at this, thank you: stackoverflow.com/questions/44923808/…
-
S3DEV over 6 years
Small diversion from the OP, but I found this particular answer very helpful after appending .values.reshape(-1, 1) to the dataframe columns. For example: x_data = df['x_data'].values.reshape(-1, 1), then passing the x_data (and a similarly created y_data) np arrays into the .fit() method.
-
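The shape issue raised in several comments (a single column comes out one-dimensional, while scikit-learn estimators expect a two-dimensional feature matrix) can be seen directly. A small sketch, assuming the toy frame from the question; the variable names are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

# A single column selected with one pair of brackets is 1-D.
col_1d = df["B"].to_numpy()

# reshape(-1, 1) turns it into a column vector with one feature.
col_2d = col_1d.reshape(-1, 1)

# Selecting with a list of names keeps the data 2-D from the start.
frame_2d = df[["B"]].to_numpy()

print(col_1d.shape, col_2d.shape, frame_2d.shape)
```

Selecting with double brackets, df[["B"]], is probably the least error-prone of the three when the result is meant to be a feature matrix.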
3kstc about 6 years
It was a sad day when they removed pandas.stats 💔
-
CPBL about 6 years
@RomanPekar: ... Sorry, but do standards allow making your removal comment at the top of your answer larger and bolder? :) Or moving it below the grey line. My eyes kept going to the "It's possible" part...
-
3pitt almost 6 years
Attempting to use this formula() approach throws the type error TypeError: __init__() missing 1 required positional argument: 'endog', so I guess it's deprecated. Also, ols is now OLS.
-
Lucas H about 5 years
As others mention, sm.ols has been deprecated in favor of sm.OLS. The default behavior is also different. To run a regression from formula as done here, you need to do:
result = sm.OLS.from_formula(formula="A ~ B + C", data=df).fit()
-
Bill about 5 years
As far as I can tell, as of Mar 2019, this is the only working example of doing a regression from a pandas DataFrame on the entire Internet.
-
Golden Lion over 3 years
There is a strange data item for C: 42523. It is an outlier. It should probably be removed or imputed with the average computed without 42523.