Multiple linear regression in Python


Solution 1

sklearn.linear_model.LinearRegression will do it:

from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit([[getattr(t, 'x%d' % i) for i in range(1, 8)] for t in texts],
        [t.y for t in texts])

Then clf.coef_ will have the regression coefficients.

sklearn.linear_model also provides similar interfaces for various kinds of regularized regression (e.g., Ridge and Lasso).
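
A minimal self-contained sketch (the X and y here are synthetic placeholders, not the question's texts):

import numpy as np
from sklearn import linear_model

X = np.random.rand(9, 7)  # 9 observations, 7 predictors, mirroring the question's shape
y = np.random.rand(9)

clf = linear_model.LinearRegression()  # fits an intercept by default
clf.fit(X, y)
print(clf.coef_)       # one coefficient per predictor
print(clf.intercept_)  # the constant term

# Regularized variants share the same fit/predict interface:
ridge = linear_model.Ridge(alpha=1.0).fit(X, y)
lasso = linear_model.Lasso(alpha=0.1).fit(X, y)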

Solution 2

Here is a small workaround that I created. I checked it against R and it works correctly.

import numpy as np
import statsmodels.api as sm

y = [1,2,3,4,3,4,5,4,5,5,4,5,4,5,4,5,6,5,4,5,4,3,4]

x = [
     [4,2,3,4,5,4,5,6,7,4,8,9,8,8,6,6,5,5,5,5,5,5,5],
     [4,1,2,3,4,5,6,7,5,8,7,8,7,8,7,8,7,7,7,7,7,6,5],
     [4,1,2,5,6,7,8,9,7,8,7,8,7,7,7,7,7,7,6,6,4,4,4]
     ]

def reg_m(y, x):
    # Note: predictors are stacked right-to-left, so the columns of X end up
    # in reverse order; in the summary below, x1 is the last list in x and
    # x3 is the first. add_constant is effectively a no-op here because the
    # ones column already provides the intercept.
    ones = np.ones(len(x[0]))
    X = sm.add_constant(np.column_stack((x[0], ones)))
    for ele in x[1:]:
        X = sm.add_constant(np.column_stack((ele, X)))
    results = sm.OLS(y, X).fit()
    return results

Result:

print(reg_m(y, x).summary())

Output:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.535
Model:                            OLS   Adj. R-squared:                  0.461
Method:                 Least Squares   F-statistic:                     7.281
Date:                Tue, 19 Feb 2013   Prob (F-statistic):            0.00191
Time:                        21:51:28   Log-Likelihood:                -26.025
No. Observations:                  23   AIC:                             60.05
Df Residuals:                      19   BIC:                             64.59
Df Model:                           3                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.2424      0.139      1.739      0.098        -0.049     0.534
x2             0.2360      0.149      1.587      0.129        -0.075     0.547
x3            -0.0618      0.145     -0.427      0.674        -0.365     0.241
const          1.5704      0.633      2.481      0.023         0.245     2.895

==============================================================================
Omnibus:                        6.904   Durbin-Watson:                   1.905
Prob(Omnibus):                  0.032   Jarque-Bera (JB):                4.708
Skew:                          -0.849   Prob(JB):                       0.0950
Kurtosis:                       4.426   Cond. No.                         38.6

pandas provides a convenient way to run OLS as given in this answer:

Run an OLS regression with Pandas Data Frame
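
For illustration, a hedged sketch of that approach using statsmodels' formula API with a pandas DataFrame, reusing the y and x defined above (the DataFrame column names are my own):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({'y': y, 'x1': x[0], 'x2': x[1], 'x3': x[2]})

# The formula interface adds the intercept automatically
result = smf.ols('y ~ x1 + x2 + x3', data=df).fit()
print(result.summary())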

Solution 3

Just to clarify, the example you gave is multiple linear regression, not multivariate linear regression. The difference:

The very simplest case of a single scalar predictor variable x and a single scalar response variable y is known as simple linear regression. The extension to multiple and/or vector-valued predictor variables (denoted with a capital X) is known as multiple linear regression, also known as multivariable linear regression. Nearly all real-world regression models involve multiple predictors, and basic descriptions of linear regression are often phrased in terms of the multiple regression model. Note, however, that in these cases the response variable y is still a scalar. Another term multivariate linear regression refers to cases where y is a vector, i.e., the same as general linear regression. The difference between multivariate linear regression and multivariable linear regression should be emphasized as it causes much confusion and misunderstanding in the literature.

In short:

  • multiple linear regression: the response y is a scalar.
  • multivariate linear regression: the response y is a vector.

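To make the distinction concrete: sklearn's LinearRegression handles both cases, and the shape of coef_ shows which one you ran. A small sketch with synthetic data:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.random.rand(20, 3)        # 20 samples, 3 predictors

y_scalar = np.random.rand(20)    # multiple regression: y is a scalar per sample
print(LinearRegression().fit(X, y_scalar).coef_.shape)  # (3,)

y_vector = np.random.rand(20, 2) # multivariate regression: y is a vector per sample
print(LinearRegression().fit(X, y_vector).coef_.shape)  # (2, 3)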

Solution 4

You can use numpy.linalg.lstsq:

import numpy as np

y = np.array([-6, -5, -10, -5, -8, -3, -6, -8, -8])
X = np.array(
    [
        [-4.95, -4.55, -10.96, -1.08, -6.52, -0.81, -7.01, -4.46, -11.54],
        [-5.87, -4.52, -11.64, -3.36, -7.45, -2.36, -7.33, -7.65, -10.03],
        [-0.76, -0.71, -0.98, 0.75, -0.86, -0.50, -0.33, -0.94, -1.03],
        [14.73, 13.74, 15.49, 24.72, 16.59, 22.44, 13.93, 11.40, 18.18],
        [4.02, 4.47, 4.18, 4.96, 4.29, 4.81, 4.32, 4.43, 4.28],
        [0.20, 0.16, 0.19, 0.16, 0.10, 0.15, 0.21, 0.16, 0.21],
        [0.45, 0.50, 0.53, 0.60, 0.48, 0.53, 0.50, 0.49, 0.55],
    ]
)
X = X.T  # transpose so input vectors are along the rows
X = np.c_[X, np.ones(X.shape[0])]  # add bias term
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat)

Result:

[ -0.49104607   0.83271938   0.0860167    0.1326091    6.85681762  22.98163883 -41.08437805 -19.08085066]

You can see the estimated output with:

print(np.dot(X,beta_hat))

Result:

[ -5.97751163,  -5.06465759, -10.16873217,  -4.96959788,  -7.96356915,  -3.06176313,  -6.01818435,  -7.90878145,  -7.86720264]
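
np.linalg.lstsq also returns the residual sum of squares, the rank of X, and its singular values, which are worth inspecting:

beta_hat, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(residuals)  # sum of squared residuals (empty if X is rank-deficient)
print(rank)       # effective rank of X; should equal X.shape[1] here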

Solution 5

Use scipy.optimize.curve_fit. It works for nonlinear fits as well as linear ones.

from scipy.optimize import curve_fit
import numpy as np

def fn(x, a, b, c):
    return a + b*x[0] + c*x[1]

# y(x0,x1) data:
#    x0=0 1 2
# ___________
# x1=0 |0 1 2
# x1=1 |1 2 3
# x1=2 |2 3 4

x = np.array([[0,1,2,0,1,2,0,1,2], [0,0,0,1,1,1,2,2,2]])
y = np.array([0,1,2,1,2,3,2,3,4])
popt, pcov = curve_fit(fn, x, y)
print(popt)
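
As a hedged illustration of a nonlinear fit (the exponential model and the data here are made up):

def fn_exp(x, a, b):
    return a * np.exp(b * x)

x1 = np.linspace(0, 1, 20)
y1 = 2.0 * np.exp(0.5 * x1)  # synthetic data generated with a=2, b=0.5

popt, pcov = curve_fit(fn_exp, x1, y1, p0=[1.0, 1.0])
print(popt)  # recovers approximately [2.0, 0.5]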

Comments

  • Zach almost 2 years

I can't seem to find any Python libraries that do multiple regression. The only ones I find do only simple regression. I need to regress my dependent variable (y) against several independent variables (x1, x2, x3, etc.).

    For example, with this data:

print('y        x1      x2       x3       x4      x5     x6       x7')
    for t in texts:
        print("{:>7.1f}{:>10.2f}{:>9.2f}{:>9.2f}{:>10.2f}{:>7.2f}{:>7.2f}{:>9.2f}"
              .format(t.y, t.x1, t.x2, t.x3, t.x4, t.x5, t.x6, t.x7))
    

    (output for above:)

          y        x1       x2       x3        x4     x5     x6       x7
       -6.0     -4.95    -5.87    -0.76     14.73   4.02   0.20     0.45
       -5.0     -4.55    -4.52    -0.71     13.74   4.47   0.16     0.50
      -10.0    -10.96   -11.64    -0.98     15.49   4.18   0.19     0.53
       -5.0     -1.08    -3.36     0.75     24.72   4.96   0.16     0.60
       -8.0     -6.52    -7.45    -0.86     16.59   4.29   0.10     0.48
       -3.0     -0.81    -2.36    -0.50     22.44   4.81   0.15     0.53
       -6.0     -7.01    -7.33    -0.33     13.93   4.32   0.21     0.50
       -8.0     -4.46    -7.65    -0.94     11.40   4.43   0.16     0.49
       -8.0    -11.54   -10.03    -1.03     18.18   4.28   0.21     0.55
    

How would I regress these in Python, to get the linear regression formula:

    Y = a1x1 + a2x2 + a3x3 + a4x4 + a5x5 + a6x6 + a7x7 + c

    • BasedRebel almost 12 years
      Not an expert, but if the variables are independent, can't you just run a simple regression against each and sum the results?
    • hlin117 about 9 years
      @HughBothwell You can't assume that the variables are independent though. In fact, if you're assuming that the variables are independent, you may potentially be modeling your data incorrectly. In other words, the responses Y may be correlated with each other, but assuming independence does not accurately model the dataset.
    • Charlie Parker over 6 years
      @HughBothwell Sorry if this is a dumb question, but why does it matter if the raw feature variables x_i are independent or not? How does that affect the predictor (=model)?
  • Zach almost 12 years
    This returns an error with certain inputs. Any other solutions available?
  • Zach almost 12 years
    How can you use this to get the coefficients of a multivariate regression? I only see how to do a simple regression... and don't see how to get the coefficients.
  • user961627 about 10 years
    @Dougal can sklearn.linear_model.LinearRegression be used for weighted multivariate regression as well?
  • cd98 over 9 years
    The reg_m function is unnecessarily complicated. x = np.array(x).T, x = sm.add_constant(x) and results = sm.OLS(endog=y, exog=x).fit() is enough.
  • Imran over 9 years
    To fit a constant term: clf = linear_model.LinearRegression(fit_intercept=True)
  • Huanian Zhang about 8 years
    Follow-up: do you know how to get the confidence level using sklearn.linear_model.LinearRegression? Thanks.
  • Huanian Zhang about 8 years
    This is a nice tool. Just one question: in this case, the t value is outside the 95.5% confidence interval, so does that mean this fitting is not accurate at all, or how do you explain it?
  • Danica about 8 years
    @HuanianZhang what do you mean by confidence level? If you want the coefficient of determination, the score method will do it; sklearn.metrics has some other model evaluation criteria. If you want the stuff like in Akavall's answer, statsmodels has some more R-like diagnostics.
  • Huanian Zhang about 8 years
    @Dougal Thank you very much. I want the stuff like in Akavall's answer, but I cannot find such sklearn functions. Another question: I used the above data, and the regression coefficients from your method are totally different from Akavall's; do you know why that is?
  • Danica about 8 years
    @HuanianZhang I just checked that I got the same regression coefficients using this approach, Akavall's, and canary_in_the_data_mine's. They all agreed.
  • Huanian Zhang about 8 years
    @Dougal Sorry, it is my bad. It depends on fit_intercept=True or False. I changed it to False, which is why I got totally different coefficients. Thank you so much. Have a nice day.
  • sophiadw about 8 years
    Just noticed that your x1, x2, x3 are in reverse order in your original predictor list, i.e., x = [x3, x2, x1]?
  • jxn over 7 years
    This notebook is awesome. It shows how to regress multiple independent variables (x1, x2, x3...) on Y with just 3 lines of code using scikit-learn.
  • dd90p over 7 years
    May I know the difference between print(np.dot(X, beta_hat)) and mod_wls = sm.WLS(y, X, weights=weights); res = mod_wls.fit(); predsY = res.predict()? They all return the Y result.
  • Akavall over 6 years
    This might be useful information, but I don't see how it answers the question.
  • Franck Dernoncourt over 6 years
    @Akavall using the correct terminology is the first step to find an answer.
  • BlueCoder about 6 years
    @user961627 clf.fit accepts weights as well: fit(self, X, y, sample_weight=None)
  • Jai Prakash almost 6 years
    @canary_in_the_data_mine Thanks for the notebook. How can I plot a linear regression that has multiple features? I couldn't find it in the notebook. Any pointers will be greatly appreciated. -- Thanks
  • alwaysaskingquestions almost 6 years
    @FranckDernoncourt but OP's Y value IS a vector?
  • Ashrith Reddy almost 6 years
    @sophiadw You can just add x = x[::-1] within the function definition to get them in the right order.
  • bluedroid over 5 years
    Does it add the intercept automatically, or do we have to add the intercept by passing smf.add_intercept() as a parameter to ols()?
  • Sam Mason about 5 years
    @HuanianZhang "t value" is just how many standard deviations the coefficient is away from zero, while 95%CI is approximately coef +- 2 * std err (actually the Student-t distribution parameterized by degrees of freedom in the residuals). i.e. larger absolute t values imply CIs further from zero, but they shouldn't be directly compared. clarification is a bit late, but hope it's useful to somebody
  • Mac over 4 years
    @FranckDernoncourt: "using the correct terminology is the first step to find an answer". Great, so we both can agree: this, in and of itself, is not actually an answer. Users should be able to solve their problem directly from answers without having to resort to looking up other resources.