Scikit-Learn: Std.Error, p-Value from LinearRegression


Here, reg is the fitted LinearRegression model returned by scikit-learn's fit method.

To calculate adjusted R²:

def adjustedR2(x, y, reg):
  r2 = reg.score(x,y)
  n = x.shape[0]
  p = x.shape[1]
  adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
  return adjusted_r2
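
For example, a minimal usage sketch on synthetic data (make_regression is only used here to create a toy x and y):

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

x, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
reg = LinearRegression().fit(x, y)
print(adjustedR2(x, y, reg))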

And for p-values:

from sklearn.feature_selection import f_regression

freg = f_regression(x, y)
p = freg[1]
print(p.round(3))
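
Note that f_regression reports a univariate F-test p-value for each feature taken on its own, so these numbers will generally not match the per-coefficient t-test p-values in a statsmodels OLS summary. If you need standard errors and p-values that line up with statsmodels, here is a minimal sketch (coef_se_pvalues is a hypothetical helper, assuming x is the (n, p) feature array, y the target, and reg the fitted LinearRegression from above):

import numpy as np
from scipy import stats

def coef_se_pvalues(x, y, reg):
  # design matrix with an intercept column, like statsmodels' add_constant
  n, p = x.shape
  X1 = np.column_stack([np.ones(n), np.asarray(x)])
  # residual variance with the unbiased divisor n - p - 1
  # (sklearn.metrics.mean_squared_error divides by n, which is one source of mismatch)
  resid = np.asarray(y) - reg.predict(x)
  sigma2 = resid @ resid / (n - p - 1)
  # standard errors: square roots of the diagonal of sigma2 * (X'X)^-1
  se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X1.T @ X1)))
  # t statistics and two-sided p-values for [intercept, coef_1, ..., coef_p]
  params = np.concatenate([[reg.intercept_], reg.coef_])
  t = params / se
  pvals = 2 * stats.t.sf(np.abs(t), df=n - p - 1)
  return se, pvals

se, pvals = coef_se_pvalues(x, y, reg)
print(se.round(3))
print(pvals.round(3))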
Author: user6083088, updated on June 18, 2022

Comments

  • user6083088 almost 2 years

    I've been trying to get the standard errors & p-values using LinearRegression from scikit-learn, but with no success.

    I ended up finding this article, but the standard errors & p-values it reports do not match those from the statsmodels.api OLS method:

    import numpy as np
    from sklearn import datasets
    from sklearn import linear_model
    import regressor
    import statsmodels.api as sm


    boston = datasets.load_boston()
    which_betas = np.ones(13, dtype=bool)
    which_betas[3] = False              # drop the CHAS dummy variable
    X = boston.data[:, which_betas]
    y = boston.target

    # scikit-learn + regressor stats
    ols = linear_model.LinearRegression()
    ols.fit(X, y)

    xlabels = boston.feature_names[which_betas]
    regressor.summary(ols, X, y, xlabels)


    # statsmodels
    x2 = sm.add_constant(X)
    models = sm.OLS(y, x2)
    result = models.fit()
    print(result.summary())
    

    Output as follows:

    Residuals:
    Min      1Q  Median      3Q      Max
    -26.3743 -1.9207  0.6648  2.8112  13.3794
    
    
    Coefficients:
                 Estimate  Std. Error  t value   p value
    _intercept  36.925033    4.915647   7.5117  0.000000
    CRIM        -0.112227    0.031583  -3.5534  0.000416
    ZN           0.047025    0.010705   4.3927  0.000014
    INDUS        0.040644    0.055844   0.7278  0.467065
    NOX        -17.396989    3.591927  -4.8434  0.000002
    RM           3.845179    0.272990  14.0854  0.000000
    AGE          0.002847    0.009629   0.2957  0.767610
    DIS         -1.485557    0.180530  -8.2289  0.000000
    RAD          0.327895    0.061569   5.3257  0.000000
    TAX         -0.013751    0.001055 -13.0395  0.000000
    PTRATIO     -0.991733    0.088994 -11.1438  0.000000
    B            0.009827    0.001126   8.7256  0.000000
    LSTAT       -0.534914    0.042128 -12.6973  0.000000
    ---
    R-squared:  0.73547,    Adjusted R-squared:  0.72904
    F-statistic: 114.23 on 12 features
                                OLS Regression Results                            
    ==============================================================================
    Dep. Variable:                      y   R-squared:                       0.735
    Model:                            OLS   Adj. R-squared:                  0.729
    Method:                 Least Squares   F-statistic:                     114.2
    Date:                Sun, 21 Aug 2016   Prob (F-statistic):          7.59e-134
    Time:                        21:56:26   Log-Likelihood:                -1503.8
    No. Observations:                 506   AIC:                             3034.
    Df Residuals:                     493   BIC:                             3089.
    Df Model:                          12                                         
    Covariance Type:            nonrobust                                         
    ==============================================================================
                     coef    std err          t      P>|t|      [95.0% Conf. Int.]
    ------------------------------------------------------------------------------
    const         36.9250      5.148      7.173      0.000        26.811    47.039
    x1            -0.1122      0.033     -3.405      0.001        -0.177    -0.047
    x2             0.0470      0.014      3.396      0.001         0.020     0.074
    x3             0.0406      0.062      0.659      0.510        -0.081     0.162
    x4           -17.3970      3.852     -4.516      0.000       -24.966    -9.828
    x5             3.8452      0.421      9.123      0.000         3.017     4.673
    x6             0.0028      0.013      0.214      0.831        -0.023     0.029
    x7            -1.4856      0.201     -7.383      0.000        -1.881    -1.090
    x8             0.3279      0.067      4.928      0.000         0.197     0.459
    x9            -0.0138      0.004     -3.651      0.000        -0.021    -0.006
    x10           -0.9917      0.131     -7.547      0.000        -1.250    -0.734
    x11            0.0098      0.003      3.635      0.000         0.005     0.015
    x12           -0.5349      0.051    -10.479      0.000        -0.635    -0.435
    ==============================================================================
    Omnibus:                      190.837   Durbin-Watson:                   1.015
    Prob(Omnibus):                  0.000   Jarque-Bera (JB):              897.143
    Skew:                           1.619   Prob(JB):                    1.54e-195
    Kurtosis:                       8.663   Cond. No.                     1.51e+04
    ==============================================================================
    
    Warnings:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
    [2] The condition number is large, 1.51e+04. This might indicate that there are
    strong multicollinearity or other numerical problems.
    

    I've also found the following articles.

    Neither of the code snippets in the SO link compiles.

    Here is the code & data that I'm working on, but I have not been able to find the standard errors & p-values:

    import pandas as pd
    import statsmodels.api as sm
    import numpy as np
    import scipy
    from sklearn.linear_model import LinearRegression
    from sklearn import metrics 
    
    
    def readFile(filename, sheetname):
        xlsx = pd.ExcelFile(filename)
        data = xlsx.parse(sheetname, skiprows=1)
        return data
    
    
    def lr_statsmodel(X,y):
        X = sm.add_constant(X)
        model = sm.OLS(y,X)
        results = model.fit()
        print (results.summary())
    
    
    def lr_scikit(X,y,featureCols):
        model = LinearRegression()
        results = model.fit(X,y)
    
        predictions =  results.predict(X)
    
        print('Coefficients')
        print('Intercept\t', results.intercept_)
        df = pd.DataFrame(list(zip(featureCols, results.coef_)))
        print(df.to_string(index=False, header=False))
    
        # Query:: The numbers match the Excel OLS output, but I'm skeptical about treating score as R-squared
        rSquare = results.score(X, y)
        print('\nR-Square::', rSquare)
    
        # This looks like a better option
        # source: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score
        r2 = metrics.r2_score(y,results.predict(X))
        print('r2', r2)
    
        # Query: No clue at all! http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics 
        print('Rsquared?!', metrics.explained_variance_score(y, results.predict(X)))
        # INFO:: All three of them are providing the same figures!     
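        # (This is expected: for a linear model with an intercept, evaluated on its own
        #  training data, the residuals have zero mean, so score, r2_score and
        #  explained_variance_score all reduce to the same quantity.)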
    
    
        # Adj-Rsquare formula @ https://www.easycalculation.com/statistics/learn-adjustedr2.php
        # In ML we don't use all of the data for training, so it's unusual to find Adj-Rsquared reported; hence the manual calculation
        N = X.shape[0]
        p = X.shape[1]
        adjRsquare = 1 - ((1 - rSquare) * (N - 1) / (N - p - 1))
        print("Adjusted R-Square::", adjRsquare)
    
        # calculate standard errors
        # mean_absolute_error
        # mean_squared_error
        # median_absolute_error 
        # r2_score
        # explained_variance_score
        mse = metrics.mean_squared_error(y,results.predict(X))
        print(mse)
        print('Residual Standard Error:', np.sqrt(mse))
    
        # OLS in Matrix : https://github.com/nsh87/regressors/blob/master/regressors/stats.py
        n = X.shape[0]
        X1 = np.hstack((np.ones((n, 1)), np.matrix(X)))    
        se_matrix = scipy.linalg.sqrtm(
            metrics.mean_squared_error(y, results.predict(X)) *
            np.linalg.inv(X1.T * X1)
        )
        print('se', np.diagonal(se_matrix))
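        # Note: metrics.mean_squared_error divides by n, whereas statsmodels' OLS uses the
        # residual degrees of freedom n - p - 1, which is one reason these standard errors
        # do not match the statsmodels summary.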
    
    #    https://github.com/nsh87/regressors/blob/master/regressors/stats.py
    #    http://regressors.readthedocs.io/en/latest/usage.html
    
        y_hat = results.predict(X)
        sse = np.sum((y_hat - y) ** 2)
        print('Standard Square Error of the Model:', sse)
    
    
    
    
    if __name__ == '__main__':
    
        # read file 
        fileData = readFile('Linear_regression.xlsx','Input Data')
    
        # list of independent variables 
        feature_cols = ['Price per week','Population of city','Monthly income of riders','Average parking rates per month']
    
        # build dependent & independent data set 
        X = fileData[feature_cols]
        y = fileData['Number of weekly riders']
    
        # Statsmodel - OLS 
    #    lr_statsmodel(X,y)
    
        # ScikitLearn - OLS 
        lr_scikit(X,y,feature_cols)
    

    My data-set

    City   Y: Number of weekly riders   X1: Price per week   X2: Population of city   X3: Monthly income of riders   X4: Average parking rates per month
    1   1,92,000    $15     18,00,000   $5,800  $50
    2   1,90,400    $15     17,90,000   $6,200  $50
    3   1,91,200    $15     17,80,000   $6,400  $60
    4   1,77,600    $25     17,78,000   $6,500  $60
    5   1,76,800    $25     17,50,000   $6,550  $60
    6   1,78,400    $25     17,40,000   $6,580  $70
    7   1,80,800    $25     17,25,000   $8,200  $75
    8   1,75,200    $30     17,25,000   $8,600  $75
    9   1,74,400    $30     17,20,000   $8,800  $75
    10  1,73,920    $30     17,05,000   $9,200  $80
    11  1,72,800    $30     17,10,000   $9,630  $80
    12  1,63,200    $40     17,00,000   $10,570 $80
    13  1,61,600    $40     16,95,000   $11,330 $85
    14  1,61,600    $40     16,95,000   $11,600 $100
    15  1,60,800    $40     16,90,000   $11,800 $105
    16  1,59,200    $40     16,30,000   $11,830 $105
    17  1,48,800    $65     16,40,000   $12,650 $105
    18  1,15,696    $102    16,35,000   $13,000 $110
    19  1,47,200    $75     16,30,000   $13,224 $125
    20  1,50,400    $75     16,20,000   $13,766 $130
    21  1,52,000    $75     16,15,000   $14,010 $150
    22  1,36,000    $80     16,05,000   $14,468 $155
    23  1,26,240    $86     15,90,000   $15,000 $165
    24  1,23,888    $98     15,95,000   $15,200 $175
    25  1,26,080    $87     15,90,000   $15,600 $175
    26  1,51,680    $77     16,00,000   $16,000 $190
    27  1,52,800    $63     16,10,000   $16,200 $200
    

    I've exhausted all my options and everything I could make sense of, so any guidance on how to compute standard errors & p-values that match the statsmodels.api output is appreciated.

    EDIT: I'm trying to find the standard errors & p-values for the intercept and all of the independent variables.

    • Kshitij almost 6 years
      Were you able to understand the difference here? Could this be because you are using a package outside of sklearn? You are using regressor for the computations on top of sklearn's results, and it may do them differently than statsmodels does.
    • irene over 5 years
      Old thread, but I also encountered the same problem. regressors uses the built-in method sklearn.metrics.mean_squared_error to compute the MSE (which is used to compute the p-values), but this uses a divisor of n instead of n-p, where n is the sample size and p is the number of features. This causes a discrepancy with statsmodels in cases where p is not small compared to n.
    • Don Quixote almost 5 years
      @irene What are the scenarios where one method of calculation would be better than the other?
    • irene almost 5 years
      @DonQuixote statsmodels works better since it uses the divisor n-p to give an unbiased estimate. If n is very large, then you should get comparable results.