How is the R^2 value in scikit-learn calculated?

python machine-learning statistics scikit-learn

36,869

Solution 1

The R^2 in scikit-learn is essentially the same as what is described in the Wikipedia article on the coefficient of determination (search for "the most general definition"). It is 1 - (residual sum of squares) / (total sum of squares).
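As a quick check of that formula, here is a minimal sketch that computes R^2 by hand with NumPy and compares it against `sklearn.metrics.r2_score`. The helper name `r2_manual` and the sample values are made up for illustration and are not part of scikit-learn.

```python
import numpy as np
from sklearn.metrics import r2_score


def r2_manual(y_true, y_pred):
    """Hypothetical helper: R^2 = 1 - RSS / TSS."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Residual sum of squares: squared errors of the predictions.
    rss = np.sum((y_true - y_pred) ** 2)
    # Total sum of squares: squared deviations from the mean of y_true.
    tss = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - rss / tss


# Illustrative values only.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

manual = r2_manual(y_true, y_pred)
library = r2_score(y_true, y_pred)
print(manual, library)  # both are roughly 0.9486
```

Note that the mean used for TSS is the mean of the `y_true` values being scored, not the mean of whatever data the model was trained on.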

The big difference between a classical statistics setting and what you usually do in machine learning is that in machine learning you evaluate the score on unseen data, which can produce negative values (R^2 can never exceed 1, but it has no lower bound). If you compute R^2 on the same data you used to fit the model, it will lie within [0, 1], at least for an ordinary least-squares linear model with an intercept.
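To illustrate the difference between scoring on the training data and scoring on unseen data, here is a small sketch using `LinearRegression`. The data is deliberately contrived (the held-out points do not follow the trend of the training points), so treat it as a demonstration of the effect rather than a realistic benchmark.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data with a roughly increasing trend.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([1.0, 3.0, 2.0, 5.0])

# Made-up "unseen" data that does not continue that trend.
X_test = np.array([[5.0], [6.0], [7.0]])
y_test = np.array([3.0, 2.0, 3.0])

model = LinearRegression().fit(X_train, y_train)

# For regressors, .score() returns R^2 on the data it is given.
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)

print("R^2 on training data:", train_r2)  # between 0 and 1
print("R^2 on unseen data:  ", test_r2)   # negative for this data
```

On the training data, least squares with an intercept can never do worse than predicting the training mean, so the score stays in [0, 1]. No such guarantee holds for the held-out points.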

Solution 2

Since R^2 = 1 - RSS/TSS, R^2 is negative only when RSS/TSS > 1, that is, when our model is even worse than the baseline model that always predicts the mean of the observed values.

Here RSS = the sum of squared differences between the actual values (yi) and the predicted values (yi^), and TSS = the sum of squared differences between the actual values (yi) and their mean (that is, the error before applying regression). So you can think of TSS as the error of the mean-only baseline and RSS as the error of our model. A perfect model has RSS = 0, and a useful model lies somewhere between that and the baseline, in which case RSS/TSS < 1. If our model is even worse than the mean baseline, then RSS > TSS, because its predictions are, in the squared-error sense, farther from the actual observations than the mean is.
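The three cases (perfect model, mean baseline, worse than the mean) can be sketched with plain NumPy. The values and the helper name `r2_from_predictions` below are made up for illustration.

```python
import numpy as np

# Illustrative observations; their mean is 3.0.
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# TSS: squared error of the mean-only baseline.
tss = np.sum((y - y.mean()) ** 2)


def r2_from_predictions(y_pred):
    """Hypothetical helper: R^2 = 1 - RSS / TSS for the fixed y above."""
    rss = np.sum((y - y_pred) ** 2)
    return 1.0 - rss / tss


# Perfect predictions: RSS = 0, so R^2 = 1.
r2_perfect = r2_from_predictions(y)

# Always predicting the mean: RSS = TSS, so R^2 = 0.
r2_mean = r2_from_predictions(np.full_like(y, y.mean()))

# Always predicting 10.0, far from every observation: RSS > TSS, so R^2 < 0.
r2_bad = r2_from_predictions(np.full_like(y, 10.0))

print(r2_perfect, r2_mean, r2_bad)  # 1.0 0.0 -24.5
```

So a negative score does not mean anything is squared incorrectly; it only means the predictions have a larger squared error than the constant mean prediction would.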

Check here for better intuition with visual representation: https://ragrawal.wordpress.com/2017/05/06/intuition-behind-r2-and-other-regression-evaluation-metrics/


Author by

joeally

Updated on July 25, 2022

Comments

joeally almost 2 years

The R^2 value returned by scikit-learn (metrics.r2_score()) can be negative. The docs say:

"Unlike most other scores, R² score may be negative (it need not actually be the square of a quantity R)."

However, the Wikipedia article on R^2 mentions no R (not squared) quantity. Perhaps it uses absolute differences instead of squared differences. I really have no idea.