How does LassoCV in scikit-learn partition data?

python scikit-learn regression cross-validation

10,610

If you pass a sklearn.linear_model.LassoCV object to sklearn.cross_validation.cross_val_score, you are performing nested cross-validation. cross_val_score divides your data into train and test sets according to the folds you specify (for example with an object such as sklearn.cross_validation.KFold). Each train set is passed to LassoCV, which splits it again internally in order to choose the penalty alpha. This seems to be the setting you are looking for.

import numpy as np
from sklearn.cross_validation import KFold, cross_val_score
from sklearn.linear_model import LassoCV

X = np.random.randn(20, 10)
y = np.random.randn(len(X))

cv_outer = KFold(len(X), n_folds=5)
lasso = LassoCV(cv=3)  # cv=3 makes a KFold inner splitting with 3 folds

scores = cross_val_score(lasso, X, y, cv=cv_outer)

Answer: no, LassoCV will not do all the work for you; you have to use it in conjunction with cross_val_score to obtain what you want. This is also a reasonable way to implement such objects, since we may only want to fit a hyperparameter-optimized LassoCV without evaluating it on another set of held-out data.
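For readers on a recent scikit-learn release, where sklearn.model_selection takes the place of sklearn.cross_validation and KFold takes n_splits instead of n_folds, the same nested setup might look roughly like the sketch below. The fixed random seed is an assumption added here only to make the run repeatable.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.RandomState(0)  # assumed seed, for repeatability only
X = rng.randn(20, 10)
y = rng.randn(len(X))

# Outer loop: 5 folds, used only to estimate performance on unseen data.
cv_outer = KFold(n_splits=5)
# Inner loop: LassoCV chooses alpha with its own 3-fold split of each outer train set.
lasso = LassoCV(cv=3)

# One score (R^2 by default for regressors) per outer fold.
scores = cross_val_score(lasso, X, y, cv=cv_outer)
print(scores)
```

Since X and y are pure noise here, the scores themselves are not meaningful; the point is only the structure of the two loops.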

Author by

Sirrah

Updated on June 14, 2022

Comments

Sirrah almost 2 years

I am performing linear regression using the Lasso method in sklearn.

According to their guidance, and what I have seen elsewhere, instead of simply cross-validating on all of the training data, it is advised to split the data into the more traditional training set / validation set partitions.

The Lasso is thus trained on the training set, and the hyperparameter alpha is then tuned on the basis of cross-validation results on the validation set. Finally, the accepted model is applied to the test set to give a realistic view of how it will perform in practice. Separating the concerns here is a preventative measure against overfitting.

Actual Question

Does LassoCV conform to the above protocol, or does it somehow train the model parameters and hyperparameters on the same data and/or during the same rounds of CV?

Thanks.
Oliver Angelil over 6 years

Just to confirm: the only purpose of the inner splitting is to select the "best" hyper-parameter alpha in LassoCV? And if the model is not present in this list, then the recommended way to do hyper-parameter tuning (say for SVR) is with GridSearchCV or RandomizedSearchCV? So the outer CV does not improve the model, but rather just checks how it performs on never-seen-before data? If one were using simple multiple linear regression (no hyper-parameters), the model could not be tuned for general performance?
eickenberg over 6 years

Affirmative to all these questions. For the last question: a way to tweak the model is by including columns/features or not. If you use a sklearn.pipeline.Pipeline, you could prepend a feature selector, e.g. sklearn.feature_selection.SelectKBest, to your OLS in a pipeline and use this pipeline in GridSearchCV, where the latter checks different values of k.
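A minimal sketch of that pipeline idea, assuming the current sklearn.model_selection API. The synthetic data, the step names "select" and "ols", the f_regression score function and the grid of k values are all illustrative choices, not anything prescribed above.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)
X = rng.randn(60, 10)
# Hypothetical target: only the first two columns carry signal.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.randn(60)

# Feature selection followed by ordinary least squares.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_regression)),
    ("ols", LinearRegression()),
])

# The number of kept features k is the only thing being tuned.
param_grid = {"select__k": [1, 2, 5, 10]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
```

Because the selector sits inside the pipeline, it is refit on each training fold, so the choice of features never sees the fold being scored.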
Austin over 6 years

Hey, after you nest LassoCV inside cross_val_score and run it on the training set, is there any way to inspect the fitted parameter(s) to re-run them on the test set?
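One possible way to look at what each outer fold selected is cross_validate with return_estimator=True, which is available in recent scikit-learn releases and keeps the fitted estimator of every fold. The sketch below uses made-up data; note that each outer fold may choose a different alpha, so refitting LassoCV on the whole training set, as in the last lines, is just one common way to end up with a single model for a test set.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold, cross_validate

rng = np.random.RandomState(0)
X = rng.randn(40, 10)
y = X[:, 0] + 0.5 * rng.randn(40)  # hypothetical target

# Keep the LassoCV object fitted on each outer training fold.
results = cross_validate(
    LassoCV(cv=3), X, y, cv=KFold(n_splits=5), return_estimator=True
)

# alpha_ is the penalty each fitted LassoCV selected in its inner CV.
alphas = [est.alpha_ for est in results["estimator"]]
print(alphas)

# A single model for later use on held-out test data:
# refit on all of the training data.
final_model = LassoCV(cv=3).fit(X, y)
print(final_model.alpha_, final_model.coef_)
```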
Austin over 6 years

Also, if you want to use RMSE scoring should you use cross_val_score+Lasso+GridSearchCV instead of cross_val_score+LassoCV for nested cross-validation?
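For reference, the cross_val_score + GridSearchCV + Lasso arrangement could be sketched as below. It uses the "neg_mean_squared_error" scorer and takes the square root afterwards to report RMSE; the alpha grid and the data are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(40, 10)
y = X[:, 0] + 0.5 * rng.randn(40)  # hypothetical target

# Illustrative grid of penalties for the inner search.
alpha_grid = {"alpha": np.logspace(-3, 0, 10)}

# Inner loop: pick alpha by mean squared error on 3 folds.
inner = GridSearchCV(Lasso(), alpha_grid, cv=3, scoring="neg_mean_squared_error")

# Outer loop: score the whole search procedure on 5 folds.
neg_mse = cross_val_score(
    inner, X, y, cv=KFold(n_splits=5), scoring="neg_mean_squared_error"
)

# Scorers are "higher is better", so MSE comes back negated.
rmse = np.sqrt(-neg_mse)
print(rmse)
```

The scoring argument of GridSearchCV is where a different inner selection criterion would go.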
eickenberg almost 5 years

Note: Somebody tried to edit this post to reflect the current sklearn API, which uses the sklearn.model_selection.cross_val_score/KFold instead of sklearn.cross_validation.cross_val_score/KFold, and uses n_splits instead of n_folds. This edit was rejected by enough reviewers to reach a decision, but is not false. I'll leave the answer as is, because I think it still works (though with a deprecation warning), but if anybody wants to give editing another stab, please go ahead.