How does LassoCV in scikit-learn partition data?
If you use sklearn.cross_validation.cross_val_score with a sklearn.linear_model.LassoCV object, then you are performing nested cross-validation. cross_val_score will divide your data into train and test sets according to how you specify the folds (which can be done with objects such as sklearn.cross_validation.KFold). The train set will be passed to the LassoCV, which itself performs another splitting of the data in order to choose the right penalty. This, it seems, corresponds to the setting you are seeking.
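# Legacy API: sklearn.cross_validation was deprecated in scikit-learn 0.18 and removed in 0.20.
import numpy as np
from sklearn.cross_validation import KFold, cross_val_score
from sklearn.linear_model import LassoCV
X = np.random.randn(20, 10)
y = np.random.randn(len(X))
cv_outer = KFold(len(X), n_folds=5)  # outer folds: used only to evaluate the model
lasso = LassoCV(cv=3)  # cv=3 makes a KFold inner splitting with 3 folds
scores = cross_val_score(lasso, X, y, cv=cv_outer)  # one score per outer fold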
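On current scikit-learn releases the same nested setup can be written with sklearn.model_selection, where KFold takes n_splits instead of the number of samples and n_folds. The following is a minimal sketch of that translation; the fixed random seed is an assumption added only to make the run reproducible.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data, as in the original snippet (seed added for reproducibility).
rng = np.random.RandomState(0)
X = rng.randn(20, 10)
y = rng.randn(len(X))

# Outer loop: 5 folds used only to evaluate the tuned model.
cv_outer = KFold(n_splits=5)

# Inner loop: LassoCV selects alpha with its own 3-fold split
# of each outer training set.
lasso = LassoCV(cv=3)

# One score (R^2 by default for regressors) per outer fold.
scores = cross_val_score(lasso, X, y, cv=cv_outer)
print(scores)
```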
Answer: no, LassoCV will not do all the work for you; you have to use it in conjunction with cross_val_score to obtain what you want. This is also a reasonable way to implement such objects, since we may only be interested in fitting a hyperparameter-optimized LassoCV without necessarily evaluating it directly on another set of held-out data.
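To illustrate using LassoCV on its own, here is a small sketch that holds out a test set, lets LassoCV choose alpha by internal cross-validation on the training portion, and then scores the refitted model once on the held-out data. The synthetic data, the 75/25 split, and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

# Synthetic regression problem (an assumption for illustration):
# only the first three features carry signal.
rng = np.random.RandomState(0)
X = rng.randn(100, 10)
true_coef = np.zeros(10)
true_coef[:3] = [2.0, -1.0, 0.5]
y = X @ true_coef + 0.1 * rng.randn(100)

# Hold out a test set that LassoCV never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# LassoCV cross-validates over a grid of alphas on the training data,
# then refits on the whole training set with the selected alpha.
model = LassoCV(cv=3).fit(X_train, y_train)

best_alpha = model.alpha_                # penalty chosen by the inner CV
test_r2 = model.score(X_test, y_test)    # R^2 on the untouched test set
print(best_alpha, test_r2)
```

The selected penalty is available as model.alpha_ and the fitted coefficients as model.coef_, so the tuned model can be inspected or reused without rerunning the search.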
Sirrah
Updated on June 14, 2022
Comments
-
Sirrah almost 2 years
I am performing linear regression using the Lasso method in sklearn.
According to their guidance, and what I have seen elsewhere, instead of simply conducting cross-validation on all of the training data, it is advised to split it up into more traditional training set / validation set partitions.
The Lasso is thus trained on the training set, and then the hyperparameter alpha is tuned on the basis of cross-validation results on the validation set. Finally, the accepted model is used on the test set to give a realistic view of how it will perform in reality. Separating the concerns here is a preventative measure against overfitting.
Actual Question
Does LassoCV conform to the above protocol, or does it somehow train the model parameters and hyperparameters on the same data and/or during the same rounds of CV?
Thanks.
-
Oliver Angelil over 6 years
Just to confirm: the only purpose of the inner splitting is to select the "best" hyper-parameter C in LassoCV? And if the model is not present in this list, then the recommended way to do hyper-parameter tuning (say for SVR) is with GridSearchCV or RandomizedSearchCV? So the outer CV does not improve the model, but rather just checks how it performs on never-seen-before data? If one was using simple multiple linear regression (no hyper-parameters), the model cannot be tuned for general performance?
-
eickenberg over 6 years
Affirmative to all these questions. For the last question: a way to tweak the model is by including or excluding columns/features. If you use a sklearn.pipeline.Pipeline, you could prepend a feature selector, e.g. sklearn.feature_selection.SelectKBest, to your OLS in a pipeline and use this pipeline in GridSearchCV, where the latter checks different values for k.
-
Austin over 6 years
Hey, after you nest LassoCV inside cross_val_score and run it on the training set, is there any way to inspect the fitted parameter(s) to re-run them on the test set?
-
Austin over 6 years
Also, if you want to use RMSE scoring, should you use cross_val_score + Lasso + GridSearchCV instead of cross_val_score + LassoCV for nested cross-validation?
-
eickenberg almost 5 years
Note: Somebody tried to edit this post to reflect the current sklearn API, which uses sklearn.model_selection.cross_val_score/KFold instead of sklearn.cross_validation.cross_val_score/KFold, and uses n_splits instead of n_folds. This edit was rejected by enough reviewers to reach a decision, but is not false. I'll leave the answer as is, because I think it still works (though with a deprecation warning), but if anybody wants to give editing another stab, please go ahead.
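The feature-selection idea raised in the comments (SelectKBest in front of an ordinary least squares model, with k tuned by GridSearchCV) could look roughly like the sketch below. The synthetic data, the step names "select" and "ols", and the candidate values for k are assumptions chosen for illustration, written against the sklearn.model_selection API.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data (an assumption): only the first two features matter.
rng = np.random.RandomState(0)
X = rng.randn(60, 10)
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.randn(60)

# Feature selector followed by plain OLS; f_regression is passed
# explicitly because SelectKBest defaults to a classification score.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_regression)),
    ("ols", LinearRegression()),
])

# Inner loop: GridSearchCV tunes the number of kept features k.
search = GridSearchCV(pipe, param_grid={"select__k": [1, 2, 5, 10]}, cv=3)

# Outer loop: evaluates the whole tuning procedure on held-out folds.
outer_scores = cross_val_score(search, X, y, cv=KFold(n_splits=5))

# Fitting the search on all data exposes the selected k for inspection.
search.fit(X, y)
print(search.best_params_, outer_scores.mean())
```

Here GridSearchCV plays the role that the built-in search plays in LassoCV, so the same nested pattern applies to models that have no dedicated CV variant.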