How to perform feature selection with GridSearchCV in sklearn in Python
Solution 1
Basically, you want to fine-tune the hyperparameters of your classifier (with cross-validation) after selecting features with recursive feature elimination (also with cross-validation).
The Pipeline object is meant for exactly this purpose: chaining data transformations and then applying an estimator.
You could also use a different model (GradientBoostingClassifier, etc.) for the final classification. That is possible with the following approach:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.33,
                                                    random_state=42)

# this is the classifier used for feature selection
clf_featr_sele = RandomForestClassifier(n_estimators=30,
                                        random_state=42,
                                        class_weight="balanced")
rfecv = RFECV(estimator=clf_featr_sele,
              step=1,
              cv=5,
              scoring='roc_auc')

# you can have a different classifier for your final classifier
clf = RandomForestClassifier(n_estimators=10,
                             random_state=42,
                             class_weight="balanced")
CV_rfc = GridSearchCV(clf,
                      param_grid={'max_depth': [2, 3]},
                      cv=5, scoring='roc_auc')

pipeline = Pipeline([('feature_sele', rfecv),
                     ('clf_cv', CV_rfc)])
pipeline.fit(X_train, y_train)
pipeline.predict(X_test)
Now you can apply this pipeline (including the feature selection step) to the test data.
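Several comments below ask how to see which features were kept and how to score the fitted pipeline on X_test. The following is a minimal sketch of one way to do that, not the answer author's own code. It assumes smaller settings than the answer (step=2, cv=3, 10 trees) purely to keep the run short, and the names mask, selected_features and test_auc are introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.33, random_state=42)

# Assumption: reduced settings (step=2, cv=3, 10 trees) to keep the sketch fast.
rfecv = RFECV(estimator=RandomForestClassifier(n_estimators=10, random_state=42),
              step=2, cv=3, scoring='roc_auc')
CV_rfc = GridSearchCV(RandomForestClassifier(n_estimators=10, random_state=42),
                      param_grid={'max_depth': [2, 3]}, cv=3, scoring='roc_auc')
pipeline = Pipeline([('feature_sele', rfecv), ('clf_cv', CV_rfc)])
pipeline.fit(X_train, y_train)

# Boolean mask of the columns kept by the fitted RFECV step.
mask = pipeline.named_steps['feature_sele'].support_
selected_features = list(data.feature_names[mask])

# The pipeline applies the same column mask to X_test before predicting.
test_auc = roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1])

print(selected_features)
print(pipeline.named_steps['clf_cv'].best_params_)
print(test_auc)
```

The mask comes from the fitted selection step, so it reflects the features chosen on X_train only; X_test is never used for selection or tuning.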
Solution 2
You can do what you want by prefixing the names of the parameters you want to pass to the estimator with 'estimator__'.
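from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

X = df[my_features]  # my_features: list of feature column names
y = df['gold_standard']  # labels
clf = RandomForestClassifier(random_state=0, class_weight="balanced")
rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(3), scoring='roc_auc')
param_grid = {
    'estimator__n_estimators': [200, 500],
    # 'auto' was removed in scikit-learn 1.3; drop it on newer versions
    'estimator__max_features': ['auto', 'sqrt', 'log2'],
    'estimator__max_depth': [4, 5, 6, 7, 8],
    'estimator__criterion': ['gini', 'entropy']
}
k_fold = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
CV_rfc = GridSearchCV(estimator=rfecv, param_grid=param_grid, cv=k_fold, scoring='roc_auc')
X_train, X_test, y_train, y_test = train_test_split(X, y)
CV_rfc.fit(X_train, y_train)
print(CV_rfc.best_params_)
print(CV_rfc.best_score_)
print(CV_rfc.best_estimator_)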
Output on fake data I made:
{'estimator__n_estimators': 200, 'estimator__max_depth': 6, 'estimator__criterion': 'entropy', 'estimator__max_features': 'auto'}
0.5653035605690997
RFECV(cv=StratifiedKFold(n_splits=3, random_state=None, shuffle=False),
estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced',
criterion='entropy', max_depth=6, max_features='auto',
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=200, n_jobs=None, oob_score=False, random_state=0,
verbose=0, warm_start=False),
min_features_to_select=1, n_jobs=None, scoring='roc_auc', step=1,
verbose=0)
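With this approach, best_estimator_ is itself a fitted RFECV, so the selected features and a test-set score can be read from it. The sketch below illustrates that under stated assumptions: synthetic data from make_classification stands in for df, the grid is cut down to a single parameter, and the names best_rfecv, selected_idx and test_auc are introduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Assumption: synthetic stand-in for df[my_features] / df['gold_standard'].
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

rfecv = RFECV(estimator=RandomForestClassifier(n_estimators=10, random_state=0),
              step=1, cv=StratifiedKFold(3), scoring='roc_auc')
CV_rfc = GridSearchCV(estimator=rfecv,
                      param_grid={'estimator__max_depth': [2, 4]},
                      cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
                      scoring='roc_auc')
CV_rfc.fit(X_train, y_train)

# best_estimator_ is the RFECV refit on all of X_train with the best parameters.
best_rfecv = CV_rfc.best_estimator_
selected_idx = best_rfecv.get_support(indices=True)  # column indices that were kept

# predict_proba is delegated to the refit RFECV, which masks X_test itself.
test_auc = roc_auc_score(y_test, CV_rfc.predict_proba(X_test)[:, 1])

print(best_rfecv.n_features_, selected_idx)
print(test_auc)
```

With a pandas DataFrame, the same indices could be used as X.columns[selected_idx] to recover column names.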
Solution 3
You just need to pass the recursive feature elimination estimator (RFECV) directly into the GridSearchCV object. Something like this should work:
X = df[my_features]  # all my features
y = df['gold_standard']  # labels
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=42, class_weight="balanced")
# the scorer name is 'roc_auc' ('auc_roc' is not a valid scoring string)
rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring='roc_auc')
# the forest's parameters are reached through RFECV, so they need the
# 'estimator__' prefix; without it GridSearchCV raises "Invalid parameter"
param_grid = {
    'estimator__n_estimators': [200, 500],
    'estimator__max_features': ['auto', 'sqrt', 'log2'],
    'estimator__max_depth': [4, 5, 6, 7, 8],
    'estimator__criterion': ['gini', 'entropy']
}
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
#------------- Just pass your RFECV object as estimator here directly --------#
CV_rfc = GridSearchCV(estimator=rfecv, param_grid=param_grid, cv=k_fold, scoring='roc_auc')
CV_rfc.fit(x_train, y_train)
print(CV_rfc.best_params_)
print(CV_rfc.best_score_)
print(CV_rfc.best_estimator_)
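When an RFECV object is the estimator being searched, the keys of param_grid have to be names that RFECV itself exposes. A short sketch of how to list them with get_params() follows; the variable names param_names and forest_params are introduced here for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

rfecv = RFECV(estimator=RandomForestClassifier(random_state=42), cv=3)

# All parameter names that GridSearchCV can set on this object.
param_names = sorted(rfecv.get_params().keys())

# Parameters of the wrapped forest appear with the 'estimator__' prefix.
forest_params = [name for name in param_names if name.startswith('estimator__')]

print(param_names)
print(forest_params)
```

A bare key such as n_estimators is not in this list, which is why a grid written for the forest alone cannot be reused unchanged on the RFECV wrapper.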
Comments
-
EmJ almost 2 years
I am using recursive feature elimination with cross-validation (RFECV) as a feature selector for a random forest classifier, as follows.
X = df[[my_features]]  # all my features
y = df['gold_standard']  # labels
clf = RandomForestClassifier(random_state=42, class_weight="balanced")
rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring='roc_auc')
rfecv.fit(X, y)
print("Optimal number of features : %d" % rfecv.n_features_)
features = list(X.columns[rfecv.support_])
I am also performing GridSearchCV to tune the hyperparameters of RandomForestClassifier, as follows.
X = df[[my_features]]  # all my features
y = df['gold_standard']  # labels
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)
rfc = RandomForestClassifier(random_state=42, class_weight='balanced')
param_grid = {
    'n_estimators': [200, 500],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [4, 5, 6, 7, 8],
    'criterion': ['gini', 'entropy']
}
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=k_fold, scoring='roc_auc')
CV_rfc.fit(x_train, y_train)
print(CV_rfc.best_params_)
print(CV_rfc.best_score_)
print(CV_rfc.best_estimator_)
pred = CV_rfc.predict_proba(x_test)[:, 1]
print(roc_auc_score(y_test, pred))
However, I am not clear how to merge feature selection (rfecv) with GridSearchCV.
EDIT:
When I run the answer suggested by @Gambit I got the following error:
ValueError: Invalid parameter criterion for estimator RFECV(cv=StratifiedKFold(n_splits=10, random_state=None, shuffle=False), estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced', criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None, oob_score=False, random_state=42, verbose=0, warm_start=False), min_features_to_select=1, n_jobs=None, scoring='roc_auc', step=1, verbose=0). Check the list of available parameters with `estimator.get_params().keys()`.
I could resolve the above issue by using estimator__ in the param_grid parameter list.
My question now is how to use the selected features and parameters on x_test to verify that the model works fine with unseen data. How can I obtain the best features and train the model with the optimal hyperparameters?
I am happy to provide more details if needed.
-
EmJ about 5 years
Thanks a lot for the great answer. Is there a way to get the selected features from rfecv? Moreover, how can we validate X_test using the selected features? Looking forward to hearing from you. Thank you very much once again :)
-
EmJ about 5 years
I tried to run your code. However, I got the following error: ValueError: Invalid parameter criterion for estimator. Can you please tell me how to resolve this issue? Thank you very much :)
-
EmJ about 5 years
Thanks a lot for your great answer. Could you please tell me how to use X_test to validate the results? Looking forward to hearing from you. Thank you very much :)
-
gmds about 5 years
roc_auc_score(y_test, CV_rfc.predict_proba(X_test)[:, 1])?
-
EmJ about 5 years
Thanks a lot. One last question: I would like to see which features were selected through this process. Is it possible to get those selected features? :)
-
EmJ about 5 years
Is it correct to get the number of selected features as rfecv.n_features_? Please kindly correct me if I am wrong. Looking forward to hearing from you. Thank you very much :)
-
EmJ about 5 years
Thanks a lot for the great answer. Why do you think it is important to do feature selection using a different classifier? Is there any reason for it? Looking forward to hearing from you. Thank you very much :)
-
Venkatachalam about 5 years
As you know, feature selection can be done with a comparatively simple classifier. But for the final classification you would be more interested in performance, and hence you might go for an MLP classifier or something like that.
-
EmJ about 5 years
Thanks a lot. Just a quick question: what are the simple classifiers that you would recommend for feature selection? Looking forward to hearing from you :)
-
Venkatachalam about 5 years
I would start with LogisticRegression, then SGDClassifier, RidgeClassifier, DecisionTreeClassifier, etc.
-
EmJ about 5 years
Thanks a lot. What algorithms would you recommend for parameter tuning? Moreover, could you please tell me if you know the answer to the following question: stackoverflow.com/questions/55649352/…
-
EmJ about 5 years
Is it possible to get the f1 score of pipeline.fit(X_train, y_train)? Looking forward to hearing from you. :)
-
aasthetic over 3 years
Hello, I applied this method, but I see that the model, after running the pipeline, has selected more features than what actually came from rfecv.
-
rajesh almost 3 years
Shouldn't it be:
pipeline = Pipeline([('feature_sele', rfecv), ('clf_cv', CV_rfc)])
CV_rfc = GridSearchCV(pipeline, param_grid={'clf_cv__max_depth': [2, 4]}, ...)
CV_rfc.fit(X_train, Y_train)
CV_rfc.predict(X_test)
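A variant of the idea in this comment is to make the last pipeline step a plain classifier and wrap the whole pipeline in GridSearchCV, so that feature selection is rerun inside every tuning fold. The sketch below is a swapped-in arrangement, not the code from the answer or the comment. It assumes synthetic data, a step named 'clf' (hypothetical), and small forests to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Assumption: synthetic data in place of the asker's DataFrame.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

pipeline = Pipeline([
    ('feature_sele', RFECV(RandomForestClassifier(n_estimators=10, random_state=0),
                           step=1, cv=3, scoring='roc_auc')),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=0)),
])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'.
search = GridSearchCV(pipeline, param_grid={'clf__max_depth': [2, 4]},
                      cv=3, scoring='roc_auc')
search.fit(X_train, y_train)
predictions = search.predict(X_test)

print(search.best_params_)
```

The trade-off is run time: RFECV is refit for every grid point and fold, whereas the pipeline in Solution 1 runs feature selection only once.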
-
Yev Guyduy almost 2 years
RFE is a wrapper for an estimator. I think that doing what this answer does does not actually influence the final model; in other words, RFE is not passing anything to the model part. You would have to validate this via the number of features chosen: forcibly select, say, only 1 by changing n_features_ to 1, then expand to whatever number, say 10. If the scores are the same, then this pipeline is not working.
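One way to check the concern raised in the last few comments is to compare the number of features RFECV kept with the number of features the final classifier was actually fitted on. The sketch below does this for a pipeline shaped like the one in Solution 1. It assumes synthetic data and reduced settings, and it relies on the n_features_in_ attribute, which is available in scikit-learn 0.24 and later.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Assumption: synthetic data with 10 columns, only 3 of them informative.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

rfecv = RFECV(RandomForestClassifier(n_estimators=10, random_state=0),
              step=1, cv=3, scoring='roc_auc')
CV_rfc = GridSearchCV(RandomForestClassifier(n_estimators=10, random_state=0),
                      param_grid={'max_depth': [2, 3]}, cv=3, scoring='roc_auc')
pipeline = Pipeline([('feature_sele', rfecv), ('clf_cv', CV_rfc)])
pipeline.fit(X, y)

selector = pipeline.named_steps['feature_sele']
final_model = pipeline.named_steps['clf_cv'].best_estimator_

n_selected = selector.n_features_             # columns kept by RFECV
n_seen_by_model = final_model.n_features_in_  # columns the tuned forest was fit on

print(n_selected, n_seen_by_model)
```

If the two numbers match, the final classifier only ever saw the columns that RFECV kept, because Pipeline passes the transformed data on to its last step.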