How to perform feature selection with GridSearchCV in sklearn in Python

Solution 1

Basically, you want to fine-tune the hyperparameters of your classifier (with cross-validation) after feature selection using recursive feature elimination (also with cross-validation).

The Pipeline object is meant exactly for this purpose: assembling the data transformations and then applying an estimator.

You could even use a different model (GradientBoostingClassifier, etc.) for your final classification. It is possible with the following approach:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.33, 
                                                    random_state=42)


# the classifier used for feature selection
clf_featr_sele = RandomForestClassifier(n_estimators=30,
                                        random_state=42,
                                        class_weight="balanced")
rfecv = RFECV(estimator=clf_featr_sele,
              step=1,
              cv=5,
              scoring='roc_auc')

# you can use a different classifier for the final classification
clf = RandomForestClassifier(n_estimators=10,
                             random_state=42,
                             class_weight="balanced")
CV_rfc = GridSearchCV(clf,
                      param_grid={'max_depth': [2, 3]},
                      cv=5, scoring='roc_auc')

pipeline = Pipeline([('feature_sele', rfecv),
                     ('clf_cv', CV_rfc)])

pipeline.fit(X_train, y_train)
pipeline.predict(X_test)

Now you can apply this pipeline (including the feature selection step) to the test data.
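
To sanity-check the fitted pipeline on the held-out split, you can score its probability predictions, for example with ROC AUC. A minimal sketch (roc_auc_score lives in sklearn.metrics):

from sklearn.metrics import roc_auc_score

# the pipeline first applies the fitted RFECV transform to X_test,
# then calls predict_proba on the tuned classifier
proba = pipeline.predict_proba(X_test)[:, 1]
print("Test ROC AUC:", roc_auc_score(y_test, proba))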

Solution 2

You can do what you want by prefixing the names of the parameters you want to pass to the inner estimator with 'estimator__'.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

X = df[my_features]       # my_features is a list of feature column names
y = df['gold_standard']   # labels

clf = RandomForestClassifier(random_state=0, class_weight="balanced")
rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(3), scoring='roc_auc')

param_grid = {
    'estimator__n_estimators': [200, 500],
    'estimator__max_features': ['auto', 'sqrt', 'log2'],
    'estimator__max_depth': [4, 5, 6, 7, 8],
    'estimator__criterion': ['gini', 'entropy']
}
k_fold = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

CV_rfc = GridSearchCV(estimator=rfecv, param_grid=param_grid, cv=k_fold, scoring='roc_auc')

X_train, X_test, y_train, y_test = train_test_split(X, y)

CV_rfc.fit(X_train, y_train)

Output on fake data I made:

{'estimator__n_estimators': 200, 'estimator__max_depth': 6, 'estimator__criterion': 'entropy', 'estimator__max_features': 'auto'}
0.5653035605690997
RFECV(cv=StratifiedKFold(n_splits=3, random_state=None, shuffle=False),
   estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='entropy', max_depth=6, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=200, n_jobs=None, oob_score=False, random_state=0,
            verbose=0, warm_start=False),
   min_features_to_select=1, n_jobs=None, scoring='roc_auc', step=1,
   verbose=0)
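
To answer the follow-up in the comments about retrieving the selected features: the refit RFECV is available as best_estimator_ on the fitted GridSearchCV and exposes the usual RFECV attributes. A minimal sketch, assuming X is a DataFrame with named columns:

best_rfecv = CV_rfc.best_estimator_          # the refit RFECV with the winning parameters
print(best_rfecv.n_features_)                # number of selected features
print(list(X.columns[best_rfecv.support_]))  # names of the selected features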

Solution 3

You just need to pass the recursive feature elimination estimator directly into the GridSearchCV object. Note that the grid keys must then be prefixed with estimator__ so they reach the wrapped RandomForestClassifier (this is exactly the error the question's EDIT ran into). Something like this should work:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

X = df[my_features]       # all my features
y = df['gold_standard']   # labels

clf = RandomForestClassifier(random_state=42, class_weight="balanced")
rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring='roc_auc')

param_grid = {
    'estimator__n_estimators': [200, 500],
    'estimator__max_features': ['auto', 'sqrt', 'log2'],
    'estimator__max_depth': [4, 5, 6, 7, 8],
    'estimator__criterion': ['gini', 'entropy']
}
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

#------------- Just pass your RFECV object as estimator here directly --------#

CV_rfc = GridSearchCV(estimator=rfecv, param_grid=param_grid, cv=k_fold, scoring='roc_auc')

x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

CV_rfc.fit(x_train, y_train)
print(CV_rfc.best_params_)
print(CV_rfc.best_score_)
print(CV_rfc.best_estimator_)
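
As gmds suggests in the comments, you can then validate on the held-out split. Note that roc_auc_score expects the positive-class probabilities, hence the [:, 1] slice (a minimal sketch):

from sklearn.metrics import roc_auc_score

pred = CV_rfc.predict_proba(x_test)[:, 1]  # probability of the positive class
print(roc_auc_score(y_test, pred))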

Comments

  • EmJ almost 2 years

    I am using recursive feature elimination with cross-validation (rfecv) as a feature selector for a RandomForestClassifier, as follows.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFECV
    from sklearn.model_selection import StratifiedKFold

    X = df[my_features]       # all my features
    y = df['gold_standard']   # labels

    clf = RandomForestClassifier(random_state=42, class_weight="balanced")
    rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring='roc_auc')
    rfecv.fit(X, y)

    print("Optimal number of features : %d" % rfecv.n_features_)
    features = list(X.columns[rfecv.support_])
    

    I am also performing GridSearchCV to tune the hyperparameters of the RandomForestClassifier, as follows.

    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import GridSearchCV, train_test_split

    X = df[my_features]       # all my features
    y = df['gold_standard']   # labels

    x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

    rfc = RandomForestClassifier(random_state=42, class_weight='balanced')
    param_grid = {
        'n_estimators': [200, 500],
        'max_features': ['auto', 'sqrt', 'log2'],
        'max_depth': [4, 5, 6, 7, 8],
        'criterion': ['gini', 'entropy']
    }
    k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=k_fold, scoring='roc_auc')
    CV_rfc.fit(x_train, y_train)
    print(CV_rfc.best_params_)
    print(CV_rfc.best_score_)
    print(CV_rfc.best_estimator_)

    pred = CV_rfc.predict_proba(x_test)[:, 1]
    print(roc_auc_score(y_test, pred))
    

    However, I am not clear how to merge feature selection (rfecv) with GridSearchCV.

    EDIT:

    When I ran the answer suggested by @Gambit, I got the following error:

    ValueError: Invalid parameter criterion for estimator RFECV(cv=StratifiedKFold(n_splits=10, random_state=None, shuffle=False),
       estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced',
                criterion='gini', max_depth=None, max_features='auto',
                max_leaf_nodes=None, min_impurity_decrease=0.0,
                min_impurity_split=None, min_samples_leaf=1,
                min_samples_split=2, min_weight_fraction_leaf=0.0,
                n_estimators='warn', n_jobs=None, oob_score=False,
                random_state=42, verbose=0, warm_start=False),
       min_features_to_select=1, n_jobs=None, scoring='roc_auc', step=1,
       verbose=0). Check the list of available parameters with `estimator.get_params().keys()`.
    

    I could resolve the above issue by using estimator__ in the param_grid parameter list.


    My question now is: how do I use the selected features and parameters on x_test to verify that the model works well on unseen data? How can I obtain the best features and train with the optimal hyperparameters?

    I am happy to provide more details if needed.

  • EmJ about 5 years
    Thanks a lot for the great answer. Is there a way to get the selected features from rfecv? Moreover, how can we validate X_test using the selected features? Looking forward to hearing from you. Thank you very much once again :)
  • EmJ about 5 years
    I tried to run your code. However, I got the following error: ValueError: Invalid parameter criterion for estimator. Can you please tell me how to resolve this issue? Thank you very much :)
  • EmJ about 5 years
    Thanks a lot for your great answer. Could you please tell me how to use X_test to validate the results? Looking forward to hearing from you. Thank you very much :)
  • gmds about 5 years
    roc_auc_score(y_test, CV_rfc.predict_proba(X_test))?
  • EmJ about 5 years
    Thanks a lot. One last question: I would like to see which features are selected through this process. Is it possible to get those selected features? :)
  • EmJ about 5 years
    Is it correct to get the selected number of features as rfecv.n_features_? Please kindly correct me if I am wrong. Looking forward to hearing from you. Thank you very much :)
  • EmJ about 5 years
    Thanks a lot for the great answer. Why do you think it is important to do feature selection using a different classifier? Is there any reason for it? Looking forward to hearing from you. Thank you very much :)
  • Venkatachalam about 5 years
    As you know, feature selection can be done with a comparatively simple classifier. But when you want to do the final classification, you are more interested in performance, and hence you might go for an MLP classifier or something like that.
  • EmJ about 5 years
    Thanks a lot. Just a quick question: what simple classifiers would you recommend for feature selection? Looking forward to hearing from you :)
  • Venkatachalam about 5 years
    I would start with LogisticRegression, then SGDClassifier, RidgeClassifier, DecisionTree, etc.
  • EmJ about 5 years
    Thanks a lot. What algorithms would you recommend for parameter tuning? Moreover, could you please tell me if you know the answer to the following question: stackoverflow.com/questions/55649352/…
  • EmJ about 5 years
    Is it possible to get the F1 score of pipeline.fit(X_train, y_train)? Looking forward to hearing from you. :)
  • aasthetic over 3 years
    Hello, I applied this method, but I see that the model, after running the pipeline, has selected more features than what actually came from rfecv.
  • rajesh almost 3 years
    Shouldn't it be: pipeline = Pipeline([('feature_sele', rfecv), ('clf_cv', CV_rfc)]); CV_rfc = GridSearchCV(pipeline, param_grid={'clf_cv__max_depth': [2, 4]}, ...); CV_rfc.fit(X_train, Y_train); CV_rfc.predict(X_test)?
  • Yev Guyduy almost 2 years
    RFE is a wrapper around an estimator, and I think what this answer does actually does not influence the final model; in other words, RFE is not passing anything to the model part. You would have to validate by the number of features chosen: forcibly select only 1 by changing n_features_ to 1 and then expand to some larger number, say 10. If the scores are the same, then this pipeline is not working.