How to perform feature selection with GridSearchCV in sklearn in Python


Solution 1

Basically, you want to fine-tune the hyperparameters of your classifier (with cross-validation) after selecting features with recursive feature elimination (also with cross-validation).

A Pipeline object is meant for exactly this purpose: chaining data transformations and a final estimator.

You could also use a different model (GradientBoostingClassifier, etc.) for the final classification. The following approach makes that possible:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=0)  # illustrative seed

# this is the classifier used for feature selection
clf_featr_sele = RandomForestClassifier(n_estimators=30,
                                        random_state=42)  # illustrative seed
rfecv = RFECV(estimator=clf_featr_sele,
              scoring='roc_auc')

# you can use a different classifier as your final classifier
clf = RandomForestClassifier(n_estimators=10,
                             random_state=42)  # illustrative seed
CV_rfc = GridSearchCV(clf,
                      param_grid={'max_depth': [2, 3]},  # illustrative grid
                      cv=5, scoring='roc_auc')

# RFECV acts as a transformer here: it passes only the selected columns on
pipeline = Pipeline([('feature_sele', rfecv),
                     ('clf_cv', CV_rfc)])

pipeline.fit(X_train, y_train)

Now you can apply this pipeline (including the feature selection step) to the test data, for example with pipeline.predict(X_test).
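A minimal sketch of that evaluation step follows. It rebuilds a smaller version of the pipeline so that it runs on its own; the reduced settings (10 trees, step=5, 3 folds) and the max_depth grid are assumptions chosen for speed, not tuned values. It also compares the number of features kept by RFECV with the number the final classifier was trained on, which is one way to check that the selection really reaches the model.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Small, untuned settings keep this sketch fast.
rfecv = RFECV(RandomForestClassifier(n_estimators=10, random_state=0),
              step=5, cv=3, scoring='roc_auc')
CV_rfc = GridSearchCV(RandomForestClassifier(n_estimators=10, random_state=0),
                      param_grid={'max_depth': [2, 3]}, cv=3, scoring='roc_auc')
pipeline = Pipeline([('feature_sele', rfecv), ('clf_cv', CV_rfc)])
pipeline.fit(X_train, y_train)

# The pipeline applies the fitted feature mask to X_test before predicting.
test_proba = pipeline.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, test_proba)
print("Test ROC AUC: %.3f" % test_auc)

# Inspect what each fitted step ended up with.
selector = pipeline.named_steps['feature_sele']
final_clf = pipeline.named_steps['clf_cv'].best_estimator_
print("Features kept by RFECV:", selector.n_features_)
print("Features seen by the final classifier:", final_clf.n_features_in_)
print("Best parameters:", pipeline.named_steps['clf_cv'].best_params_)
```

If the two feature counts match, the final classifier was trained only on the columns RFECV selected.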

Solution 2

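GridSearchCV sets parameters on the RFECV object itself, so the parameters of the wrapped classifier have to be addressed through RFECV's estimator argument. You can do that by prefixing the names of the parameters you want to pass to the estimator with 'estimator__'.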

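from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

X = df[my_features]  # my_features: list of feature column names
y = df[gold_standard]

clf = RandomForestClassifier(random_state=0, class_weight="balanced")
rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(3), scoring='roc_auc')

# 'estimator__' routes each parameter to the RandomForestClassifier inside RFECV
param_grid = {
    'estimator__n_estimators': [200, 500],
    # 'auto' was deprecated in scikit-learn 1.1 and later removed; drop it on newer versions
    'estimator__max_features': ['auto', 'sqrt', 'log2'],
    'estimator__max_depth': [4, 5, 6, 7, 8],
    'estimator__criterion': ['gini', 'entropy']
}
k_fold = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

CV_rfc = GridSearchCV(estimator=rfecv, param_grid=param_grid, cv=k_fold, scoring='roc_auc')

X_train, X_test, y_train, y_test = train_test_split(X, y)

CV_rfc.fit(X_train, y_train)

print(CV_rfc.best_params_)
print(CV_rfc.best_estimator_)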

Output on fake data I made:

{'estimator__n_estimators': 200, 'estimator__max_depth': 6, 'estimator__criterion': 'entropy', 'estimator__max_features': 'auto'}
RFECV(cv=StratifiedKFold(n_splits=3, random_state=None, shuffle=False),
   estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='entropy', max_depth=6, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=200, n_jobs=None, oob_score=False, random_state=0,
            verbose=0, warm_start=False),
   min_features_to_select=1, n_jobs=None, scoring='roc_auc', step=1,
   verbose=0)
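
As a rough sketch of how the fitted search object could be inspected: GridSearchCV refits by default, so best_estimator_ is a fitted RFECV whose support_ mask marks the columns that were kept. The dataset (load_breast_cancer), the tiny grid, and the small forest are assumptions chosen so the example runs quickly; they are not taken from the answer above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

data = load_breast_cancer()
X, y, feature_names = data.data, data.target, data.feature_names
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=10, random_state=0, class_weight="balanced")
rfecv = RFECV(estimator=clf, step=5, cv=StratifiedKFold(3), scoring='roc_auc')

# Deliberately tiny grid; the 'estimator__' prefix targets the wrapped forest.
param_grid = {'estimator__max_depth': [3, 5]}
k_fold = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
CV_rfc = GridSearchCV(estimator=rfecv, param_grid=param_grid, cv=k_fold, scoring='roc_auc')
CV_rfc.fit(X_train, y_train)

# best_estimator_ is the RFECV refit on all of X_train with the best parameters.
best_rfecv = CV_rfc.best_estimator_
selected = feature_names[best_rfecv.support_]
print("Number of selected features:", best_rfecv.n_features_)
print("Selected features:", list(selected))

# predict_proba is delegated to the refit RFECV, which masks X_test itself.
test_proba = CV_rfc.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, test_proba)
print("Test ROC AUC: %.3f" % test_auc)
```

With a pandas DataFrame, the same boolean mask can be applied to the column index, for example X.columns[best_rfecv.support_].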

Solution 3

You just need to pass the recursive feature elimination estimator (RFECV) directly into the GridSearchCV object. Something like this should work:

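X = df[my_features] #all my features
y = df['gold_standard'] #labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=42, class_weight="balanced")
rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring='roc_auc')

# GridSearchCV sets these on the RFECV object, so parameters of the wrapped
# forest need the 'estimator__' prefix; bare names such as 'criterion' raise
# "ValueError: Invalid parameter criterion for estimator RFECV(...)".
param_grid = {
    'estimator__n_estimators': [200, 500],
    'estimator__max_features': ['auto', 'sqrt', 'log2'],
    'estimator__max_depth': [4, 5, 6, 7, 8],
    'estimator__criterion': ['gini', 'entropy']
}
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

#------------- Just pass your RFECV object as estimator here directly --------#

CV_rfc = GridSearchCV(estimator=rfecv, param_grid=param_grid, cv=k_fold, scoring='roc_auc')
CV_rfc.fit(X_train, y_train)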
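
One way to check which names a parameter grid may use is to list the keys that the RFECV object exposes through get_params(). The short sketch below does that; the estimator settings are illustrative only.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

rfecv = RFECV(estimator=RandomForestClassifier(random_state=42), scoring='roc_auc')

# Every key listed here is a valid name for GridSearchCV's param_grid.
param_names = sorted(rfecv.get_params().keys())

# Parameters of the wrapped classifier appear with the 'estimator__' prefix.
nested_names = [name for name in param_names if name.startswith('estimator__')]
own_names = [name for name in param_names if not name.startswith('estimator__')]

print("RFECV's own parameters:", own_names)
print("Wrapped classifier parameters:", nested_names)
```

Names such as 'criterion' only show up in the prefixed form, which is why an unprefixed grid is rejected.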

    I am using recursive feature elimination with cross-validation (RFECV) as a feature selector for a RandomForestClassifier, as follows.

    X = df[[my_features]] #all my features
    y = df['gold_standard'] #labels
    clf = RandomForestClassifier(random_state = 42, class_weight="balanced")
    rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring='roc_auc')
    rfecv.fit(X, y)
    print("Optimal number of features : %d" % rfecv.n_features_)

    I am also using GridSearchCV to tune the hyperparameters of the RandomForestClassifier, as follows.

    X = df[[my_features]] #all my features
    y = df['gold_standard'] #labels
    x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)
    rfc = RandomForestClassifier(random_state=42, class_weight = 'balanced')
    param_grid = { 
        'n_estimators': [200, 500],
        'max_features': ['auto', 'sqrt', 'log2'],
        'max_depth' : [4,5,6,7,8],
        'criterion' :['gini', 'entropy']
    }
    k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv= k_fold, scoring = 'roc_auc')
    CV_rfc.fit(x_train, y_train)
    pred = CV_rfc.predict_proba(x_test)[:,1]
    print(roc_auc_score(y_test, pred))

    However, I am not clear how to merge feature selection (rfecv) with GridSearchCV.


    When I run the answer suggested by @Gambit I got the following error:

    ValueError: Invalid parameter criterion for estimator RFECV(cv=StratifiedKFold(n_splits=10, random_state=None, shuffle=False),
       estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced',
                criterion='gini', max_depth=None, max_features='auto',
                max_leaf_nodes=None, min_impurity_decrease=0.0,
                min_impurity_split=None, min_samples_leaf=1,
                min_samples_split=2, min_weight_fraction_leaf=0.0,
                n_estimators='warn', n_jobs=None, oob_score=False,
                random_state=42, verbose=0, warm_start=False),
       min_features_to_select=1, n_jobs=None, scoring='roc_auc', step=1,
       verbose=0). Check the list of available parameters with `estimator.get_params().keys()`.

    I could resolve the above issue by using estimator__ in the param_grid parameter list.

    My question now is: how can I use the selected features and parameters on x_test to verify that the model works well on unseen data? How can I obtain the best features and train a model on them with the optimal hyperparameters?

    I am happy to provide more details if needed.

  • EmJ about 5 years
    Thanks a lot for the great answer. Is there a way to get the selected features from rfecv? Moreover, how can we validate X_test using the selected features? Looking forward to hearing from you. Thank you very much once again :)
  • EmJ about 5 years
    I tried to run your code. However, I got the following error: ValueError: Invalid parameter criterion for estimator. Can you please tell me how to resolve this issue? Thank you very much :)
  • EmJ about 5 years
    Thanks a lot for your great answer. Could you please tell me how to use X_test to validate the results? Looking forward to hearing from you. Thank you very much :)
  • gmds about 5 years
    roc_auc_score(y_test, CV_rfc.predict_proba(X_test)[:, 1])?
  • EmJ about 5 years
    Thanks a lot. One last question: I would like to see which features are selected through this process. Is it possible to get those selected features? :)
  • EmJ about 5 years
    Is it correct to get the number of selected features as rfecv.n_features_? Please kindly correct me if I am wrong. Looking forward to hearing from you. Thank you very much :)
  • EmJ about 5 years
    Thanks a lot for the great answer. Why do you think it is important to do feature selection using a different classifier? Is there any reason for it? Looking forward to hearing from you. Thank you very much :)
  • Venkatachalam about 5 years
    As you know, feature selection can be done with a comparatively simple classifier. But for the final classification you would be more interested in performance, and hence you might go for an MLP classifier or something like that.
  • EmJ about 5 years
    Thanks a lot. Just a quick question: what are the simple classifiers that you would recommend for feature selection? Looking forward to hearing from you :)
  • Venkatachalam about 5 years
    I would start with LogisticRegression, then SGDClassifier, RidgeClassifier, DecisionTreeClassifier, etc.
  • EmJ about 5 years
    Thanks a lot. What algorithms would you recommend for parameter tuning? Moreover, could you please tell me if you know answers for the following question…
  • EmJ about 5 years
    Is it possible to get f1 score of, y_train)? Looking forward to hearing from you. :)
  • aasthetic over 3 years
    Hello, I applied this method, but I see that the model, after running the pipeline, has selected more features than what actually came from rfecv.
  • rajesh almost 3 years
    Shouldn't it be: pipeline = Pipeline([('feature_sele',rfecv), ('clf_cv',CV_rfc)]) CV_rfc = GridSearchCV(pipeline, param_grid={clf_cv__max_depth:[2,4]}, ...),Y_train) CV_rfc.predict(X_test)
  • Yev Guyduy almost 2 years
    RFE is a wrapper for an estimator. I think doing what this answer does does not actually influence the final model; in other words, RFE is not passing anything to the MODEL part. One would have to validate by the number of features chosen: forcibly select, say, only 1 by changing n_features_ to 1, then expand to whatever number, say 10; if the scores are the same, then this pipeline is not working.
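
The alternative raised in the comments, searching over the whole pipeline so that feature selection is refit inside every cross-validation fold, could be sketched as follows. This is only an illustration under assumptions: the step names 'feature_sele' and 'clf', the breast cancer dataset, the grid values, and the small forests are not taken from any of the answers. Comparing the number of features kept by the selector with the number the classifier was trained on is one way to check that the selection step feeds the model.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Small, untuned settings keep this sketch fast.
selector = RFECV(RandomForestClassifier(n_estimators=10, random_state=0),
                 step=5, cv=3, scoring='roc_auc')
pipeline = Pipeline([
    ('feature_sele', selector),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=0)),
])

# 'clf__max_depth' addresses max_depth of the pipeline step named 'clf'.
search = GridSearchCV(pipeline, param_grid={'clf__max_depth': [2, 4]},
                      cv=3, scoring='roc_auc')
search.fit(X_train, y_train)

best_pipeline = search.best_estimator_
n_selected = best_pipeline.named_steps['feature_sele'].n_features_
n_seen = best_pipeline.named_steps['clf'].n_features_in_
test_auc = search.score(X_test, y_test)  # scored with the 'roc_auc' scorer

print("Best parameters:", search.best_params_)
print("Features kept by RFECV:", n_selected)
print("Features seen by the classifier:", n_seen)
print("Test ROC AUC: %.3f" % test_auc)
```

Compared with placing a GridSearchCV inside the pipeline, this layout costs more fits, since RFECV is rerun for every grid point and fold, but each validation score then reflects the full select-then-classify procedure.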