Multi-output regressor and sklearn's RFE module


RFE, as used in the question, does not accept a multi-output target, because each target could lead to a different combination of input features being selected. Hence, you need to create an individual RFE for each target variable.

For example:

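# One RFE per target column; `regress`, `my_X`, and `my_y` are defined in the question.
rfe = {}
for i in range(my_y.shape[1]):
    # n_features_to_select must be passed by keyword in recent scikit-learn releases
    rfe[i] = RFE(regress, n_features_to_select=300)
    rfe[i].fit(my_X, my_y[:, i])

# Features selected for the first target
feature_final = rfe[0].transform(my_X)
feature_final.shape
# (5000, 300)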
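
Since each target yields its own selection, the per-target masks usually have to be combined in some way before fitting one multi-output model. The following is a minimal, self-contained sketch of that idea, not part of the original answer: it uses small synthetic data, a plain `Ridge` in place of the question's `RidgeCV`, and the names `n_keep`, `selectors`, `union`, and `intersection` are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import Ridge

# Small synthetic problem (sizes are arbitrary, chosen to run quickly)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = rng.normal(size=(200, 2))

n_keep = 5  # hypothetical number of features to keep per target

# Fit one RFE per target column, since each column is a 1-D target
selectors = []
for j in range(y.shape[1]):
    sel = RFE(Ridge(alpha=1.0), n_features_to_select=n_keep, step=1)
    sel.fit(X, y[:, j])
    selectors.append(sel)

# Boolean masks of the selected features, one row per target
masks = np.array([sel.support_ for sel in selectors])

# Two simple ways to combine the selections (a design choice, not a rule)
union = masks.any(axis=0)         # kept for at least one target
intersection = masks.all(axis=0)  # kept for every target

X_union = X[:, union]
print(masks.sum(axis=1), union.sum(), intersection.sum())
```

The union keeps every feature that matters for some target, at the cost of a larger feature set; the intersection is stricter and may be empty when the targets depend on different inputs.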

Also note this passage from the documentation of cross_val_predict:

It is not appropriate to pass these predictions into an evaluation metric. Use cross_validate to measure generalization error.
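
As a rough sketch of what that note suggests, the MAE and RMSE from the question could be computed per fold with `cross_validate` instead of on pooled `cross_val_predict` output. The data sizes, the shortened `alphas` grid, and the `KFold` settings below are assumptions made for illustration; only the scorer names are standard scikit-learn strings.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold, cross_validate

# Synthetic multi-output regression data (sizes are arbitrary)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))
y = rng.normal(size=(500, 2))

regress = RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1])
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Score each held-out fold separately rather than pooling predictions
scores = cross_validate(
    regress, X, y, cv=cv,
    scoring=("neg_mean_absolute_error", "neg_mean_squared_error"),
)

# scikit-learn reports errors negated, so flip the sign back
mae_per_fold = -scores["test_neg_mean_absolute_error"]
rmse_per_fold = np.sqrt(-scores["test_neg_mean_squared_error"])

print("MAE:  {:.5f} +/- {:.5f}".format(mae_per_fold.mean(), mae_per_fold.std()))
print("RMSE: {:.5f} +/- {:.5f}".format(rmse_per_fold.mean(), rmse_per_fold.std()))
```

Reporting the mean and spread across folds gives an estimate of generalization error together with a sense of how much it varies between splits.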

Author by

Blade

Updated on December 02, 2022

Comments

  • Blade
    Blade over 1 year

    I was wondering if it is possible to do RFE using a multi-variate estimator with sklearn package. I checked the documentation and I can't find anything for or against it. Here is the minimal code:

    import sklearn.linear_model as skl
    from sklearn.feature_selection import RFE
    import numpy as np
    from scat import *
    from sklearn import metrics, model_selection
    
    # -- params
    n_folds = 5
    N       = 5000
    # -- regressor
    regress = skl.RidgeCV(alphas=[1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1])
    
    # -- cross-validation
    P = np.random.permutation(N).reshape((n_folds, -1))
    cross_val_folds = []
    
    for i_fold in range(n_folds):
        fold = (np.concatenate(P[np.arange(n_folds) != i_fold], axis=0), P[i_fold])
        cross_val_folds.append(fold)
    
    my_X = np.random.normal(0,1,[N, 315])
    my_y = np.random.normal(0,1,[N, 2])
    my_pred = model_selection.cross_val_predict(regress, X=my_X, y=my_y, cv=cross_val_folds)
    
    MAE = metrics.mean_absolute_error(my_y, my_pred)
    RMSE = np.sqrt(metrics.mean_squared_error(my_y, my_pred))
    print('MAE: {}, RMSE: {}'.format(round(MAE, 5), round(RMSE, 5)))
    
    rfe = RFE(regress, 300)
    feature_final = rfe.fit_transform(my_X, my_y)
    

    but I get the following error when testing it

    ValueError: bad input shape (5000, 2)

    which doesn't provide much information.


    Edits:

    Apparently, inside RFE, y is passed through

    y = column_or_1d(y, warn=True)
    

    which requires y to be a vector. Is there a workaround for this?

  • Blade
    Blade almost 5 years
    Thanks for the comment on RFE's support of the multi-label format. I think creating two RFEs would be meaningless in this scenario, but a good idea would be to run RFE on the different target variables in sequence.