How to perform SMOTE with cross-validation in sklearn in Python


Solution 1

You need to perform SMOTE within each fold. Accordingly, you need to avoid train_test_split in favour of KFold:

from sklearn.model_selection import KFold
from imblearn.over_sampling import SMOTE
from sklearn.metrics import f1_score

kf = KFold(n_splits=5)

for fold, (train_index, test_index) in enumerate(kf.split(X), 1):
    X_train = X[train_index]
    y_train = y[train_index]  # Based on your code, you might need a ravel call here, but I would look into how you're generating your y
    X_test = X[test_index]
    y_test = y[test_index]  # See comment on ravel and  y_train
    sm = SMOTE()
    X_train_oversampled, y_train_oversampled = sm.fit_resample(X_train, y_train)  # fit_sample in older versions of imbalanced-learn
    model = ...  # Choose a model here
    model.fit(X_train_oversampled, y_train_oversampled)
    y_pred = model.predict(X_test)
    print(f'For fold {fold}:')
    print(f'Accuracy: {model.score(X_test, y_test)}')
    print(f'f-score: {f1_score(y_test, y_pred)}')

You can also, for example, append the scores to a list defined outside the loop, as sketched below.
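A minimal sketch of that idea, reusing kf, X, f1_score and the fitting code from the snippet above (the loop body that resamples, fits and predicts is elided here):

import numpy as np

accuracies, f_scores = [], []

for fold, (train_index, test_index) in enumerate(kf.split(X), 1):
    # ... same SMOTE resampling, fitting and prediction as above ...
    accuracies.append(model.score(X_test, y_test))
    f_scores.append(f1_score(y_test, y_pred))

print(f'Mean accuracy: {np.mean(accuracies)}')
print(f'Mean f-score: {np.mean(f_scores)}')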

Solution 2

from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import SMOTE

cv = StratifiedKFold(n_splits=5)
for train_idx, test_idx in cv.split(X, y):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    X_train, y_train = SMOTE().fit_resample(X_train, y_train)  # fit_sample in older versions of imbalanced-learn
    ...  # train and evaluate as in Solution 1
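Alternatively, imbalanced-learn provides its own Pipeline, which applies SMOTE only while fitting, so the per-fold oversampling can be delegated to cross_val_score. A minimal sketch, reusing the RandomForestClassifier settings from the question (the 'f1' scoring assumes a binary target):

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

pipeline = Pipeline([
    ('smote', SMOTE(random_state=2)),
    ('rf', RandomForestClassifier(n_estimators=25, random_state=12)),
])

# SMOTE is fitted and applied on each training fold only;
# every test fold is scored on the original, untouched data.
scores = cross_val_score(pipeline, X, y, cv=StratifiedKFold(n_splits=5), scoring='f1')
print(f'Mean f-score: {scores.mean()}')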
Comments

  • EmJ almost 2 years ago

    I have a highly imbalanced dataset and would like to perform SMOTE to balance it, and cross validation to measure the accuracy. However, most of the existing tutorials apply SMOTE to only a single training/testing split.

    Therefore, I would like to know the correct procedure to perform SMOTE with cross-validation.

    My current code is as follows. However, as mentioned above, it uses only a single split.

    from imblearn.over_sampling import SMOTE
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    sm = SMOTE(random_state=2)
    X_train_res, y_train_res = sm.fit_resample(X_train, y_train.ravel())  # fit_sample in older versions
    clf_rf = RandomForestClassifier(n_estimators=25, random_state=12)
    clf_rf.fit(X_train_res, y_train_res)
    

    I am happy to provide more details if needed.

  • gmds about 5 years ago
    As a note: you may wish to use StratifiedKFold instead, as in the other answer, since you presumably have an imbalanced class problem.
  • EmJ about 5 years ago
    Thanks a lot. I also have a y value. In that case, how should I change enumerate(kf.split(X), 1)?
  • gmds about 5 years ago
    @Emi you shouldn't need to modify that. What kf.split does is just take the size of X (how many rows it has) to determine how to generate indices for each fold. Since your y should be the same size as X, you won't need to provide it. That said, you can do kf.split(X, y) and it will have the same effect.
  • Perl over 4 years ago
    @gmds A small question: why didn't you fit the model on the oversampled data (X_train_oversampled and y_train_oversampled) rather than calling model.fit(X_train, y_train)?
  • gmds over 4 years ago
    @Hiyam That was actually my mistake, thanks! Will edit.
  • Ross_you over 3 years ago
    @gmds I came across your answer and have a fundamental question about it. Based on your answer, we calculate the accuracy and f1_score for each fold; I assume any other metric could be used as well. Once the loop is done, what is the final score? Is it the average of the scores across all folds? And if we then want to run the model with different parameters (grid search), should we compare the average score for each parameter combination?
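On that last question, the usual convention is that the final score is the mean of the per-fold scores (often reported with the standard deviation), and a grid search compares those means across parameter settings. A sketch of that with GridSearchCV and the imblearn Pipeline, which applies SMOTE inside each training fold automatically; the step names and the parameter grid below are illustrative, not from the original question:

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

pipeline = Pipeline([
    ('smote', SMOTE()),
    ('rf', RandomForestClassifier()),
])

# For each parameter combination, GridSearchCV runs the full
# cross-validation (with SMOTE applied per training fold), averages
# the per-fold scores and keeps the combination with the best mean.
search = GridSearchCV(
    pipeline,
    param_grid={'rf__n_estimators': [25, 50, 100]},
    cv=StratifiedKFold(n_splits=5),
    scoring='f1',
)
search.fit(X, y)
print(search.best_params_, search.best_score_)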