Resampling in scikit-learn and/or pandas


Solution 1

If you are open to adding a dependency, I find the imbalanced-learn library useful for resampling. Here the categorical variable is the target 'y' and the data to resample is 'X'. In the example below, the fish are oversampled until they equal the number of dogs, 3:3.

The code is slightly modified from the imbalanced-learn docs, section 2.1.1, "Naive random over-sampling". The method works with both numeric and string data.

import numpy as np
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

y = np.array([1, 1, 0, 0, 0])  # 1 = fish, 0 = dog
print('target:\n', y)
X = np.array([['red fish'], ['blue fish'], ['dog'], ['dog'], ['dog']])
print('data:\n', X)

print('Original dataset shape {}'.format(Counter(y)))  # 3 dogs (0), 2 fish (1)
print(type(X))

# 'auto' oversamples every class except the majority up to the majority count
ros = RandomOverSampler(sampling_strategy='auto', random_state=42)
X_res, y_res = ros.fit_resample(X, y)

print('Resampled dataset shape {}'.format(Counter(y_res)))  # 3 dogs (0), 3 fish (1)
print(type(X_res)); print(X_res); print(y_res)
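
To make the resampling step less of a black box, here is a minimal NumPy-only sketch of what naive random over-sampling amounts to: rows of each minority class are duplicated at random until every class matches the largest one. This is an illustration under that assumption, not imbalanced-learn's actual implementation, and `naive_oversample` is a hypothetical helper name.

```python
import numpy as np
from collections import Counter

def naive_oversample(X, y, seed=0):
    """Duplicate random rows of each minority class up to the largest class count."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = [np.arange(len(y))]  # always keep every original row
    for cls, count in zip(classes, counts):
        if count < target:
            cls_idx = np.flatnonzero(y == cls)
            # Draw the missing rows with replacement from this class only
            keep.append(rng.choice(cls_idx, size=target - count, replace=True))
    idx = np.concatenate(keep)
    return X[idx], y[idx]

y = np.array([1, 1, 0, 0, 0])  # 1 = fish, 0 = dog
X = np.array([['red fish'], ['blue fish'], ['dog'], ['dog'], ['dog']])

X_res, y_res = naive_oversample(X, y)
counts = Counter(y_res.tolist())
print(counts)  # three rows of each class
print(X_res)
```

The original rows always survive; only the extra fish row is a random duplicate, so which fish gets repeated depends on the seed.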

Solution 2

My stab at a function to do what I want is below. Hope this is helpful to someone else.

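X and y are assumed to be a pandas DataFrame and Series, respectively, sharing the same index. Rows are drawn with replacement within each class, and only the resampled X is returned; the matching labels can be looked up with y.loc[X_resampled.index].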

import numpy as np
import pandas as pd

def resample(X, y, sample_type=None, sample_size=None, class_weights=None, seed=None):

    # sample_type 'min' or 'max': sample_size is a float multiplier applied to
    # the smallest or largest class count.
    # sample_type 'abs' or not set: sample_size is an absolute per-class row count (int).
    if sample_type == 'min':
        sample_size_ = int(round(sample_size * y.value_counts().min()))
    elif sample_type == 'max':
        sample_size_ = int(round(sample_size * y.value_counts().max()))
    else:
        sample_size_ = max(int(sample_size), 1)

    if seed is not None:
        np.random.seed(seed)

    if class_weights is None:
        class_weights = dict()

    parts = []

    for yi in y.unique():
        # Classes without an explicit weight default to 1.0
        size = int(round(sample_size_ * class_weights.get(yi, 1.)))

        X_yi = X[y == yi]
        # Sample index labels with replacement, so a class can be over- or under-sampled
        sample_index = np.random.choice(X_yi.index, size=size)
        parts.append(X_yi.loc[sample_index])

    # Concatenate once at the end; DataFrame.append is not available in pandas 2.x
    return pd.concat(parts)
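
A few calls with different parameters may help show how the options interact. The sketch below repeats the function in condensed form so it runs on its own; it assumes pandas and NumPy, uses a local `RandomState` instead of the global seed, and the 75/25 toy data is made up for illustration.

```python
import numpy as np
import pandas as pd

def resample(X, y, sample_type=None, sample_size=None, class_weights=None, seed=None):
    # Condensed copy of the function above, using pd.concat to collect the parts
    counts = y.value_counts()
    if sample_type == 'min':
        base = int(round(sample_size * counts.min()))
    elif sample_type == 'max':
        base = int(round(sample_size * counts.max()))
    else:
        base = max(int(sample_size), 1)
    rng = np.random.RandomState(seed)
    weights = class_weights or {}
    parts = []
    for yi in y.unique():
        size = int(round(base * weights.get(yi, 1.0)))
        X_yi = X[y == yi]
        parts.append(X_yi.loc[rng.choice(X_yi.index, size=size)])
    return pd.concat(parts)

# Toy data: 75% men, 25% women
X = pd.DataFrame({'sex': ['m'] * 75 + ['f'] * 25, 'age': np.arange(100)})
y = X['sex']

# 1) Downsample every class to the size of the smallest one: 25 / 25
balanced_small = resample(X, y, sample_type='min', sample_size=1.0, seed=0)

# 2) Upsample every class to the size of the largest one: 75 / 75
balanced_large = resample(X, y, sample_type='max', sample_size=1.0, seed=0)

# 3) Absolute size with class weights: 100 * 0.6 men and 100 * 0.4 women
skewed = resample(X, y, sample_type='abs', sample_size=100,
                  class_weights={'m': 0.6, 'f': 0.4}, seed=0)

for name, frame in [('min', balanced_small), ('max', balanced_large), ('abs', skewed)]:
    print(name, frame['sex'].value_counts().to_dict())
```

Because sampling is with replacement, even the downsampled frame can contain repeated rows; the class counts are exact, the row selection is random.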

Solution 3

Stratified sampling means that the class distribution is preserved. If that is what you are looking for, you can still use StratifiedKFold and StratifiedShuffleSplit, as long as you have a categorical variable whose distribution you want to keep the same in each fold. Just pass that variable in place of the target variable. For example, if the categorical variable is in column i,

from sklearn.model_selection import StratifiedKFold
folds = StratifiedKFold(n_splits=3).split(X, X[:, i])
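
As a quick check that stratifying on a feature column keeps its distribution in every fold, here is a small sketch. It assumes a scikit-learn version that provides `sklearn.model_selection`, and the 75/25 toy array is invented for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: column 0 is the categorical feature, 75% 'm' and 25% 'f'
X = np.array([['m', 'a']] * 75 + [['f', 'b']] * 25)
i = 0

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_shares = []
# Pass the feature column where the target would normally go
for train_idx, test_idx in skf.split(X, X[:, i]):
    share_m = np.mean(X[test_idx, i] == 'm')
    fold_shares.append(share_m)
    print('share of m in test fold: {:.2f}'.format(share_m))
```

Each test fold holds 15 'm' rows and 5 'f' rows, so the 75/25 split of the full data carries over to every fold. Note that this preserves the existing distribution; it does not rebalance it.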

However, if I understand you correctly, you want to resample to a certain target distribution (e.g. 50/50) of one of the categorical features. I think you would have to write your own method for that: split the dataset by the variable's value, then draw the same number of random samples from each split. If your main motivation is to balance the training set for a classifier, one trick is to adjust the sample weights instead. You can set the weights so that they balance the training set according to the desired variable:

from sklearn import svm
from sklearn.utils.class_weight import compute_sample_weight
# 'balanced' weights each row inversely to the frequency of its value in column i
sample_weights = compute_sample_weight('balanced', X[:, i])
clf = svm.SVC()
clf.fit(X, y, sample_weight=sample_weights)

For a non-uniform target distribution, you would have to adjust the sample_weights accordingly.
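
One way to derive such weights, sketched here with pandas only: give each row the ratio of the desired share to the observed share of its category. The helper name `target_distribution_weights` and the 40/60 target are hypothetical choices for illustration, and the data is made up.

```python
import pandas as pd

def target_distribution_weights(values, target):
    """Per-row weights so the weighted share of each category matches `target`."""
    values = pd.Series(values)
    observed = values.value_counts(normalize=True)
    # weight = desired share / observed share for the row's category
    ratio = pd.Series(target) / observed
    return values.map(ratio).to_numpy()

sex = pd.Series(['m'] * 75 + ['f'] * 25)  # observed: 75% men, 25% women
weights = target_distribution_weights(sex, {'m': 0.4, 'f': 0.6})

# Verify: weighted share of each category after reweighting
weighted_share = pd.Series(weights).groupby(sex.to_numpy()).sum() / weights.sum()
shares = weighted_share.to_dict()
print(shares)
```

The resulting array can be passed as sample_weight to estimators whose fit method accepts it. Categories missing from the target mapping would get NaN weights in this sketch, so the mapping should cover every observed value.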


Author: user1507844

Updated on September 27, 2022

Comments

  • user1507844
    user1507844 over 1 year

    Is there a built-in function in either pandas or scikit-learn for resampling according to a specified strategy? I want to resample my data based on a categorical variable.

    For example, my data has 75% men and 25% women, but I'd like to train my model on 50% men and 50% women. (I'd also like to be able to generalize to cases that aren't 50/50.)

    What I need is something that resamples my data according to specified proportions.

  • user-asterix
    user-asterix over 3 years
    Can you please provide a few examples with various parameters for illustration?