How to perform undersampling in scikit-learn?

python python-2.7 dataset scikit-learn sampling

22,863

Solution 1

I would do this with a pandas DataFrame and numpy.random.choice, which makes it easy to randomly sample equally sized datasets. An example:

import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.randn(7, 4))
data['Healthy'] = [1, 1, 0, 0, 1, 1, 1]

This data has two non-healthy and five healthy samples. To randomly pick two samples from the healthy population you do:

healthy_indices = data[data.Healthy == 1].index
random_indices = np.random.choice(healthy_indices, 2, replace=False)
healthy_sample = data.loc[random_indices]

To automatically pick a subsample of the same size as the non-healthy group you can do:

sample_size = sum(data.Healthy == 0)  # Equivalent to len(data[data.Healthy == 0])
random_indices = np.random.choice(healthy_indices, sample_size, replace=False)
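Putting the steps above together, a minimal sketch of the whole procedure might look like the following. The fixed seed and the names `non_healthy_indices` and `balanced` are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Fixed seed so the example is repeatable (an assumption for illustration).
rng = np.random.RandomState(0)

data = pd.DataFrame(rng.randn(7, 4))
data['Healthy'] = [1, 1, 0, 0, 1, 1, 1]

# Index labels of the majority (healthy) and minority (non-healthy) rows.
healthy_indices = data[data.Healthy == 1].index
non_healthy_indices = data[data.Healthy == 0].index

# Draw as many healthy rows as there are non-healthy rows, without replacement.
sample_size = len(non_healthy_indices)
random_indices = rng.choice(healthy_indices, sample_size, replace=False)

# Combine the sampled healthy rows with all non-healthy rows.
balanced = pd.concat([data.loc[random_indices], data.loc[non_healthy_indices]])

print(balanced.Healthy.value_counts())
```

Here `balanced` holds two healthy and two non-healthy rows; which healthy rows are chosen depends on the seed.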

Solution 2

As a variant, you can use a stochastic method. Assume you have a dataset data consisting of a large number of tuples (X, Y), where Y is the diseased-eye label (0 or 1). You can write a wrapper for the dataset that passes through every non-diseased eye and passes each diseased eye with probability 0.3 / 0.7. Diseased eyes make up 70% of the data and non-diseased eyes 30%, so keeping about 3/7 (roughly 43%) of the diseased eyes leaves the two classes approximately equal in size.

from random import random


def wrapper(data):
    prob = 0.3 / 0.7

    for X, Y in data:
        if Y == 0:
            yield X, Y
        else:
            if random() < prob:
                yield X, Y


# now you can use the wrapper to extract needed information
for X, Y in wrapper(your_dataset):
    print(X, Y)

Be careful: if you need to run this wrapper several times and want identical results each time, set a fixed random seed before the function random() is used. More about it: https://docs.python.org/2/library/random.html
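One way to get repeatable results is to give the generator its own seeded random number generator. The sketch below is only an illustration: the name `seeded_wrapper`, the `seed` parameter, and the toy dataset are assumptions and not part of the original answer.

```python
from random import Random


def seeded_wrapper(data, prob=0.3 / 0.7, seed=42):
    # A private Random instance keeps the draws independent of any other
    # code that uses the module-level random() function.
    rng = Random(seed)
    for X, Y in data:
        # Keep every non-diseased sample; keep diseased ones with probability prob.
        if Y == 0 or rng.random() < prob:
            yield X, Y


# Toy stand-in for the real dataset: 7 diseased (1) and 3 non-diseased (0) samples.
toy_dataset = [(i, 1) for i in range(7)] + [(i, 0) for i in range(7, 10)]

first_pass = list(seeded_wrapper(toy_dataset))
second_pass = list(seeded_wrapper(toy_dataset))
print(first_pass == second_pass)  # True: the same seed gives the same selection
```

Because the seed is fixed inside each call, every pass over the same dataset yields the same subset.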

Solution 3

You can use np.random.choice for naive undersampling as suggested previously, but one issue is that some of your random samples may be very similar and thus misrepresent the dataset.

A better option is to use the imbalanced-learn package, which has multiple options for balancing a dataset. A good tutorial and description of these can be found here.

The package lists a few good options for undersampling (from their GitHub):

Random majority under-sampling with replacement

Extraction of majority-minority Tomek links

Under-sampling with Cluster Centroids

NearMiss-(1 & 2 & 3)

Condensed Nearest Neighbour

One-Sided Selection

Neighbourhood Cleaning Rule

Edited Nearest Neighbours

Instance Hardness Threshold

Repeated Edited Nearest Neighbours

AllKNN

Author by

Gaurav Patil

Updated on March 04, 2020

Comments

Gaurav Patil about 4 years

We have a retinal dataset in which diseased-eye samples make up 70 percent of the data and non-diseased-eye samples the remaining 30 percent. We want a dataset in which the diseased and non-diseased samples are equal in number. Is there a function available that can do this?

Wboy almost 6 years

Please do correct me if I'm wrong, but to pick a subsample of the same size as the non-healthy group after picking the healthy group, wouldn't it be:

not_healthy = df[df.Healthy == 0].index
random_indices = np.random.choice(not_healthy, sum(data['healthy']), replace=False)
renew_sample = data.loc[random_indices]