sklearn train_test_split - ValueError: Found input variables with inconsistent numbers of samples

python tensorflow keras scikit-learn

20,922

Solution 1

As you stated, the original shape of labels is (83292, 5), and after applying MultiLabelBinarizer it became (5, 18).

train_test_split(X, y) expects X and y to have the same number of rows: each of the 83292 data points in X needs its corresponding label row in y. In your case, the shapes of X and y should therefore be (83292, 15) and (83292, k), where k is the number of distinct label values the binarizer finds.

Try this: your MultiLabelBinarizer output has the wrong dimensions. Passing a DataFrame directly makes the binarizer iterate over its column names rather than its rows. So, if labels is a DataFrame, call mlb.fit_transform(labels.values.tolist()) instead. This produces one output row per input row, here 83292.

Your labels should be in a format like the one below. The y input can be a list of lists, or a DataFrame with a single column whose entries are lists of values, in which case dataframe.shape should be (no_of_rows, 1). Either way, make sure X and y have the same number of rows. A multi-label multi-class y can be represented like this:

[[1, 1, 25, 0, 0],
 [1, 1, 25, 0, 0],
 [1, 1, 25, 0, 0],
 [1, 1, 25, 0, 0],
 [1, 1, 25, 0, 0],
 [3, 5, 50, 0, 0],
 [3, 5, 50, 0, 0],
 [3, 5, 50, 0, 0],
 [3, 5, 50, 0, 0],
 [3, 5, 50, 0, 0]]
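
A minimal sketch of the difference, assuming a small made-up DataFrame with the same five column names as in the question (the values and the four-row size are illustrative only). Passing the DataFrame itself treats each column name as one sample and each character in it as a label, while passing the rows as a list of lists keeps one output row per input row:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical stand-in for the real labels DataFrame (4 rows instead of 83292).
labels = pd.DataFrame({
    "Movement": [1, 1, 3, 3],
    "Distance": [1, 1, 5, 5],
    "Speed": [25, 25, 50, 50],
    "Delay": [0, 0, 0, 0],
    "Loss": [0, 0, 0, 0],
})

# Iterating a DataFrame yields its column names, so the binarizer sees
# 5 "samples" (the names) whose "labels" are the characters in each name.
wrong = MultiLabelBinarizer().fit_transform(labels)
print(wrong.shape)

# Converting to a list of rows gives one sample per original row.
mlb = MultiLabelBinarizer()
right = mlb.fit_transform(labels.values.tolist())
print(right.shape)
print(mlb.classes_)
```

With these column names the first call sees 18 distinct characters, which would explain the (5, 18) shape reported in the question.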

Solution 2

This means that the lengths of the arrays you're trying to split don't match. sklearn takes the same indices from X and y, typically a random sample of 80% of the indices for the training set. So, the lengths have to match.

Imagine it's trying to keep these indices. What would sklearn do when there's nothing at some index?

 0 1 0 0 1 0 1 0 0 1 0 1 0 1
 a b b a b a b a a b b b 
 ^   ^     ^ ^   ^   ^   ^ ^

Do this check to verify that the lengths match. Does this return True?

len(dataset) == len(labels)
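
As a rough illustration of that check, here is a sketch using made-up zero arrays (the names X, y_ok and y_bad and the 10-row size are assumptions for the example, not the asker's data). It shows the split succeeding when the lengths agree and raising the same ValueError when they do not:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.zeros((10, 15))      # 10 samples, 15 features
y_ok = np.zeros((10, 6))    # one label row per sample
y_bad = np.zeros((5, 18))   # wrong number of rows

# Lengths match, so the split works and rows stay aligned.
assert len(X) == len(y_ok)
X_train, X_test, y_train, y_test = train_test_split(
    X, y_ok, test_size=0.2, random_state=0
)
print(X_train.shape, y_train.shape)

# Lengths differ, so sklearn refuses to split.
try:
    train_test_split(X, y_bad, test_size=0.2)
except ValueError as err:
    message = str(err)
    print(message)
```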

Author by

rshah

Updated on July 09, 2022

Comments

rshah almost 2 years

I have a multi-label classification problem, for which I looked online and saw that for one-hot encoding the labels it is best to use the MultiLabelBinarizer.

I use this for my labels (which I separate from the dataset itself) as follows:

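from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

ohe = MultiLabelBinarizer()
labels = ohe.fit_transform(labels)
train, test, train_labels, test_labels = train_test_split(dataset, labels, test_size=0.2)  # 80% train split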

But it throws the following error:

Traceback (most recent call last): 
  File "learn.py", line 114, in <module> 
    train, test, train_labels, test_labels = train_test_split(dataset, labels, test_size=0.2) #80% train split
  File "C:\Users\xwb18152\AppData\Roaming\Python\Python38\site-packages\sklearn\model_selection\_split.py", line 2127, in train_test_split
    arrays = indexable(*arrays)
  File "C:\Users\xwb18152\AppData\Roaming\Python\Python38\site-packages\sklearn\utils\validation.py", line 293, in indexable
    check_consistent_length(*result)
  File "C:\Users\xwb18152\AppData\Roaming\Python\Python38\site-packages\sklearn\utils\validation.py", line 256, in check_consistent_length
    raise ValueError("Found input variables with inconsistent numbers of"
ValueError: Found input variables with inconsistent numbers of samples: [83292, 5]

EDIT: The labels dataset looks as follows (ignore the Interval column; it shouldn't be there and is not actually counted in the rows, not sure why):

          Movement  Distance  Speed  Delay  Loss 
Interval
0                1         1     25      0     0
2                1         1     25      0     0
4                1         1     25      0     0
6                1         1     25      0     0
8                1         1     25      0     0
...            ...       ...    ...    ...   ...
260              3         5     50      0     0
262              3         5     50      0     0
264              3         5     50      0     0
266              3         5     50      0     0
268              3         5     50      0     0

From this we can see that it is a multi-label multi-class classification problem. The shape of the dataset and labels before and after the Binarizer are as follows:

             Before             After
dataset      (83292, 15)        (83292, 15)
labels       (83292, 5)         (5, 18)

rshah almost 4 years

Should I perform this comparison before or after the one-hot encoding using MultiLabelBinarizer?
Nicolas Gervais almost 4 years

Before or after doesn't matter, because this doesn't change the length of the data.
Nicolas Gervais almost 4 years

According to the MultiLabelBinarizer documentation, a common mistake is to pass a list. Maybe this is where you went wrong?
rshah almost 4 years

Doing len(dataset) == len(labels) before using MultiLabelBinarizer returns True, but it returns False afterwards, with the lengths being 83292 and 5 for dataset and labels, respectively.
Nicolas Gervais almost 4 years

According to the docs, the frequent mistake can be solved by doing fit_transform([labels])
rshah almost 4 years

I tried the solution you proposed, Nicolas, but it doesn't work. The length of labels changes to 1 after doing this. The datatype of labels is pandas.core.frame.DataFrame before, and after the MultiLabelBinarizer it becomes numpy.ndarray.
rshah almost 4 years

Also, should I maybe be applying the MultiLabelBinarizer after doing the train_test_split? And would using scikit-multilearn's iterative_train_test_split be advisable?
Nicolas Gervais almost 4 years

I think you need to select one column only, not an entire pandas DataFrame.
rshah almost 4 years

Thanks for the solution! However, the labels dataset goes from having shape (83292, 5) to (83292, 14). Any idea why?
Narendra Prasath almost 4 years

@rshah is this solution working? The output is a binary indicator column for every label. There are 14 unique classes, which is why the second dimension is 14.
rshah almost 4 years

It passes the binarizer without error, but I have 5 classes, so I am wondering why the labels and dataset now have the same shape (after binarizing the labels) of (83292, 14).
Narendra Prasath almost 4 years

@rshah can you check the total number of unique labels?
rshah almost 4 years

I checked what the output of ohe.classes_ was, and it is not what I expected: [ 0 1 2 3 4 5 6 7 10 25 50 100 150 200]
Narendra Prasath almost 4 years

@rshah These are the classes available in your y labels, so that's correct, I guess. The result probably just doesn't match what you expected. But the issue stated in your OP has been fixed now, I guess.
rshah almost 4 years

Thanks. Maybe I have to work more on pre-processing, perhaps changing Movement to have values like 1, 2, 3, ... and splitting it into more columns like Movement_1, Movement_2, etc., so it's easier for this.
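
One possible way to get per-column indicator columns such as Movement_1 and Movement_3, sketched here with pandas.get_dummies rather than MultiLabelBinarizer. This is an illustration on a made-up four-row DataFrame, not something the thread itself settled on. Because each column is encoded separately, a value like 1 in Movement is not merged with a 1 in Distance:

```python
import pandas as pd

# Hypothetical stand-in for the real labels DataFrame.
labels = pd.DataFrame({
    "Movement": [1, 1, 3, 3],
    "Distance": [1, 1, 5, 5],
    "Speed": [25, 25, 50, 50],
    "Delay": [0, 0, 0, 0],
    "Loss": [0, 0, 0, 0],
})

# Encode every column on its own; new columns are named <column>_<value>.
encoded = pd.get_dummies(labels, columns=list(labels.columns), dtype=int)
print(encoded.columns.tolist())
print(encoded.shape)
```

The number of rows is unchanged, so the result can be passed to train_test_split alongside the dataset.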