sklearn train_test_split - ValueError: Found input variables with inconsistent numbers of samples

Solution 1

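As you stated, the original shape of labels is (83292, 5), and after applying MultiLabelBinarizer it became (5, 18). That happens because iterating over a DataFrame yields its column names, so the binarizer treated the 5 column names as 5 samples and the 18 distinct characters in those names as the classes.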

train_test_split(X, y) expects X and y to have the same number of rows: each of the 83292 data points in X needs its label at the same position in y. In your case, X should have shape (83292, 15) and y should have shape (83292, n_classes), where n_classes is the number of distinct label values.

Try this: your MultiLabelBinarizer output has the wrong dimensions. If labels is a DataFrame, apply mlb.fit_transform(labels.values.tolist()) instead. This produces one output row per input row, 83292 here.

Your labels should be in a format like the one below.

Your y input can be a list of lists, or a DataFrame with a single column whose cells hold lists of values (so dataframe.shape is (no_of_rows, 1)). Make sure X and y have the same number of rows. A multi-label, multi-class y can be represented like this:

[[1, 1, 25, 0, 0],
 [1, 1, 25, 0, 0],
 [1, 1, 25, 0, 0],
 [1, 1, 25, 0, 0],
 [1, 1, 25, 0, 0],
 [3, 5, 50, 0, 0],
 [3, 5, 50, 0, 0],
 [3, 5, 50, 0, 0],
 [3, 5, 50, 0, 0],
 [3, 5, 50, 0, 0]]
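
A minimal sketch of the fix above, using the ten example rows as labels. The column names come from the question; the feature matrix X is a zero-filled placeholder, assumed here only so the split has something to work on.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

# The ten example rows from above, as a DataFrame with the question's columns.
rows = [[1, 1, 25, 0, 0]] * 5 + [[3, 5, 50, 0, 0]] * 5
labels = pd.DataFrame(rows, columns=["Movement", "Distance", "Speed", "Delay", "Loss"])

# Placeholder features (assumption): one row per label row, 15 columns.
X = np.zeros((len(labels), 15))

# Iterating over a DataFrame yields its column names, so this is what the
# binarizer effectively sees when it is handed the DataFrame directly.
column_names = list(labels)
wrong = MultiLabelBinarizer().fit_transform(column_names)
print(wrong.shape)  # (5, 18): 5 column names, 18 distinct characters

# Passing the rows as a list of lists gives one output row per sample.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels.values.tolist())
print(y.shape)       # (10, 6)
print(mlb.classes_)  # the six distinct values: 0, 1, 3, 5, 25, 50

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, y_train.shape)  # (8, 15) (8, 6)
```

Note that the six classes pool values from all five columns: a 1 in Movement and a 1 in Distance map to the same class. The same pooling is why the full data ends up with 14 classes in the comments below.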

Solution 2

This means that the lengths of the arrays you're trying to split don't match. For X and y, sklearn selects the same indices from both, for example a random 80% of the indices of your data when test_size=0.2. So the lengths have to match.

Imagine it's trying to keep these indices. What would sklearn do when there's nothing at some index?

 0 1 0 0 1 0 1 0 0 1 0 1 0 1
 a b b a b a b a a b b b 
 ^   ^     ^ ^   ^   ^   ^ ^ 

Do this check to verify that the lengths match. Does this return True?

len(dataset) == len(labels)
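
The check can also be run right next to the split. A sketch with zero-filled placeholder arrays (assumed shapes, scaled down from the question) suggesting how matching lengths split cleanly while mismatched lengths raise the error from the title:

```python
import numpy as np
from sklearn.model_selection import train_test_split

dataset = np.zeros((10, 15))    # placeholder features
labels = np.zeros((10, 5))      # placeholder labels, same number of rows
bad_labels = np.zeros((5, 18))  # wrong number of rows, like the binarizer output

print(len(dataset) == len(labels))      # True
print(len(dataset) == len(bad_labels))  # False

# Matching lengths: the split succeeds.
train, test, train_labels, test_labels = train_test_split(
    dataset, labels, test_size=0.2, random_state=0
)

# Mismatched lengths: train_test_split raises ValueError.
try:
    train_test_split(dataset, bad_labels, test_size=0.2)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```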
Author: rshah

Updated on July 09, 2022

Comments

  • rshah
    rshah almost 2 years

    I have a multi-label classification problem, for which I looked online and saw that for one-hot encoding the labels it is best to use the MultiLabelBinarizer.

    I use this for my labels (which I separate from the dataset itself) as follows:

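    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MultiLabelBinarizer

    ohe = MultiLabelBinarizer()
    labels = ohe.fit_transform(labels)
    train, test, train_labels, test_labels = train_test_split(dataset, labels, test_size=0.2) #80% train split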
    

    But it throws the following error:

    Traceback (most recent call last): 
      File "learn.py", line 114, in <module> 
        train, test, train_labels, test_labels = train_test_split(dataset, labels, test_size=0.2) #80% train split
      File "C:\Users\xwb18152\AppData\Roaming\Python\Python38\site-packages\sklearn\model_selection\_split.py", line 2127, in train_test_split
        arrays = indexable(*arrays)
      File "C:\Users\xwb18152\AppData\Roaming\Python\Python38\site-packages\sklearn\utils\validation.py", line 293, in indexable
        check_consistent_length(*result)
      File "C:\Users\xwb18152\AppData\Roaming\Python\Python38\site-packages\sklearn\utils\validation.py", line 256, in check_consistent_length
        raise ValueError("Found input variables with inconsistent numbers of"
    ValueError: Found input variables with inconsistent numbers of samples: [83292, 5]
    

    --

    EDIT: The labels dataset looks as follows (ignore the Interval column; it shouldn't be there and is not actually counted in the shape, not sure why):

              Movement  Distance  Speed  Delay  Loss 
    Interval
    0                1         1     25      0     0
    2                1         1     25      0     0
    4                1         1     25      0     0
    6                1         1     25      0     0
    8                1         1     25      0     0
    ...            ...       ...    ...    ...   ...
    260              3         5     50      0     0
    262              3         5     50      0     0
    264              3         5     50      0     0
    266              3         5     50      0     0
    268              3         5     50      0     0
    

    From this we can see that it is a multi-label multi-class classification problem. The shape of the dataset and labels before and after the Binarizer are as follows:

                 Before             After
    dataset      (83292, 15)        (83292, 15)
    labels       (83292, 5)         (5, 18)
    
  • rshah
    rshah almost 4 years
    Will I perform this comparison before or after the one-hot encoding using MultiLabelBinarizer?
  • Nicolas Gervais
    Nicolas Gervais almost 4 years
    Before or after doesn't matter, because this doesn't change the length of the data.
  • Nicolas Gervais
    Nicolas Gervais almost 4 years
    According to the MultiLabelBinarizer documentation, a common mistake is to pass a list. Maybe this is where you went wrong?
  • rshah
    rshah almost 4 years
    Doing len(dataset) == len(labels) before using MultiLabelBinarizer returns True, but it returns False afterwards, with the lengths being 83292 and 5 for dataset and labels, respectively.
  • Nicolas Gervais
    Nicolas Gervais almost 4 years
    According to the docs, the frequent mistake can be solved by doing fit_transform([labels])
  • rshah
    rshah almost 4 years
    I tried the solution you proposed, Nicolas, but it doesn't work. The length after doing this changes to 1 for labels. The datatype of labels before is pandas.core.frame.DataFrame, and after the MultiLabelBinarizer it becomes numpy.ndarray.
  • rshah
    rshah almost 4 years
    Also, should I maybe be performing the MultiLabelBinarizer after doing the train_test_split? And would using scikit-multilearn's iterative_train_test_split be advisable?
  • Nicolas Gervais
    Nicolas Gervais almost 4 years
    I think you need to select one column only, not an entire pandas DataFrame.
  • rshah
    rshah almost 4 years
    Thanks for the solution! However, the labels dataset goes from having shape (83292, 5) to (83292, 14). Any idea why?
  • Narendra Prasath
    Narendra Prasath almost 4 years
    @rshah is this solution working? The output is a binary representation of every label. There are 14 unique classes, which is why the second dimension is 14.
  • rshah
    rshah almost 4 years
    It passes the binarizer without error. But I have 5 classes, so I am wondering why the labels and dataset now have the same shape (after binarizing the labels) of (83292, 14).
  • Narendra Prasath
    Narendra Prasath almost 4 years
    @rshah can you check the total number of unique labels?
  • rshah
    rshah almost 4 years
    I checked what the output of ohe.classes_ was, and it is not what I expected: [ 0 1 2 3 4 5 6 7 10 25 50 100 150 200]
  • Narendra Prasath
    Narendra Prasath almost 4 years
    @rshah These are the classes available in your y labels, which looks correct to me. Probably the result doesn't match your expectations. But the issue stated in your OP has been fixed now, I guess.
  • rshah
    rshah almost 4 years
    Thanks. Maybe I have to work more on pre-processing, perhaps changing Movement to have values like 1, 2, 3, ... and splitting it into more columns like Movement_1, Movement_2, etc., so it's easier for this.
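
One possible way to sketch the per-column encoding described in the last comment is pandas' get_dummies, which one-hot encodes each column separately and names the results Movement_1, Movement_3, and so on. This is an assumption about what was intended, not something confirmed in the thread; the ten rows are the example rows from Solution 1.

```python
import pandas as pd

# Example label rows from Solution 1, with the question's column names.
rows = [[1, 1, 25, 0, 0]] * 5 + [[3, 5, 50, 0, 0]] * 5
columns = ["Movement", "Distance", "Speed", "Delay", "Loss"]
labels = pd.DataFrame(rows, columns=columns)

# One indicator column per (column, value) pair, e.g. Movement_1, Speed_50.
encoded = pd.get_dummies(labels, columns=columns)
print(encoded.shape)          # (10, 8)
print(list(encoded.columns))
```

Unlike MultiLabelBinarizer on the raw rows, this keeps a 1 in Movement and a 1 in Distance as separate columns, and the number of rows still matches the dataset, so the result can be passed to train_test_split as y.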