sklearn train_test_split - ValueError: Found input variables with inconsistent numbers of samples
Solution 1
As you stated, the original shape of labels is (83292, 5), and once you applied MultiLabelBinarizer it became (5, 18).
The train_test_split(X, y) function expects X and y to have the same number of rows. For example, if X contains 83292 data points, the label for each of those 83292 data points must be available in y.
Hence, in your case the shapes of X and y should be (83292, 15) and (83292, n_classes), where n_classes is the number of distinct label values the binarizer finds.
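A minimal sketch of this row-count requirement, using small zero-filled arrays as stand-ins for the real data (the array names and sizes are assumptions made for illustration only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 10 samples instead of 83292 (sizes are illustrative).
X = np.zeros((10, 15))
y_good = np.zeros((10, 18))   # one row of labels per sample
y_bad = np.zeros((5, 18))     # row count does not match X

# Matching row counts: the split succeeds.
X_train, X_test, y_train, y_test = train_test_split(
    X, y_good, test_size=0.2, random_state=0
)
split_shapes = (X_train.shape, X_test.shape, y_train.shape, y_test.shape)

# Mismatched row counts: train_test_split raises the ValueError from the question.
try:
    train_test_split(X, y_bad, test_size=0.2)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(split_shapes)
print(error_message)
```

Only the first dimension has to agree; the number of columns in X and y can differ freely.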
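Try this:
Your MultiLabelBinarizer output has the wrong dimensions. Passing the DataFrame directly makes the binarizer iterate over its 5 column names rather than its rows, which is why the output has 5 rows. So, if your labels object is a DataFrame, you should apply mlb.fit_transform(labels.values.tolist()). This produces the same number of rows as the input, here 83292.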
Your labels should be in a format like the example below. Your y input can be a list of lists, or a DataFrame with a single column whose cells each hold a list of values, so that dataframe.shape is (no_of_rows, 1). Make sure X and y have the same number of rows. A multi-label, multi-class y variable can be represented like this:
[[1, 1, 25, 0, 0],
[1, 1, 25, 0, 0],
[1, 1, 25, 0, 0],
[1, 1, 25, 0, 0],
[1, 1, 25, 0, 0],
[3, 5, 50, 0, 0],
[3, 5, 50, 0, 0],
[3, 5, 50, 0, 0],
[3, 5, 50, 0, 0],
[3, 5, 50, 0, 0]]
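As a hedged illustration, the ten example rows above can be run through MultiLabelBinarizer both ways. The column names are taken from the question; the small DataFrame itself is a stand-in for the real 83292-row labels. Iterating over a DataFrame yields its column names, and each name is then treated as a collection of characters, which appears to be where the (5, 18) shape comes from:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Stand-in for the real labels: the ten example rows shown above.
rows = [[1, 1, 25, 0, 0]] * 5 + [[3, 5, 50, 0, 0]] * 5
labels = pd.DataFrame(
    rows, columns=["Movement", "Distance", "Speed", "Delay", "Loss"]
)

# Passing the DataFrame itself: iteration yields the 5 column names, and
# each name is split into characters (18 distinct characters in total).
wrong = MultiLabelBinarizer().fit_transform(labels)
print(wrong.shape)   # (5, 18)

# Passing a list of rows: one output row per sample.
mlb = MultiLabelBinarizer()
right = mlb.fit_transform(labels.values.tolist())
print(right.shape)   # (10, 6)
print(mlb.classes_)  # the six distinct values: 0, 1, 3, 5, 25, 50
```

Note that the second call pools the values of all five columns into a single set of classes, so the number of output columns is the number of distinct values overall, not the number of label columns.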
Solution 2
This means that the lengths of the elements you're trying to split don't match. For X and y, sklearn will take the same indices: a random sample of the indices of your data (80% of them with test_size=0.2). So, the lengths have to match.
Imagine it's trying to keep these indices. What would sklearn do when there's nothing at some index?
0 1 0 0 1 0 1 0 0 1 0 1 0 1
a b b a b a b a a b b b
^ ^ ^ ^ ^ ^ ^ ^
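The picture above can be sketched in NumPy. This is only an illustration of the idea, not how sklearn is implemented (sklearn checks the lengths up front and raises the ValueError instead). The chosen indices are hypothetical:

```python
import numpy as np

X = np.array([0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1])  # 14 samples
y = np.array(list("abbababaabbb"))                         # only 12 labels

# Hypothetical indices picked for the training set; the same indices
# are applied to both X and y.
train_idx = np.array([0, 2, 5, 7, 9, 13])

X_train = X[train_idx]  # works: X has an entry at index 13

try:
    y_train = y[train_idx]  # fails: y has no entry at index 13
    lookup_failed = False
except IndexError:
    lookup_failed = True

print(X_train, lookup_failed)
```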
Run this check to verify that the lengths match. Does it return True?
len(dataset) == len(labels)
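One possible way to make this check harder to forget is a small guard called right before the split. The helper name assert_same_length is hypothetical, not part of sklearn:

```python
def assert_same_length(**named_arrays):
    """Raise ValueError if the given objects differ in len(); return the lengths."""
    lengths = {name: len(arr) for name, arr in named_arrays.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Length mismatch: {lengths}")
    return lengths


# Matching lengths pass silently and report the sizes.
ok = assert_same_length(dataset=[[0.0]] * 4, labels=[[1], [2], [3], [4]])
print(ok)

# Mismatched lengths are reported by name.
try:
    assert_same_length(dataset=[[0.0]] * 4, labels=[[1], [2]])
    mismatch_message = None
except ValueError as exc:
    mismatch_message = str(exc)
print(mismatch_message)
```

Calling it after each preprocessing step shows which step changed the number of rows.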
rshah
Updated on July 09, 2022
Comments
-
rshah almost 2 years
I have a multi-label classification problem. I looked online and saw that, for one-hot encoding the labels, it is best to use the MultiLabelBinarizer. I use this for my labels (which I separate from the dataset itself) as follows:
ohe = MultiLabelBinarizer()
labels = ohe.fit_transform(labels)
train, test, train_labels, test_labels = train_test_split(dataset, labels, test_size=0.2)  # 80% train split
But it throws the following error:
Traceback (most recent call last):
  File "learn.py", line 114, in <module>
    train, test, train_labels, test_labels = train_test_split(dataset, labels, test_size=0.2) #80% train split
  File "C:\Users\xwb18152\AppData\Roaming\Python\Python38\site-packages\sklearn\model_selection\_split.py", line 2127, in train_test_split
    arrays = indexable(*arrays)
  File "C:\Users\xwb18152\AppData\Roaming\Python\Python38\site-packages\sklearn\utils\validation.py", line 293, in indexable
    check_consistent_length(*result)
  File "C:\Users\xwb18152\AppData\Roaming\Python\Python38\site-packages\sklearn\utils\validation.py", line 256, in check_consistent_length
    raise ValueError("Found input variables with inconsistent numbers of"
ValueError: Found input variables with inconsistent numbers of samples: [83292, 5]
--
EDIT: The labels dataset looks as follows (ignore the Interval column; this shouldn't be there and is not actually counted in the rows, not sure why):
          Movement  Distance  Speed  Delay  Loss
Interval
0                1         1     25      0     0
2                1         1     25      0     0
4                1         1     25      0     0
6                1         1     25      0     0
8                1         1     25      0     0
...            ...       ...    ...    ...   ...
260              3         5     50      0     0
262              3         5     50      0     0
264              3         5     50      0     0
266              3         5     50      0     0
268              3         5     50      0     0
From this we can see that it is a multi-label multi-class classification problem. The shapes of the dataset and labels before and after the binarizer are as follows:
          Before        After
dataset   (83292, 15)   (83292, 15)
labels    (83292, 5)    (5, 18)
-
rshah almost 4 years
Will I perform this comparison before or after the one-hot encoding using MultiLabelBinarizer?
-
Nicolas Gervais almost 4 years
Before or after doesn't matter, because this doesn't change the length of the data.
-
Nicolas Gervais almost 4 years
According to the MultiLabelBinarizer documentation, a common mistake is to pass a list. Maybe this is where you went wrong?
-
rshah almost 4 years
Doing len(dataset) == len(labels) before using MultiLabelBinarizer returns True, but it returns False afterwards, with the lengths being 83292 and 5 for dataset and labels, respectively.
-
Nicolas Gervais almost 4 years
According to the docs, the frequent mistake can be solved by doing fit_transform([labels])
-
rshah almost 4 years
I tried the solution you proposed, Nicolas, but it doesn't work. The length of labels changes to 1 after doing this. The datatype of labels before is pandas.core.frame.DataFrame, and after the MultiLabelBinarizer it becomes numpy.ndarray.
-
rshah almost 4 years
Also, should I maybe be performing the MultiLabelBinarizer after doing the train_test_split? And would using the scikit-multilearn iterative_train_test_split be advisable?
-
Nicolas Gervais almost 4 years
I think you need to select one column only, not an entire pandas DataFrame.
-
rshah almost 4 years
Thanks for the solution! However, the labels dataset goes from having shape (83292, 5) to (83292, 14). Any idea why?
-
Narendra Prasath almost 4 years
@rshah does this solution work? The output is a binary representation of every label. There are 14 unique classes available, which is why the shape is 14.
rshah almost 4 years
It passes the binarizer without error. However, I have 5 classes, so I am wondering why the labels and dataset now have the same shape (after binarizing the labels) of (83292, 14).
-
Narendra Prasath almost 4 years
@rshah can you check the total number of unique labels?
-
rshah almost 4 years
I checked what the output of ohe.classes_ was, and it is not what I expected:
[ 0 1 2 3 4 5 6 7 10 25 50 100 150 200]
-
Narendra Prasath almost 4 years
@rshah These are the classes available in your y labels, so that's correct, I guess. The result probably just doesn't match your expectations. In any case, the issue stated in your OP has been fixed now, I guess.
-
rshah almost 4 years
Thanks. Maybe I have to work more on pre-processing, perhaps changing Movement to have values like 1, 2, 3, ... and splitting it into more columns like Movement_1, Movement_2, etc., so it's easier for this.
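One possible way to get per-column indicator columns such as Movement_1 is pandas.get_dummies. This is a swapped-in technique, not something proposed in the thread, and the small DataFrame below is a stand-in built from the example rows in the question. Unlike MultiLabelBinarizer, it encodes each column separately instead of pooling all values into one set of classes:

```python
import pandas as pd

# Stand-in for the real labels, using the example rows from the question.
rows = [[1, 1, 25, 0, 0]] * 5 + [[3, 5, 50, 0, 0]] * 5
labels = pd.DataFrame(
    rows, columns=["Movement", "Distance", "Speed", "Delay", "Loss"]
)

# One indicator column per (column, value) pair, e.g. Movement_1, Movement_3.
encoded = pd.get_dummies(labels, columns=list(labels.columns))

print(encoded.shape)          # (10, 8)
print(list(encoded.columns))
```

The row count is unchanged, so the encoded frame can be passed to train_test_split together with the dataset.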