ValueError: Found input variables with inconsistent numbers of samples
The issue is with your labels list. Internally, when stratify is provided to train_test_split, the value gets passed as the y argument to the split method of an instance of StratifiedShuffleSplit. As the documentation for the split method shows, y should be the same length as X (in this case, the arrays you wish to split). Your labels list holds only the 2 unique classes, while X has 31876 samples, which is exactly the [31876, 2] mismatch reported in the error. To fix the problem, pass stratify=y instead of stratify=labels.
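As a minimal sketch of the point above, the snippet below reproduces the error on toy data and then applies the fix. The lists X and y here are made-up stand-ins for the output of file2vector, and test_size=0.5 is chosen only so that the tiny sample can be stratified; neither comes from the question.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-ins (assumption) for the question's data: 10 samples, two classes.
X = ["doc%d" % i for i in range(10)]
y = ["I"] * 6 + ["Z"] * 4

labels = list(set(y))  # only the unique classes: length 2, not 10

# Passing the unique labels reproduces the error from the question,
# because stratify must have one entry per sample.
try:
    train_test_split(X, y, test_size=0.5, random_state=123, stratify=labels)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Passing the full per-sample label list works and keeps the class ratio
# the same in both halves of the split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=123, stratify=y
)
train_counts = Counter(y_train)
test_counts = Counter(y_test)
```

With the real data, the only change needed in the question's call should be replacing stratify=labels with stratify=y.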
Rodrigo Laguna about 2 years
There are tons of examples of this error in which the problem is related to the dimensions of the array or to how a DataFrame is read. However, I'm using plain Python lists for both X and y.
I'm trying to split my data into train and test sets with train_test_split. My code is this:
    X, y = file2vector(corpus_dir)
    assert len(X) == len(y)  # both lists same length
    print(type(X))
    print(type(y))
    seed = 123
    labels = list(set(y))
    print(len(labels))
    print(labels)
    cont = {}
    for l in y:
        if not l in cont:
            cont[l] = 1
        else:
            cont[l] += 1
    print(cont)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=seed, stratify=labels)
Output is:
    <class 'list'>  # type(X)
    <class 'list'>  # type(y)
    2  # len(labels)
    ['I', 'Z']  # labels
    {'I': 18867, 'Z': 13009}  # cont
X and y are just Python lists of Python strings that I read from a file with file2vector. I'm running on Python 3, and the backtrace is the following:
    Traceback (most recent call last):
      File "/home/rodrigo/idatha/no_version/imm/classifier.py", line 28, in <module>
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=seed, stratify=labels)
      File "/home/rodrigo/idatha/no_version/imm/.env/lib/python3.5/site-packages/sklearn/model_selection/_split.py", line 2056, in train_test_split
        train, test = next(cv.split(X=arrays[0], y=stratify))
      File "/home/rodrigo/idatha/no_version/imm/.env/lib/python3.5/site-packages/sklearn/model_selection/_split.py", line 1203, in split
        X, y, groups = indexable(X, y, groups)
      File "/home/rodrigo/idatha/no_version/imm/.env/lib/python3.5/site-packages/sklearn/utils/validation.py", line 229, in indexable
        check_consistent_length(*result)
      File "/home/rodrigo/idatha/no_version/imm/.env/lib/python3.5/site-packages/sklearn/utils/validation.py", line 204, in check_consistent_length
        " samples: %r" % [int(l) for l in lengths])
    ValueError: Found input variables with inconsistent numbers of samples: [31876, 2]