"TypeError: Singleton array cannot be considered a valid collection" using sklearn train_test_split

19,121

A lesser-known fact is that train_test_split can split any number of arrays, not just two (such as X and y); every array passed in comes back as a "train" part and a "test" part. See the linked docs and the source code for more info.

For example,

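import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

np.random.seed(0)
df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
y = df1.pop('C')
z = df1.pop('D')
X = df1

# Three input arrays yield six outputs: a train and a test part for each.
splits = train_test_split(X, y, z, test_size=0.2)
len(splits)
# 6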
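
The six values come back in train/test pairs, in the same order as the inputs, so they can be unpacked directly. A minimal sketch, assuming small made-up arrays (the names z_train and z_test and the choice of random_state=0 are illustrative, not from the example above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 5 samples, 2 features, plus two extra per-sample arrays.
X = np.arange(10).reshape(5, 2)
y = np.arange(5)
z = np.arange(5) * 10

# Each input contributes a (train, test) pair, in input order.
X_train, X_test, y_train, y_test, z_train, z_test = train_test_split(
    X, y, z, test_size=0.2, random_state=0
)

# All arrays are split on the same rows, so the pairs stay aligned.
print(X_train.shape, X_test.shape)  # (4, 2) (1, 2)
print(len(y_train), len(z_test))    # 4 1
```

Because every array is indexed with the same shuffled row indices, y_train[i] and z_train[i] still belong to row X_train[i].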

In other words, the only way to specify the test size is through the keyword argument test_size. All positional arguments are assumed to be collections that are to be split. In your case, you call

train_test_split(X, y, 0.2)

The function therefore tries to split 0.2 as well, and since a float is not a collection, the error is raised. The solution, as mentioned, is to pass the test size as a keyword argument:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
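
To see where the wording of the error comes from, it helps to look at what NumPy makes of a bare float. A rough sketch, assuming sklearn converts non-indexable inputs with NumPy before checking their length (this is an assumption about the internals and may differ between versions; the arrays are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# A bare float becomes a zero-dimensional ("singleton") array: no axes, no length.
scalar = np.asarray(0.2)
print(scalar.ndim, scalar.shape)  # 0 ()

X = np.arange(10).reshape(5, 2)
y = np.arange(5)

# Passing 0.2 positionally makes it a third "array" to split, which fails.
try:
    train_test_split(X, y, 0.2)
    error = None
except TypeError as exc:
    error = exc
print(type(error).__name__)

# Passing it by keyword works and returns four arrays.
parts = train_test_split(X, y, test_size=0.2)
print(len(parts))  # 4
```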
Author by

Admin

Updated on June 17, 2022

Comments

  • Admin
    Admin almost 2 years

    TypeError: Singleton array array(0.2) cannot be considered a valid collection.

    X = df.iloc[:, [1,7]].values
    y= df.iloc[:,-1].values
    from sklearn.model_selection import train_test_split 
    X_train, X_test, y_train, y_test = train_test_split(X, y, 0.2)
    

    I am getting this error when calling train_test_split. I am able to train my model with the X and y values. However, I would like to split my dataframe first and then train and test on the separate parts.

    Any help is appreciated.

  • Admin
    Admin over 5 years
    Thanks. That works. I have been stuck here for the past two hours. I need one more clarification: what will happen if my dataset has only a single row?
  • cs95
    cs95 over 5 years
    @JohnSamuel One split will be empty, and the other split will have the same, single row. Hope that answers it.
  • Vivek Kumar
    Vivek Kumar over 5 years
    @JohnSamuel More precisely, train split will be empty and test split will get that single row in the current implementation. There is an ongoing discussion as to what should be the default behaviour for this case.
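
The single-row case discussed in these comments is easy to check against whatever version is installed. A hedged sketch: to my knowledge, more recent scikit-learn releases raise a ValueError rather than return an empty train split, so the code below handles both outcomes (the one-row arrays are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up dataset with a single row.
X = np.array([[1, 2]])
y = np.array([0])

try:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    outcome = "split"
    # Behaviour described in the comments: one side is empty, the other holds the row.
    print(len(X_train), len(X_test))
except ValueError as exc:
    outcome = "error"
    # Behaviour I would expect on recent versions: an empty train set is rejected.
    print("ValueError:", exc)
```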