SVC (support vector classification) with categorical (string) data as labels

python machine-learning scikit-learn svm

11,612

Solution 1

In particular, look at the OneHotEncoder. It converts categorical values into a numeric format that SVMs can use.
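As a rough sketch of what that looks like for a categorical feature column (the colour data below is made up for illustration, and the linear kernel is just a convenient choice, not something from the question):

```python
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# Hypothetical toy data: a single categorical feature (a colour name).
X_raw = [["red"], ["green"], ["blue"], ["red"], ["green"], ["blue"]]
y = ["A", "B", "C", "A", "B", "C"]

# One-hot encode the feature: one binary column per distinct category.
encoder = OneHotEncoder()
X_encoded = encoder.fit_transform(X_raw)  # sparse matrix of shape (6, 3)

clf = SVC(kernel="linear", C=100.0)
clf.fit(X_encoded, y)

# New samples must go through the same fitted encoder before predict().
prediction = clf.predict(encoder.transform([["green"]]))
```

Note that the encoder is applied to the features (X) here; the string labels in y are passed to SVC unchanged.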

You can try this code:

```
from sklearn import svm

X = [[0, 0], [1, 1], [2, 3]]
y = ['A', 'B', 'C']  # string labels are passed directly
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(X, y)
clf.predict([[2, 3]])
```

Output: array(['C'], dtype='|S1')

Pass the dependent variable (y) as a list.
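If you would rather map the strings to integers yourself (A = 0, B = 1, and so on), a minimal sketch using scikit-learn's LabelEncoder could look like this. It reuses the toy data from above; whether the explicit mapping is worth it is a judgment call, since SVC also accepts the string labels as they are:

```python
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

X = [[0, 0], [1, 1], [2, 3]]
y = ["A", "B", "C"]

# Map each distinct label to an integer code (classes are sorted first).
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)  # array([0, 1, 2])

clf = SVC(gamma=0.001, C=100.0)
clf.fit(X, y_encoded)

# Predictions come back as integer codes; map them back to the strings.
predicted_codes = clf.predict([[2, 3]])
predicted_labels = label_encoder.inverse_transform(predicted_codes)
```

The integer codes are only class identifiers here, so the ordering 0 < 1 < 2 does not imply any ordering between the categories.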

Updated on June 17, 2022

beta almost 2 years
I use scikit-learn to implement a simple supervised learning algorithm. In essence I follow the tutorial here (but with my own data).

I try to fit the model:
```
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(features_training, labels_training)
```
But at the second line, I get an error: ValueError: could not convert string to float: 'A'

The error is expected because labels_training contains string values that represent three different categories, such as A, B, C.

So the question is: how do I use SVC (support vector classification) if the labelled data represents categories in the form of strings? One intuitive solution seems to be to simply convert each string to a number, for instance A = 0, B = 1, etc. But is this really the best solution?
Martin Thoma almost 8 years

You should at least link directly to the section and mention the OneHotEncoder.
gtzinos over 6 years

But how could one-hot encoding help when you try to predict with a new color that was not in the training data? Maybe in that case you have to retrain the model. Do you have any solution?
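One possible way to at least avoid an error on an unseen category is OneHotEncoder's handle_unknown="ignore" option, sketched below with made-up colour values. This is only a partial answer: the unseen value is encoded as all zeros, so the model has no information about it and would still need retraining to learn anything about the new category.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training categories; "purple" is never seen during fit().
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit([["red"], ["green"], ["blue"]])

# A known category gets a single 1 in its column (columns: blue, green, red).
known = encoder.transform([["green"]]).toarray()

# An unknown category is encoded as a row of zeros instead of raising an error.
unseen = encoder.transform([["purple"]]).toarray()
```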