SVC (support vector classification) with categorical (string) data as labels


Solution 1

Take a look at section 4.3.4, "Encoding categorical features", of the scikit-learn preprocessing guide: http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features

In particular, look at the OneHotEncoder, which converts categorical values into a numeric format that SVMs can use.
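As a rough illustration of what one-hot encoding does, here is a minimal sketch. The `colors` column is made-up example data, not taken from the question. Note that OneHotEncoder is meant for input features (X); for target labels (y), scikit-learn provides LabelEncoder.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical feature column, one row per sample.
colors = [["red"], ["green"], ["blue"], ["green"]]

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it to a dense array.
encoded = encoder.fit_transform(colors).toarray()

print(encoder.categories_[0])  # categories learned from the data, in sorted order
print(encoded)                 # one row per sample, one column per category
```

Each row contains a single 1 in the column of its category, so the result can be passed to an SVM as ordinary numeric features.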

Solution 2

You can try this code:

from sklearn import svm

X = [[0, 0], [1, 1], [2, 3]]  # three training samples, two features each
y = ['A', 'B', 'C']           # string class labels, one per sample
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(X, y)
clf.predict([[2, 3]])

Output (under Python 2; Python 3 shows dtype='<U1' instead): array(['C'], dtype='|S1')

Pass the dependent variable (y) as a plain list of string labels.
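If you would rather work with numeric labels (A = 0, B = 1, and so on), the mapping can be made explicit with LabelEncoder instead of being written by hand. This is only a sketch that reuses the toy data above; the variable names are my own.

```python
from sklearn import svm
from sklearn.preprocessing import LabelEncoder

X = [[0, 0], [1, 1], [2, 3]]
y = ["A", "B", "C"]

# Map each distinct string label to an integer code (classes are sorted first).
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

clf = svm.SVC(gamma=0.001, C=100.0)
clf.fit(X, y_encoded)

# Predictions come back as integer codes; convert them back to the strings.
predicted_codes = clf.predict([[2, 3]])
predicted_labels = label_encoder.inverse_transform(predicted_codes)
print(predicted_labels)
```

The integer codes carry no ordering for the classifier, since SVC treats each distinct value in y as a separate class.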

Author by

beta

Updated on June 17, 2022

Comments

  • beta
    beta almost 2 years

    I use scikit-learn to implement a simple supervised learning algorithm. In essence I follow the tutorial here (but with my own data).

    I try to fit the model:

    clf = svm.SVC(gamma=0.001, C=100.)
    clf.fit(features_training,labels_training)
    

    But at the second line, I get an error: ValueError: could not convert string to float: 'A'

    The error is expected because labels_training contains string values that represent three different categories, such as A, B, C.

    So the question is: how do I use SVC (support vector classification) if the labelled data represents categories in the form of strings? One intuitive solution seems to be simply converting each string to a number, for instance A = 0, B = 1, etc. But is this really the best solution?

  • Martin Thoma
    Martin Thoma almost 8 years
    You should at least link directly to the section and mention the OneHotEncoder
  • gtzinos
    gtzinos over 6 years
    But how could one-hot encoding help when you try to predict a new color? Maybe in that case you have to retrain the model. Do you have a solution?
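On the question of a category that was not seen during fitting: one option, sketched below with made-up color values, is to build the encoder with handle_unknown="ignore". An unseen category is then encoded as a row of zeros rather than raising an error. Whether that is acceptable, or whether the model should be retrained, depends on the use case.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training categories.
train_colors = [["red"], ["green"], ["blue"]]

# "ignore" makes transform() emit an all-zero row for unseen categories.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_colors)

# "purple" was never seen during fit.
new_colors = [["green"], ["purple"]]
encoded_new = encoder.transform(new_colors).toarray()
print(encoded_new)
```

The model was never trained on the all-zero pattern, so predictions for such rows should be treated with caution.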