Convert categorical data into numerical data in Python

python machine-learning encoding nlp categorical-data

21,815

You probably want to use an Encoder. One of the most used and popular ones are LabelEncoder and OneHotEncoder. Both are provided as parts of sklearn library.

LabelEncoder can be used to transform categorical data into integers:

from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
x = ['Apple', 'Orange', 'Apple', 'Pear']
y = label_encoder.fit_transform(x)
print(y)

array([0, 1, 0, 2])

This would transform a list of ['Apple', 'Orange', 'Apple', 'Pear'] into [0, 1, 0, 2] with each integer corresponding to an item. This is not always ideal for ML as the integers have different numerical values, suggesting that one is bigger than the other, with, for example Pear > Apple, which is not at all the case. To not introduce this kind of problem you'd want to use OneHotEncoder.

OneHotEncoder can be used to transform categorical data into one hot encoded array. Encoding previously defined y by using OneHotEncoder would result in:

from numpy import array
from numpy import argmax
from sklearn.preprocessing import OneHotEncoder
onehot_encoder = OneHotEncoder(sparse=False)
y = y.reshape(len(y), 1)
onehot_encoded = onehot_encoder.fit_transform(y)
print(onehot_encoded)

[[1. 0. 0.]
[0. 1. 0.]
[1. 0. 0.]
[0. 0. 1.]]

Where each element of x turns into an array of zeroes and just one 1 which encodes the category of the element.

A simple tutorial on how to use this on a DataFrame can be found here.

21,815

user3359964

Updated on October 08, 2020

Comments

user3359964 over 3 years

I have a data set. One of its columns - "Keyword" - contains categorical data. The machine learning algorithm that I am trying to use takes only numeric data. I want to convert "Keyword" column into numeric values - How can I do that? Using NLP? Bag of words?

I tried the following but I got ValueError: Expected 2D array, got 1D array instead.

from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()
dataset['Keyword'] = count_vector.fit_transform(dataset['Keyword'])
from sklearn.model_selection import train_test_split
y=dataset['C']
x=dataset(['Keyword','A','B'])
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)
from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(x_train,y_train)

user3359964 over 4 years

Thanks for your quick reply! but my ' Keyword' column contains a set of words. For example "HTML Language"..
Alexander Rossa over 4 years

@user3359964 I am not sure I understand. Can you extend your question with the exact definition of "Keyword"? Do you mean that on each row under that column there is a string with multiple words separated by space? Or that each item under that column is a list of strings?
user3359964 over 4 years

Sorry for that.. Yeah , Each row of the 'Keyword' column contains a string with multiple words separated by space.
user3359964 over 4 years

Index Keywords 1 HTML language 2 Math tutorial
Alexander Rossa over 4 years

@user3359964 I see. Well, then it depends on how you envisage the desired encoding to look like. What do you want the output to be like? A different category per unique string? Or take into account just the first space-separated word from that string when creating categories? It all really depends on what you want to do. Perhaps if each word in that string is meaningful it should really be separated into its own column, eg. Subject, Type so that "HTML" and "Math" will be under Subject and "language" and "tutorial" under Type. Then encode those two columns as in my answer.