Using Scikit-Learn OneHotEncoder with a Pandas DataFrame

Solution 1

It turned out that scikit-learn's LabelBinarizer worked better for converting the data to one-hot encoded format. With help from Amnie's solution, my final code is as follows:

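import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# data is an existing pandas DataFrame with a 'Profession' column
jobs_encoder = LabelBinarizer()
jobs_encoder.fit(data['Profession'])
transformed = jobs_encoder.transform(data['Profession'])
# Reuse the original index so that concat aligns the rows correctly
ohe_df = pd.DataFrame(transformed, index=data.index)
# Append the encoded columns and drop the original string column
data = pd.concat([data, ohe_df], axis=1).drop(['Profession'], axis=1)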

Solution 2

OneHotEncoder encodes categorical features as a one-hot numeric array. Its transform method returns a sparse matrix if sparse=True (the default; the parameter is named sparse_output from scikit-learn 1.2 onward), otherwise it returns a 2-D array.

You can't cast a 2-D array (or sparse matrix) into a pandas Series. You must create a pandas Series (a column in a pandas DataFrame) for each category.
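To illustrate the shape issue, here is a minimal sketch. The toy DataFrame is hypothetical and only stands in for the 'Profession' column from the question; the encoder is fitted on a column-shaped input of shape (n_samples, 1), and the result has one column per category rather than a single column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for the question's 'Profession' column
toy = pd.DataFrame({"Profession": ["unkn", "safe", "rece", "unkn", "lead"]})

encoder = OneHotEncoder()
# Double brackets keep the input 2-D with shape (n_samples, 1):
# one feature with several categories
encoded = encoder.fit_transform(toy[["Profession"]])

# The default output is a sparse matrix; convert it to a dense array to inspect it
dense = encoded.toarray()
print(dense.shape)           # (5, 4): five rows, four distinct categories
print(encoder.categories_)   # the categories learned for the single feature
```

Because the output has four columns here, it cannot be assigned back to the single column toy['Profession'].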

I would recommend pandas.get_dummies instead:

data = pd.get_dummies(data, prefix=['Profession'], columns=['Profession'], drop_first=True)

EDIT:

Using Sklearn OneHotEncoder:

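from sklearn.preprocessing import OneHotEncoder

jobs_encoder = OneHotEncoder()
# Fit on a column vector of shape (n_samples, 1); the result is sparse by default, so densify it
transformed = jobs_encoder.fit_transform(data['Profession'].to_numpy().reshape(-1, 1)).toarray()
# Create a pandas DataFrame of the one-hot encoded column
# (get_feature_names_out needs scikit-learn >= 1.0; older versions use get_feature_names)
ohe_df = pd.DataFrame(transformed, columns=jobs_encoder.get_feature_names_out(), index=data.index)
# Concatenate with the original data and drop the source column
data = pd.concat([data, ohe_df], axis=1).drop(['Profession'], axis=1)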

Other options: If you are doing hyperparameter tuning with GridSearchCV, it's recommended to use ColumnTransformer and FeatureUnion with Pipeline, or make_column_transformer directly.
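A rough sketch of that setup might look like the following. The toy data, the 'Age' and 'Hired' columns, the LogisticRegression model and the parameter grid are all assumptions made for illustration; none of them come from the original question.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; 'Age' and 'Hired' are made-up columns for illustration
df = pd.DataFrame({
    "Profession": ["unkn", "safe", "rece", "unkn", "lead", "safe", "rece", "lead"],
    "Age": [25, 32, 47, 51, 38, 29, 44, 36],
    "Hired": [0, 1, 0, 1, 1, 0, 1, 0],
})
X, y = df[["Profession", "Age"]], df["Hired"]

# One-hot encode 'Profession' and pass the remaining columns through unchanged
preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["Profession"]),
    remainder="passthrough",
)

# The encoder is refitted inside each cross-validation fold, which avoids leakage
pipe = make_pipeline(preprocess, LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name
search = GridSearchCV(pipe, {"logisticregression__C": [0.1, 1.0]}, cv=2)
search.fit(X, y)
print(search.best_params_)
```

Setting handle_unknown='ignore' matters here because a category may be absent from a given training fold yet appear in the corresponding validation fold.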

Solution 3

Below is an approach suggested by Kaggle Learn. I do not think there is currently a simpler way to go from an original pandas DataFrame to a one-hot encoded DataFrame.

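import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# X_train, X_valid, low_cardinality_cols and object_cols are assumed to be defined already
# Apply one-hot encoder to each column with categorical data
# (from scikit-learn 1.2 onward, use sparse_output=False instead of sparse=False)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)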
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
print(OH_X_train)

Author by

dd.

Computer Engineering Graduate

Updated on February 22, 2022

Comments

  • dd.
    dd. about 2 years

    I'm trying to replace a column of strings within a pandas DataFrame with its one-hot encoded equivalent using scikit-learn's OneHotEncoder. My code below doesn't work:

    from sklearn.preprocessing import OneHotEncoder
    # data is a Pandas DataFrame
    
    jobs_encoder = OneHotEncoder()
    jobs_encoder.fit(data['Profession'].unique().reshape(1, -1))
    data['Profession'] = jobs_encoder.transform(data['Profession'].to_numpy().reshape(-1, 1))
    

    It produces the following error (strings in the list are omitted):

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-91-3a1f568322f5> in <module>()
          3 jobs_encoder = OneHotEncoder()
          4 jobs_encoder.fit(data['Profession'].unique().reshape(1, -1))
    ----> 5 data['Profession'] = jobs_encoder.transform(data['Profession'].to_numpy().reshape(-1, 1))
    
    /usr/local/anaconda3/envs/ml/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in transform(self, X)
        730                                        copy=True)
        731         else:
    --> 732             return self._transform_new(X)
        733 
        734     def inverse_transform(self, X):
    
    /usr/local/anaconda3/envs/ml/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in _transform_new(self, X)
        678         """New implementation assuming categorical input"""
        679         # validation of X happens in _check_X called by _transform
    --> 680         X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
        681 
        682         n_samples, n_features = X_int.shape
    
    /usr/local/anaconda3/envs/ml/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in _transform(self, X, handle_unknown)
        120                     msg = ("Found unknown categories {0} in column {1}"
        121                            " during transform".format(diff, i))
    --> 122                     raise ValueError(msg)
        123                 else:
        124                     # Set the problematic rows to an acceptable value and
    
    ValueError: Found unknown categories ['...', ..., '...'] in column 0 during transform
    

    Here's some sample data:

    data['Profession'] =
    
    0         unkn
    1         safe
    2         rece
    3         unkn
    4         lead
              ... 
    111988    indu
    111989    seni
    111990    mess
    111991    seni
    111992    proj
    Name: Profession, Length: 111993, dtype: object
    

    What exactly am I doing wrong?

    • desertnaut
      desertnaut over 4 years
      Please include the full error trace, as well as a sample of your data['Profession'].
    • Quang Hoang
      Quang Hoang over 4 years
      A one-hot encoder returns a 2-D array of size data_length x num_categories. You cannot assign that to a single column df['Profession'].
    • Novice
      Novice almost 4 years
      Follow-up on dd.'s answer: OneHotEncoder can be used on multi-column data, whereas LabelBinarizer and LabelEncoder cannot. stackoverflow.com/a/54119850/1582366
  • dd.
    dd. over 4 years
    I want to be able to pickle the instance so I can use it on new data in the future; that's why I want to use OneHotEncoder. That can't be done with get_dummies, right?
  • Abel Paz
    Abel Paz almost 4 years
    That's right. If you want to use it on new data, you can't use get_dummies.
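
As a minimal sketch of the pickling workflow discussed in the last two comments: the toy data and the unseen 'pilot' category are hypothetical, and handle_unknown='ignore' is an assumption about how unseen categories should be handled.

```python
import pickle

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data standing in for the 'Profession' column
train = pd.DataFrame({"Profession": ["unkn", "safe", "rece", "lead"]})

# handle_unknown='ignore' encodes unseen categories as all zeros instead of raising
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["Profession"]])

# Serialize the fitted encoder to bytes; pickle.dump to a file works the same way
blob = pickle.dumps(encoder)
restored = pickle.loads(blob)

# New data containing a category ('pilot') that was not present during fitting
new_data = pd.DataFrame({"Profession": ["safe", "pilot"]})
encoded_new = restored.transform(new_data[["Profession"]]).toarray()
print(encoded_new)
```

The restored encoder keeps the column layout learned during fitting, so new data is always encoded into the same columns; get_dummies has no fitted state and builds its columns from whatever values are in the frame it is given.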