Sklearn Pipeline: Get feature names after OneHotEncode In ColumnTransformer

Solution 1

You can access the feature names using the following snippet:

clf.named_steps['preprocessor'].transformers_[1][1]\
   .named_steps['onehot'].get_feature_names(categorical_features)

With sklearn >= 0.21, this can be written more simply:

clf['preprocessor'].transformers_[1][1]['onehot']\
                   .get_feature_names(categorical_features)

Reproducible example:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'brand': ['aaaa', 'asdfasdf', 'sadfds', 'NaN'],
                   'category': ['asdf', 'asfa', 'asdfas', 'as'],
                   'num1': [1, 1, 0, 0],
                   'target': [0.2, 0.11, 1.34, 1.123]})

numeric_features = ['num1']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['brand', 'category']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('regressor',  LinearRegression())])
clf.fit(df.drop(columns='target'), df['target'])

clf.named_steps['preprocessor'].transformers_[1][1]\
   .named_steps['onehot'].get_feature_names(categorical_features)

# ['brand_NaN' 'brand_aaaa' 'brand_asdfasdf' 'brand_sadfds' 'category_as'
#  'category_asdf' 'category_asdfas' 'category_asfa']
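
A common follow-up is how to line the names up with every column of the transformed matrix, numeric ones included. The sketch below is one way to do it for this particular pipeline, assuming the transformers are declared as ('num', 'cat') and that no columns are passed through. The getattr fallback is there because get_feature_names was deprecated in Scikit-Learn 1.0 and removed in 1.2 in favour of get_feature_names_out; the variable names all_feature_names and coef_by_feature are mine.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({'brand': ['aaaa', 'asdfasdf', 'sadfds', 'NaN'],
                   'category': ['asdf', 'asfa', 'asdfas', 'as'],
                   'num1': [1, 1, 0, 0],
                   'target': [0.2, 0.11, 1.34, 1.123]})
numeric_features = ['num1']
categorical_features = ['brand', 'category']

preprocessor = ColumnTransformer([
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                      ('scaler', StandardScaler())]), numeric_features),
    ('cat', Pipeline([('imputer', SimpleImputer(strategy='constant',
                                                fill_value='missing')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]),
     categorical_features)])
clf = Pipeline([('preprocessor', preprocessor),
                ('regressor', LinearRegression())])
clf.fit(df.drop(columns='target'), df['target'])

# Look the fitted encoder up by name instead of by position
onehot = (clf.named_steps['preprocessor']
             .named_transformers_['cat']
             .named_steps['onehot'])

# Newer releases only have get_feature_names_out; older ones only get_feature_names
get_names = getattr(onehot, 'get_feature_names_out', None) or onehot.get_feature_names

# Output columns follow the order of the transformers list: 'num' first, then 'cat'
all_feature_names = numeric_features + list(get_names(categorical_features))

# Pair each name with the fitted coefficient (or feature_importances_ for tree models)
coef_by_feature = dict(zip(all_feature_names, clf.named_steps['regressor'].coef_))
```

The scaler and imputer keep one output column per input column, which is why the numeric names can be reused unchanged here.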

Solution 2

Scikit-Learn 1.0 now has new features to keep track of feature names.

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# In Scikit-Learn 1.0, SimpleImputer does not have get_feature_names_out, so
# we add it manually when it is missing. Newer releases ship this method, in
# which case the patch is skipped.
if not hasattr(SimpleImputer, "get_feature_names_out"):
    SimpleImputer.get_feature_names_out = (lambda self, names=None:
                                           self.feature_names_in_)

num_pipeline = make_pipeline(SimpleImputer(), StandardScaler())
transformer = make_column_transformer(
    (num_pipeline, ["age", "height"]),
    (OneHotEncoder(), ["city"]))
pipeline = make_pipeline(transformer, LinearRegression())

df = pd.DataFrame({"city": ["Rabat", "Tokyo", "Paris", "Auckland"],
                   "age": [32, 65, 18, 24],
                   "height": [172, 163, 169, 190],
                   "weight": [65, 62, 54, 95]},
                  index=["Alice", "Bunji", "Cécile", "Dave"])

pipeline.fit(df, df["weight"])

# Get the feature names produced by the preprocessing part of the pipeline
pipeline[:-1].get_feature_names_out()

# Use those feature names as the columns of the transformed data
pd.DataFrame(pipeline[:-1].transform(df),
             columns=pipeline[:-1].get_feature_names_out(),
             index=df.index)

Solution 3

EDIT: actually, the answer to Peter's comment is in the ColumnTransformer documentation:

The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.


To complete Venkatachalam's answer with what Paul asked in his comment: the order of the feature names returned by the ColumnTransformer's .get_feature_names() method depends on the order in which the transformers are declared when the ColumnTransformer is instantiated.

I could not find any documentation at first, so I played with the toy example below, which helped me understand the logic.

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import RobustScaler

class testEstimator(BaseEstimator, TransformerMixin):
    def __init__(self, string):
        self.string = string

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Ignore the input values and emit one column filled with the label
        return np.full(X.shape, self.string).reshape(-1, 1)

    def get_feature_names(self):
        return self.string

transformers = [('first_transformer',testEstimator('A'),1), ('second_transformer',testEstimator('B'),0)]
column_transformer = ColumnTransformer(transformers)
steps = [('scaler',RobustScaler()), ('transformer', column_transformer)]
pipeline = Pipeline(steps)

dt_test = np.zeros((1000,2))
pipeline.fit_transform(dt_test)

for name,step in pipeline.named_steps.items():
    if hasattr(step, 'get_feature_names'):
        print(step.get_feature_names())

For the sake of having a more representative example, I added a RobustScaler and nested the ColumnTransformer in a Pipeline. The loop above is my version of Venkatachalam's way of getting the feature names by looping over the steps. You can turn it into a slightly more usable variable by collecting the names with a list comprehension:

[name for step in pipeline.named_steps.values() if hasattr(step, 'get_feature_names') for name in step.get_feature_names()]

So play around with dt_test and the estimators to see how the feature names are built and how they are concatenated in get_feature_names(). Here is another example with a transformer that outputs 2 columns, using the input column:

class testEstimator3(BaseEstimator,TransformerMixin):
    def __init__(self,string):
        self.string = string

    def fit(self,X):
        self.unique = np.unique(X)[0]
        return self

    def transform(self,X):
        return np.concatenate((X.reshape(-1,1), np.full(X.shape,self.string).reshape(-1,1)), axis = 1)

    def get_feature_names(self):
        return list((self.unique,self.string))

dt_test2 = np.concatenate((np.full((1000,1),'A'),np.full((1000,1),'B')), axis = 1)

transformers = [('first_transformer',testEstimator3('A'),1), ('second_transformer',testEstimator3('B'),0)]
column_transformer = ColumnTransformer(transformers)
steps = [('transformer', column_transformer)]
pipeline = Pipeline(steps)

pipeline.fit_transform(dt_test2)
for step in pipeline.steps:
    if hasattr(step[1], 'get_feature_names'):
        print(step[1].get_feature_names())

Solution 4

If you are looking for how to access column names after successive pipelines with the last one being ColumnTransformer, you can access them by following this example:

The full_pipeline contains two pipelines, gender and relevent_experience:

full_pipeline = ColumnTransformer([
    ("gender", gender_encoder, ["gender"]),
    ("relevent_experience", relevent_experience_encoder, ["relevent_experience"]),
])

The gender pipeline looks like this:

gender_encoder = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ("cat", OneHotEncoder())
])

After fitting the full_pipeline, you can access the column names using the following snippet:

full_pipeline.transformers_[0][1][1].get_feature_names()

In my case the output was: array(['x0_Female', 'x0_Male', 'x0_Other'], dtype=object)
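
The x0_ prefix appears because the encoder sits behind the imputer and only sees a NumPy array, so it has no column names of its own. A possible variant, sketched below on made-up data, looks the steps up by name and passes the original column name explicitly. The getattr fallback covers both the older get_feature_names and the newer get_feature_names_out, depending on the installed Scikit-Learn version.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up sample data for illustration only
data = pd.DataFrame({"gender": ["Female", "Male", "Other", "Male"]})

gender_encoder = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("cat", OneHotEncoder()),
])
full_pipeline = ColumnTransformer([
    ("gender", gender_encoder, ["gender"]),
])
full_pipeline.fit(data)

# Access by name rather than transformers_[0][1][1]
onehot = full_pipeline.named_transformers_["gender"].named_steps["cat"]

# Newer releases only have get_feature_names_out; older ones only get_feature_names
get_names = getattr(onehot, "get_feature_names_out", None) or onehot.get_feature_names

# Passing the input column name replaces the generic x0 prefix
names = list(get_names(["gender"]))
print(names)
```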

Author by

ResidentSleeper

Updated on July 05, 2022

Comments

  • ResidentSleeper
    ResidentSleeper almost 2 years

    I want to get feature names after I fit the pipeline.

    categorical_features = ['brand', 'category_name', 'sub_category']
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])
        
    numeric_features = ['num1', 'num2', 'num3', 'num4']
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())])
    
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)])
    

    Then

    clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('regressor', GradientBoostingRegressor())])
    

    After fitting with pandas dataframe, I can get feature importances from

    clf.steps[1][1].feature_importances_

    and I tried clf.steps[0][1].get_feature_names() but I got an error

    AttributeError: Transformer num (type Pipeline) does not provide get_feature_names.
    

    How can I get feature names from this?

  • Paul
    Paul about 5 years
    how does one correctly match the feature importances with ALL the feature names (numeric + categorical)? Especially with OHE(handle_unknown='ignore').
  • ResidentSleeper
    ResidentSleeper about 5 years
    @Paul In my case, I have combined df.columns with feature_names after that I removed categorical_features from the name list then combined it with feature_importances_.
  • Paul
    Paul about 5 years
    exactly, but how do you make sure they are combined in the right order, so that they match up with the vector of feature importances? Seems not straightforward, would appreciate elegant code snippets
  • Venkatachalam
    Venkatachalam over 4 years
    The combining order would be same as the pipeline steps. Hence we can find the exact order of the features. stackoverflow.com/a/57534118/6347629 answer might be of use for you
  • Ozkan Serttas
    Ozkan Serttas over 3 years
    So StandardScaler() does not have get_feature_names() . Do we have to combine the field names of numeric ones and one hot encoded ones later? Is there any other API that could provide us with the full feature names?
  • Venkatachalam
    Venkatachalam over 3 years
    This answer stackoverflow.com/questions/57528350/… might give some clarity.
  • Maths12
    Maths12 over 2 years
    this does not work for me since i get AttributeError: 'ColumnTransformer' object has no attribute 'transformers_'
  • Andi Anderle
    Andi Anderle over 2 years
    For me this results in Estimator encoder does not provide get_feature_names_out. Did you mean to call pipeline[:-1].get_feature_names_out()?
  • ZAKARYA ROUZKI
    ZAKARYA ROUZKI over 2 years
    @AndiAnderle get_feature_names_out is not implemented on all estimators, see github.com/scikit-learn/scikit-learn/issues/21308 , I am using pipeline[:-1] to select only the column transformers step.
  • Andi Anderle
    Andi Anderle over 2 years
    That's exactly what I do (pipeline[0].get_feature_names_out()). pipeline[0] is my ColumnTransformer with OrdinalEncoder and SimpleImputer. Still says the above mentioned error.
  • ZAKARYA ROUZKI
    ZAKARYA ROUZKI over 2 years
    Are you sure that you have Scikit-Learn 1.0 version ?
  • Andi Anderle
    Andi Anderle over 2 years
    Yes. 1.0.1… really strange..
  • Just trying
    Just trying over 2 years
    Did you figure this out? If so, please share; I'm interested. I'm trying to do the same thing: I have a pipeline with an imputer and an ordinal encoder, and I need to keep track of the feature names after fitting.
  • chris
    chris over 2 years
    SimpleImputer doesn't have the function get_feature_names_out, unless you're using the nightly sklearn build.
  • pabz
    pabz almost 2 years
    I am curious if the sequence of your pipeline matters. For example I derived my plot_names using this process, but would like validation this is correct. The feature importance plot looks intuitive. plot_names = pipeline.named_steps['preprocessor'].transformers_[0][2] + list(pipeline.named_steps['preprocessor'].transformers_[1][1].get_feature_names(categorical_feature_indexes))