How to reverse sklearn.OneHotEncoder transform to recover original data?
Solution 1
A good systematic way to figure this out is to start with some test data and work through the sklearn.OneHotEncoder
source with it. If you don't much care about how it works and simply want a quick answer, skip to the bottom.
import numpy as np
X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8]
]).T
n_values_
Lines 1763-1786 determine the n_values_ parameter. This will be determined automatically if you set n_values='auto' (the default). Alternatively you can specify a maximum value for all features (int) or a maximum value per feature (array). Let's assume that we're using the default. So the following lines execute:
n_samples, n_features = X.shape # 10, 2
n_values = np.max(X, axis=0) + 1 # [100, 21]
self.n_values_ = n_values
feature_indices_
Next, the feature_indices_ parameter is calculated.
n_values = np.hstack([[0], n_values]) # [0, 100, 21]
indices = np.cumsum(n_values) # [0, 100, 121]
self.feature_indices_ = indices
So feature_indices_ is merely the cumulative sum of n_values_ with a 0 prepended.
Sparse Matrix Construction
Next, a scipy.sparse.coo_matrix is constructed from the data. It is initialized from three arrays: the sparse data (all ones), the row indices, and the column indices.
column_indices = (X + indices[:-1]).ravel()
# array([ 3, 105, 10, 101, 15, 103, 33, 107, 54, 108, 55, 112, 78, 115, 79, 119, 80, 120, 99, 108])
row_indices = np.repeat(np.arange(n_samples, dtype=np.int32), n_features)
# array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9], dtype=int32)
data = np.ones(n_samples * n_features)
# array([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
out = sparse.coo_matrix((data, (row_indices, column_indices)),
                        shape=(n_samples, indices[-1]),
                        dtype=self.dtype).tocsr()
# <10x121 sparse matrix of type '<type 'numpy.float64'>' with 20 stored elements in Compressed Sparse Row format>
Note that the coo_matrix is immediately converted to a scipy.sparse.csr_matrix. The coo_matrix is used as an intermediate format because it "facilitates fast conversion among sparse formats."
active_features_
Now, if n_values='auto', the sparse csr matrix is compressed down to only the columns with active features. The sparse csr_matrix is returned if sparse=True, otherwise it is densified before returning.
if self.n_values == 'auto':
    mask = np.array(out.sum(axis=0)).ravel() != 0
    active_features = np.where(mask)[0] # array([ 3, 10, 15, 33, 54, 55, 78, 79, 80, 99, 101, 103, 105, 107, 108, 112, 115, 119, 120])
    out = out[:, active_features] # <10x19 sparse matrix of type '<type 'numpy.float64'>' with 20 stored elements in Compressed Sparse Row format>
    self.active_features_ = active_features
return out if self.sparse else out.toarray()
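The steps so far can be replayed without scikit-learn. What follows is a minimal NumPy-only sketch of the n_values='auto' behaviour described above. It uses a dense array as a stand-in for the sparse matrix (an assumption that only suits small data), and the variable names are illustrative rather than the library's own code.

```python
import numpy as np

X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T
n_samples, n_features = X.shape

# One slot per possible value of each feature: [100, 21]
n_values = X.max(axis=0) + 1
# Offset of each feature's block of columns: [0, 100, 121]
feature_indices = np.cumsum(np.hstack([[0], n_values]))

# Global column id of every entry, row by row
column_indices = (X + feature_indices[:-1]).ravel()
row_indices = np.repeat(np.arange(n_samples), n_features)

# Dense stand-in for the coo/csr matrix (assumption: the data is small)
full = np.zeros((n_samples, feature_indices[-1]))
full[row_indices, column_indices] = 1.0

# Keep only the columns that are actually used
active_features = np.flatnonzero(full.sum(axis=0) != 0)
out = full[:, active_features]

print(out.shape)
print(active_features)
```

The printed shape and active columns should agree with the values quoted in the comments of the source excerpt above.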
Decoding
Now let's work in reverse. We'd like to know how to recover X given the sparse matrix that is returned along with the OneHotEncoder features detailed above. Let's assume we actually ran the code above by instantiating a new OneHotEncoder and running fit_transform on our data X.
from sklearn import preprocessing
ohc = preprocessing.OneHotEncoder() # all default params
out = ohc.fit_transform(X)
The key insight to solving this problem is understanding the relationship between active_features_ and out.indices. For a csr_matrix, the indices array contains the column numbers for each data point. However, these column numbers are not guaranteed to be sorted. To sort them, we can use the sorted_indices method.
out.indices # array([12, 0, 10, 1, 11, 2, 13, 3, 14, 4, 15, 5, 16, 6, 17, 7, 18, 8, 14, 9], dtype=int32)
out = out.sorted_indices()
out.indices # array([ 0, 12, 1, 10, 2, 11, 3, 13, 4, 14, 5, 15, 6, 16, 7, 17, 8, 18, 9, 14], dtype=int32)
We can see that before sorting, the indices are actually reversed along the rows. In other words, they are ordered with the last column first and the first column last. This is evident from the first two elements: [12, 0]. 0 corresponds to the 3 in the first column of X; since 3 is the minimum element, it was assigned to the first active column. 12 corresponds to the 5 in the second column of X. Since the first column of X occupies 10 distinct active columns (indices 0 through 9), the minimum element of the second column (1) gets index 10. The next smallest (3) gets index 11, and the third smallest (5) gets index 12. After sorting, the indices are ordered as we would expect.
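The mapping from values to compact column numbers can be checked by hand. This is a small NumPy-only sketch, not the encoder's code: offsets and active are hand-built stand-ins for ohc.feature_indices_[:-1] and ohc.active_features_, and np.searchsorted looks up each entry's position among the sorted active columns.

```python
import numpy as np

X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T

offsets = np.array([0, 100])          # stand-in for feature_indices_[:-1]
global_ids = (X + offsets).ravel()    # column ids before compression
active = np.unique(global_ids)        # sorted, like active_features_

# Position of each global id within the sorted active columns
compact = np.searchsorted(active, global_ids)
print(compact.reshape(X.shape))
```

Each printed row holds the two column numbers for one sample, in the same order as the sorted out.indices shown above.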
Next we look at active_features_:
ohc.active_features_ # array([ 3, 10, 15, 33, 54, 55, 78, 79, 80, 99, 101, 103, 105, 107, 108, 112, 115, 119, 120])
Notice that there are 19 elements, which corresponds to the number of distinct elements in our data (one element, 8, was repeated once). Notice also that these are arranged in order. The features that were in the first column of X are the same, and the features in the second column have simply been summed with 100, which corresponds to ohc.feature_indices_[1].
Looking back at out.indices, we can see that the maximum column number is 18, which is one less than the 19 active features in our encoding. A little thought about the relationship here shows that the indices of ohc.active_features_ correspond to the column numbers in out.indices. With this, we can decode:
import numpy as np
decode_columns = np.vectorize(lambda col: ohc.active_features_[col])
decoded = decode_columns(out.indices).reshape(X.shape)
This gives us:
array([[ 3, 105],
[ 10, 101],
[ 15, 103],
[ 33, 107],
[ 54, 108],
[ 55, 112],
[ 78, 115],
[ 79, 119],
[ 80, 120],
[ 99, 108]])
And we can get back to the original feature values by subtracting off the offsets from ohc.feature_indices_:
recovered_X = decoded - ohc.feature_indices_[:-1]
array([[ 3, 5],
[10, 1],
[15, 3],
[33, 7],
[54, 8],
[55, 12],
[78, 15],
[79, 19],
[80, 20],
[99, 8]])
Note that you will need to have the original shape of X, which is simply (n_samples, n_features).
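If the encoder returned a dense array (sparse=False), there is no indices attribute to read. Here is a hedged sketch of the same decoding for that case, using np.nonzero. The arrays feature_indices and active_features are hand-built stand-ins for ohc.feature_indices_ and ohc.active_features_, and the dense matrix is constructed manually, so the snippet runs without the legacy encoder.

```python
import numpy as np

X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T

# Stand-ins for the fitted attributes (assumed values for this X)
feature_indices = np.array([0, 100, 121])
active_features = np.unique((X + feature_indices[:-1]).ravel())

# Build the dense one-hot matrix by hand
cols = np.searchsorted(active_features, X + feature_indices[:-1])
dense = np.zeros((X.shape[0], active_features.size))
dense[np.arange(X.shape[0])[:, None], cols] = 1.0

# np.nonzero scans row by row, so column numbers are sorted within each row
_, col_numbers = np.nonzero(dense)
decoded = active_features[col_numbers].reshape(X.shape)
recovered_X = decoded - feature_indices[:-1]

print(np.array_equal(recovered_X, X))
```

Indexing active_features with the whole array of column numbers does the same lookup as the np.vectorize call above.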
TL;DR
Given the sklearn.OneHotEncoder instance called ohc, the encoded data (scipy.sparse.csr_matrix) output from ohc.fit_transform or ohc.transform called out, and the shape of the original data (n_samples, n_features), recover the original data X with:
recovered_X = (np.array([ohc.active_features_[col] for col in out.sorted_indices().indices])
               .reshape(n_samples, n_features) - ohc.feature_indices_[:-1])
Solution 2
Just compute the dot-product of the encoded values with ohe.active_features_. It works for both sparse and dense representations. Example:
from sklearn.preprocessing import OneHotEncoder
import numpy as np
orig = np.array([6, 9, 8, 2, 5, 4, 5, 3, 3, 6])
ohe = OneHotEncoder()
encoded = ohe.fit_transform(orig.reshape(-1, 1)) # input needs to be column-wise
decoded = encoded.dot(ohe.active_features_).astype(int)
assert np.allclose(orig, decoded)
The key insight is that the active_features_ attribute of the OHE model represents the original values for each binary column. Thus we can decode the binary-encoded number by simply computing a dot-product with active_features_. For each data point there's just a single 1 at the position of the original value.
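One caveat: with more than one feature, each row contains one 1 per feature, so a single dot product adds the values of all features together. Below is a hedged NumPy-only sketch that applies the dot product block by block instead. The one-hot matrix and the values_per_feature arrays are built by hand as stand-ins and are not scikit-learn attributes.

```python
import numpy as np

X = np.array([[6, 1], [9, 0], [8, 1], [2, 2]])

# Hand-built stand-ins: sorted distinct values of each column
values_per_feature = [np.unique(X[:, j]) for j in range(X.shape[1])]

# One-hot encode each column and stack the blocks side by side
blocks = [(X[:, [j]] == vals).astype(float)
          for j, vals in enumerate(values_per_feature)]
encoded = np.hstack(blocks)

# Column boundaries of each feature's block: [0, 4, 7]
edges = np.cumsum([0] + [len(v) for v in values_per_feature])

# Decode every block with its own dot product
decoded = np.column_stack([
    encoded[:, start:stop].dot(vals)
    for start, stop, vals in zip(edges[:-1], edges[1:], values_per_feature)
]).astype(int)

print(decoded)
```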
Solution 3
Use numpy.argmax() with axis=1.
Example:
ohe_encoded = np.array([[0, 0, 1], [0, 1, 0], [0, 1, 0], [1, 0, 0]])
ohe_encoded
> array([[0, 0, 1],
[0, 1, 0],
[0, 1, 0],
[1, 0, 0]])
np.argmax(ohe_encoded, axis = 1)
> array([2, 1, 1, 0], dtype=int64)
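Note that argmax returns column positions, not the original values. To get labels back, the positions still have to be looked up in the list of categories in column order. A minimal sketch, where the categories array is a hypothetical example and not something taken from an encoder:

```python
import numpy as np

# Hypothetical labels, listed in the same order as the encoded columns
categories = np.array(["cat", "dog", "fish"])
ohe_encoded = np.array([[0, 0, 1], [0, 1, 0], [0, 1, 0], [1, 0, 0]])

positions = np.argmax(ohe_encoded, axis=1)   # column of the 1 in each row
labels = categories[positions]               # map positions back to labels
print(labels)
```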
Solution 4
Since version 0.20 of scikit-learn, the active_features_ attribute of the OneHotEncoder class has been deprecated, so I suggest relying on the categories_ attribute instead.
The below function can help you recover the original data from a matrix that has been one-hot encoded:
import itertools

def reverse_one_hot(X, y, encoder):
    # One dict per sample; filled with the decoded categories and the target
    reversed_data = [{} for _ in range(len(y))]
    # Flat list of category values, in the same order as the encoded columns
    all_categories = list(itertools.chain(*encoder.categories_))
    category_names = ['category_{}'.format(i+1) for i in range(len(encoder.categories_))]
    category_lengths = [len(encoder.categories_[i]) for i in range(len(encoder.categories_))]
    for row_index, feature_index in zip(*X.nonzero()):
        category_value = all_categories[feature_index]
        category_name = get_category_name(feature_index, category_names, category_lengths)
        reversed_data[row_index][category_name] = category_value
        reversed_data[row_index]['target'] = y[row_index]
    return reversed_data

def get_category_name(index, names, lengths):
    # Walk the cumulative column counts to find which feature owns this column
    counter = 0
    for i in range(len(lengths)):
        counter += lengths[i]
        if index < counter:
            return names[i]
    raise ValueError('The index is higher than the number of categorical values')
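The lookup of which feature owns an encoded column can also be done with a cumulative sum and np.searchsorted. This is only a sketch of that alternative: feature_of_column is a hypothetical helper, and the lengths [4, 5] are assumed here to stand for 4 users followed by 5 items.

```python
import numpy as np

# Assumed block sizes: 4 user columns followed by 5 item columns
category_lengths = [4, 5]
boundaries = np.cumsum(category_lengths)   # [4, 9]

def feature_of_column(column):
    """Return the index of the feature that owns an encoded column."""
    if column < 0 or column >= boundaries[-1]:
        raise ValueError('column is outside the encoded matrix')
    # side='right' sends a column equal to a boundary to the next block
    return int(np.searchsorted(boundaries, column, side='right'))

print([feature_of_column(c) for c in (0, 3, 4, 8)])
```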
To test it, I have created a small data set that includes the ratings that users have given to movies:
data = [
{'user_id': 'John', 'item_id': 'The Matrix', 'rating': 5},
{'user_id': 'John', 'item_id': 'Titanic', 'rating': 1},
{'user_id': 'John', 'item_id': 'Forrest Gump', 'rating': 2},
{'user_id': 'John', 'item_id': 'Wall-E', 'rating': 2},
{'user_id': 'Lucy', 'item_id': 'The Matrix', 'rating': 5},
{'user_id': 'Lucy', 'item_id': 'Titanic', 'rating': 1},
{'user_id': 'Lucy', 'item_id': 'Die Hard', 'rating': 5},
{'user_id': 'Lucy', 'item_id': 'Forrest Gump', 'rating': 2},
{'user_id': 'Lucy', 'item_id': 'Wall-E', 'rating': 2},
{'user_id': 'Eric', 'item_id': 'The Matrix', 'rating': 2},
{'user_id': 'Eric', 'item_id': 'Die Hard', 'rating': 3},
{'user_id': 'Eric', 'item_id': 'Forrest Gump', 'rating': 5},
{'user_id': 'Eric', 'item_id': 'Wall-E', 'rating': 4},
{'user_id': 'Diane', 'item_id': 'The Matrix', 'rating': 4},
{'user_id': 'Diane', 'item_id': 'Titanic', 'rating': 3},
{'user_id': 'Diane', 'item_id': 'Die Hard', 'rating': 5},
{'user_id': 'Diane', 'item_id': 'Forrest Gump', 'rating': 3},
]
import pandas
data_frame = pandas.DataFrame(data)
data_frame = data_frame[['user_id', 'item_id', 'rating']]
If we are building a prediction model, we have to remember to delete the dependent variable (in this case the rating) from the DataFrame before we encode it.
ratings = data_frame['rating']
data_frame.drop(columns=['rating'], inplace=True)
Then we proceed to do the encoding:
from sklearn.preprocessing import OneHotEncoder
ohc = OneHotEncoder()
encoded_data = ohc.fit_transform(data_frame)
print(encoded_data)
Which results in:
(0, 2) 1.0
(0, 6) 1.0
(1, 2) 1.0
(1, 7) 1.0
(2, 2) 1.0
(2, 5) 1.0
(3, 2) 1.0
(3, 8) 1.0
(4, 3) 1.0
(4, 6) 1.0
(5, 3) 1.0
(5, 7) 1.0
(6, 3) 1.0
(6, 4) 1.0
(7, 3) 1.0
(7, 5) 1.0
(8, 3) 1.0
(8, 8) 1.0
(9, 1) 1.0
(9, 6) 1.0
(10, 1) 1.0
(10, 4) 1.0
(11, 1) 1.0
(11, 5) 1.0
(12, 1) 1.0
(12, 8) 1.0
(13, 0) 1.0
(13, 6) 1.0
(14, 0) 1.0
(14, 7) 1.0
(15, 0) 1.0
(15, 4) 1.0
(16, 0) 1.0
(16, 5) 1.0
After encoding, we can reverse it using the reverse_one_hot function we defined above, like this:
reverse_data = reverse_one_hot(encoded_data, ratings, ohc)
print(pandas.DataFrame(reverse_data))
Which gives us:
category_1 category_2 target
0 John The Matrix 5
1 John Titanic 1
2 John Forrest Gump 2
3 John Wall-E 2
4 Lucy The Matrix 5
5 Lucy Titanic 1
6 Lucy Die Hard 5
7 Lucy Forrest Gump 2
8 Lucy Wall-E 2
9 Eric The Matrix 2
10 Eric Die Hard 3
11 Eric Forrest Gump 5
12 Eric Wall-E 4
13 Diane The Matrix 4
14 Diane Titanic 3
15 Diane Die Hard 5
16 Diane Forrest Gump 3
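For scikit-learn 0.20 and later, the encoder also has an inverse_transform method that maps encoded rows back to the original categories. A minimal sketch on a small hand-made array (the data below is illustrative and not the full data set above):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Small illustrative sample: one row per (user, item) pair
X = np.array([
    ["John", "The Matrix"],
    ["Lucy", "Titanic"],
    ["Eric", "Die Hard"],
    ["John", "Titanic"],
])

ohc = OneHotEncoder()               # default settings
encoded = ohc.fit_transform(X)      # sparse matrix by default

# Map the one-hot rows back to the original categories
recovered = ohc.inverse_transform(encoded)
print(recovered)
```

This avoids tracking column offsets by hand, at the cost of requiring a fitted encoder object at decoding time.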
Solution 5
How to one-hot encode
See https://stackoverflow.com/a/42874726/562769
import numpy as np

nb_classes = 6
orig_data = [[2, 3, 4, 0]]

def indices_to_one_hot(data, nb_classes):
    """Convert an iterable of indices to one-hot encoded labels."""
    targets = np.array(data).reshape(-1)
    return np.eye(nb_classes)[targets]
How to reverse
def one_hot_to_indices(data):
    indices = []
    for el in data:
        indices.append(list(el).index(1))
    return indices
hot = indices_to_one_hot(orig_data, nb_classes)
indices = one_hot_to_indices(hot)
print(orig_data)
print(indices)
gives:
[[2, 3, 4, 0]]
[2, 3, 4, 0]
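The loop above assumes every row contains a 1. Here is a hedged sketch of a variant that first checks that each row is a valid one-hot vector and then decodes all rows at once. The name one_hot_to_indices_checked is hypothetical.

```python
import numpy as np

def one_hot_to_indices_checked(hot):
    """Decode one-hot rows after checking that each row has exactly one 1."""
    hot = np.asarray(hot)
    is_binary = np.isin(hot, (0, 1)).all()
    one_per_row = (hot.sum(axis=1) == 1).all()
    if not (is_binary and one_per_row):
        raise ValueError('every row must contain a single 1 and zeros elsewhere')
    return hot.argmax(axis=1).tolist()

hot = np.eye(6)[[2, 3, 4, 0]]
print(one_hot_to_indices_checked(hot))
```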
Phyreece
Updated on May 17, 2021
Comments
-
Phyreece almost 3 years
I encoded my categorical data using sklearn.OneHotEncoder and fed them to a random forest classifier. Everything seems to work and I got my predicted output back. Is there a way to reverse the encoding and convert my output back to its original state?
-
canyon289 over 8 years
I feel like I have the same lack of understanding. Why is this a strange question? Without decoding I wouldn't be able to tell what factor coded into 0,1 is paired with what coefficient
-
blueberryfields over 8 years
the onehotencoding implements the vanilla one-of-k algorithm - which optimizes performance by not using a fixed ordering for parameters. this means the algorithm doesn't guarantee the same encoding on multiple runs, and is not reversible. i'm not sure of your use case - if you're looking to do decoding, you're most likely using the wrong algorithm implementation - look at DictVectorizer, or extend the default with a mapping and a custom decoder.
-
Mack about 8 years
While it is true that the algorithm does not guarantee the same encoding on multiple runs, it is false that it is not reversible. It is actually quite easily reversible. Please see my answer for the procedure and a thorough explanation.
-
user3391229 about 8 years
TBH I skipped to the TL;DR. However, I found it would not work for me unless I used "out.sort_indices().indices" instead of merely "out.indices". Otherwise, I needed to switch the order of my two columns before subtracting "ohc.feature_indices_[:-1]"
-
Mack about 8 years
Quite right! I included that in the longer answer but left it out of the TL;DR. I've edited to fix this.
-
blueberryfields about 8 years
@Mack have you read your answer and explanation? we have different definitions of easy i think ;)
-
Mack about 8 years
I suppose we do. The TL;DR isn't so bad though. : )
-
user3391229 about 8 years
For me I tried to use the dictvectorizer, but my pandas dataframe was too big to convert to a dict of dicts. Hence Mack's answer provided me some of the functionality I needed.
-
NickBraunagel almost 7 years
This approach doesn't work for me when orig is a multi-dimensional array (e.g. orig = np.array([[6, 9, 8, 2, 5, 4, 5, 3, 3, 6],[6, 9, 8, 2, 5, 4, 5, 3, 3, 6]]))
-
NickBraunagel almost 7 years
@Mack Great answer, thank you! Now, what about when we pass the OneHotEncoded X to a predictive model (logistic regression, SVM etc.). How do we map the model's coefficients back to X? I want to be able to say, "variable foo increases the target by bar_coeff" but I don't understand how to map the model's coefficients back to the original category X. Here is the full-blown question posed by another user on SO: stackoverflow.com/questions/40141710/…
-
NickBraunagel almost 7 years
@Mack and here is my question on it: stackoverflow.com/questions/45041387/…
-
Andreus over 6 years
@blueberryfields Now I'm confused. If I run onehot.fit(), can I then safely run onehot.transform() on different input matrices and get resulting matrices with the same meaning?
-
TheRajVJain over 6 years
@Phyreese, you can select this as the answer
-
toom about 5 years
This is not an answer to this question!
-
Monica Heddneck almost 3 years
@Mack: I get AttributeError: 'OneHotEncoder' object has no attribute 'sorted_indices'. Does your answer still work for the latest version of sklearn?
-
Monica Heddneck almost 3 years
what library did you get matrix_utils from?
-
melqkiades almost 3 years
@MonicaHeddneck It's a personal library. I have fixed the code now.
-
Mack almost 3 years
@Monica, it's been a while since posting this, but I believe that object is a scipy.csr_matrix. So you'll want to look at your scipy version for compatibility. Alternatively, you can just replace that with code that performs an equivalent sort.