What is feature hashing (hashing-trick)?

python hash vector machine-learning

17,953

Solution 1

On Pandas, you could use something like this:

import pandas as pd
import numpy as np

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

data = pd.DataFrame(data)

def hash_col(df, col, N):
    cols = [col + "_" + str(i) for i in range(N)]
    def xform(x): tmp = [0 for i in range(N)]; tmp[hash(x) % N] = 1; return pd.Series(tmp,index=cols)
    df[cols] = df[col].apply(xform)
    return df.drop(col,axis=1)

print hash_col(data, 'state',4)

The output would be

   pop  year  state_0  state_1  state_2  state_3
0  1.5  2000        0        1        0        0
1  1.7  2001        0        1        0        0
2  3.6  2002        0        1        0        0
3  2.4  2001        0        0        0        1
4  2.9  2002        0        0        0        1

Also on Series level, you could

import numpy as np, os import sys, pandas as pd

def hash_col(df, col, N):
    df = df.replace('',np.nan)
    cols = [col + "_" + str(i) for i in range(N)]
    tmp = [0 for i in range(N)]
    tmp[hash(df.ix[col]) % N] = 1
    res = df.append(pd.Series(tmp,index=cols))
    return res.drop(col)

a = pd.Series(['new york',30,''],index=['city','age','test'])
b = pd.Series(['boston',30,''],index=['city','age','test'])

print hash_col(a,'city',10)
print hash_col(b,'city',10)

This will work per single Series, column name will be assumed to be a Pandas index. It also replaces blank strings with nan, and floats everything.

age        30
test      NaN
city_0      0
city_1      0
city_2      0
city_3      0
city_4      0
city_5      0
city_6      0
city_7      1
city_8      0
city_9      0
dtype: object
age        30
test      NaN
city_0      0
city_1      0
city_2      0
city_3      0
city_4      0
city_5      1
city_6      0
city_7      0
city_8      0
city_9      0
dtype: object

If, however, there is a vocabulary, and you simply want to one-hot-encode, you could use

import numpy as np
import pandas as pd, os
import scipy.sparse as sps

def hash_col(df, col, vocab):
    cols = [col + "=" + str(v) for v in vocab]
    def xform(x): tmp = [0 for i in range(len(vocab))]; tmp[vocab.index(x)] = 1; return pd.Series(tmp,index=cols)
    df[cols] = df[col].apply(xform)
    return df.drop(col,axis=1)

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

df = pd.DataFrame(data)

df2 = hash_col(df, 'state', ['Ohio','Nevada'])

print sps.csr_matrix(df2)

which will give

   pop  year  state=Ohio  state=Nevada
0  1.5  2000           1             0
1  1.7  2001           1             0
2  3.6  2002           1             0
3  2.4  2001           0             1
4  2.9  2002           0             1

I also added sparsification of the final dataframe as well. In incremental setting where we might not have encountered all values beforehand (but we somehow obtained the list of all possible values somehow), the approach above can be used. Incremental ML methods would need the same number of features at each increment, hence one-hot encoding must produce the same number of rows at each batch.

Solution 2

Here (sorry I cannot add this as a comment for some reason.) Also, the first page of Feature Hashing for Large Scale Multitask Learning explains it nicely.

Solution 3

Large sparse feature can be derivate from interaction, U as user and X as email, so the dimension of U x X is memory intensive. Usually, task like spam filtering has time limitation as well.

Hash trick like other hash function store binary bits (index) which make large scale training feasible. In theory, more hashed length more performance gain, as illustrated in the original paper.

It allocate origin feature into different bucket (finite length of feature space) so that their semantic get kept. Even when spammer use typo to miss on the radar. Although there is distortion error, heir hashed form remain close.

For example,

"the quick brown fox" transform to:

h(the) mod 5 = 0

h(quick) mod 5 = 1

h(brown) mod 5 = 1

h(fox) mod 5 = 3

Use index rather then text value, saves space.

To summarize some of the applications:

dimensionality reduction for high dimension feature vector
- text in email classification task, collaborate filtering on spam
sparsification
bag-of-words on the fly
cross-product features
multi-task learning

Reference:

Origin paper:
1. Feature Hashing for Large Scale Multitask Learning
2. Shi, Q., Petterson, J., Dror, G., Langford, J., Smola, A., Strehl, A., & Vishwanathan, V. (2009). Hash kernels
What is the hashing trick
Quora
Gionis, A., Indyk, P., & Motwani, R. (1999). Similarity search in high dimensions via hashing

Implementation:

Langford, J., Li, L., & Strehl, A. (2007). Vow- pal wabbit online learning project (Technical Report). http://hunch.net/?p=309.

17,953

Author by

Maggie

Updated on June 17, 2022

Comments

Maggie about 2 years

I know feature hashing (hashing-trick) is used to reduce the dimensionality and handle sparsity of bit vectors but I don't understand how it really works. Can anyone explain this to me.Is there any python library available to do feature hashing?

Thank you.
Fred Foo about 11 years

Unfortunately, this is not reliable as Python's hash may use a random seed (when called as python -R, by default in newer Python 3.x). Results may differ between runs of the script. See my answer for a more robust implementation.
BBSysDyn about 11 years

Please feel free to use any other hash() function in place of the simple one shown above. Besides that the snippet does everything I need - is Pandas based does column renaming, one-hot-encoding based on N, etc.
Kai about 8 years

Can you comment on the impact of feature hashing on the learned model? since there will be hash collisions. Yes I know they're improbable and minimal etc, but collisions will occur; what is the impact of these collisions on the learned model? any pointers to research that looks into this question is appreciated. One thing is clear, the learned model from hashed features is NOT guaranteed to be the same Model you get from the original un-hashed features. How do they differ and to what degree?
CKM over 7 years

Can you comment on choosing the number of dimensions to pick for feature hashing?
CKM over 7 years

@user423805. Can you please rewrite the first code for one more column in the dictionary 'country':['us','ohio','pak','india','india']? I'm new to python at the moment and learning it. I just want to verify my understanding from the original paper by Langford et al.
CodeFarmer over 7 years

@Kai I'v added original paper on this topic. Error boundary was analysed so does empirical results. Please take a look.
Stanleyrr about 6 years

@user423805, I know this post has already been closed but may I ask you one question about the answer you provided to Maggie? My question is why did you store the 'state' variable as 4 features (state_0, state_1, state_2 and state_3) instead of storing it as say 5 or 6 features?
Stanleyrr about 6 years

May I ask why does 'h(quick) mod 5 ' and 'h(brown) mod 5' are both equal to 1 when 'quick' and 'brown' are different words?