numpy convert categorical string arrays to an integer array
Solution 1
Well, this is a hack (it only works for single lowercase ASCII letters stored as length-1 byte strings, dtype |S1)... but does it help?
In [72]: c=(a.view(np.ubyte)-96).astype('int32')
In [73]: print(c,c.dtype)
(array([1, 2, 3, 1, 2, 3]), dtype('int32'))
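A minimal sketch of the same trick on Python 3, where string arrays default to a unicode dtype (`<U1`) rather than `|S1`. It assumes every element is a single lowercase ASCII letter; the variable names are illustrative only.

```python
import numpy as np

# Byte strings of length 1 (dtype 'S1'): each element occupies one byte,
# so viewing the buffer as unsigned bytes yields the ASCII codes.
a_bytes = np.array(['a', 'b', 'c', 'a', 'b', 'c'], dtype='S1')
c_bytes = (a_bytes.view(np.ubyte) - 96).astype('int32')

# Unicode strings of length 1 (dtype 'U1', the Python 3 default): each
# element occupies four bytes, so the view has to be a 32-bit integer.
a_str = np.array(['a', 'b', 'c', 'a', 'b', 'c'])
c_str = (a_str.view(np.uint32) - 96).astype('int32')

print(c_bytes, c_bytes.dtype)
print(c_str, c_str.dtype)
```

Longer strings or characters outside 'a' to 'z' would not map cleanly this way, which is why the np.unique approach is usually the safer choice.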
Solution 2
np.unique has some optional return values.
return_inverse gives the integer encoding, which I use very often:
>>> b, c = np.unique(a, return_inverse=True)
>>> b
array(['a', 'b', 'c'],
dtype='|S1')
>>> c
array([0, 1, 2, 0, 1, 2])
>>> c+1
array([1, 2, 3, 1, 2, 3])
It can also be used to recreate the original array from the unique values:
>>> b[c]
array(['a', 'b', 'c', 'a', 'b', 'c'],
dtype='|S1')
>>> (b[c] == a).all()
True
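Unlike the byte-view hack, `return_inverse` does not depend on string length. A small sketch with made-up multi-character labels (the `labels` array is an assumption for illustration, not data from the question):

```python
import numpy as np

# Hypothetical multi-character categories.
labels = np.array(['red', 'green', 'blue', 'green', 'red'])

# uniques holds the sorted distinct values; codes holds, for each element
# of labels, the index of its value inside uniques.
uniques, codes = np.unique(labels, return_inverse=True)

print(uniques)
print(codes)

# Indexing uniques with codes reconstructs the original array.
print((uniques[codes] == labels).all())
```

Because np.unique sorts its output, the codes follow the sorted order of the labels, not the order in which they first appear.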
Solution 3
... years later....
For completeness (because this isn't mentioned in the answers) and personal reasons (I always have pandas imported in my modules but not necessarily sklearn), this is also quite straightforward with pandas.get_dummies():
import numpy as np
import pandas
In [1]: a = np.array(['a', 'b', 'c', 'a', 'b', 'c'])
In [2]: b = pandas.get_dummies(a)
In [3]: b
Out[3]:
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
4  0  1  0
5  0  0  1
In [4]: b.values.argmax(1)
Out[4]: array([0, 1, 2, 0, 1, 2])
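The same idea as a standalone script, sketched under the assumption of a reasonably recent pandas (where the dummy columns may be boolean rather than 0/1, which does not affect `argmax`). The column labels tell you which category each code stands for.

```python
import numpy as np
import pandas as pd

a = np.array(['a', 'b', 'c', 'a', 'b', 'c'])

# One indicator column per distinct value, in sorted order.
dummies = pd.get_dummies(a)

# The position of the single "on" entry in each row is the integer code.
codes = dummies.to_numpy().argmax(axis=1)

# Column labels give the category that each code refers to.
columns = list(dummies.columns)

print(codes)
print(columns)
```

Building the full indicator matrix costs memory proportional to rows times categories, so this is mostly convenient when the dummies are needed anyway.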
Solution 4
One way is to use the categorical function from scikits.statsmodels. For example:
In [60]: from scikits.statsmodels.tools import categorical
In [61]: a = np.array( ['a', 'b', 'c', 'a', 'b', 'c'])
In [62]: b = categorical(a, drop=True)
In [63]: b.argmax(1)
Out[63]: array([0, 1, 2, 0, 1, 2])
The return value from categorical (b) is actually a design matrix, hence the call to argmax above to get it close to your desired format.
In [64]: b
Out[64]:
array([[ 1., 0., 0.],
[ 0., 1., 0.],
[ 0., 0., 1.],
[ 1., 0., 0.],
[ 0., 1., 0.],
[ 0., 0., 1.]])
Solution 5
Another option is to use a categorical pandas Series:
>>> import pandas as pd
>>> pd.Series(['a', 'b', 'c', 'a', 'b', 'c'], dtype="category").cat.codes.values
array([0, 1, 2, 0, 1, 2], dtype=int8)
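A short sketch extending the one-liner above to also recover which label each code stands for, via the `.cat.categories` index. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'a', 'b', 'c'], dtype="category")

# Integer code for every element (a small signed integer dtype is used
# when there are few categories).
codes = s.cat.codes.to_numpy()

# categories[i] is the label that code i refers to.
categories = list(s.cat.categories)

print(codes)
print(categories)
```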
wroscoe
Updated on July 09, 2022
Comments
-
wroscoe almost 2 years
I'm trying to convert a string array of categorical variables to an integer array of categorical variables.
Ex.
import numpy as np
a = np.array( ['a', 'b', 'c', 'a', 'b', 'c'])
print a.dtype
>>> |S1
b = np.unique(a)
print b
>>> ['a' 'b' 'c']
c = a.desired_function(b)
print c, c.dtype
>>> [1,2,3,1,2,3] int32
I realize this can be done with a loop but I imagine there is an easier way. Thanks.
-
smci almost 11 years
You seriously want to add the caveat that this only works for length-1 strings.
-
SeeTheC about 7 years
Thanks. Finally found the answer I was looking for.
-
bib about 3 years
How can we know that 'a' is '0' and so on? Is there any code that can return that?
-
Tim Skov Jacobsen over 2 years
@bib: I believe a new running number/index is allocated every time a new string is encountered while traversing the array from left to right. So 'a' is 0 because it was the first string that was seen.
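For the np.unique approach specifically, the label-to-code mapping asked about here can be read straight off the array of unique values. A minimal sketch follows; the input array is made up so that its first element is not the smallest label. Note that np.unique returns its unique values sorted, so with that function the codes follow sorted order rather than order of first appearance.

```python
import numpy as np

# Hypothetical input whose first element is NOT the smallest label.
a = np.array(['c', 'a', 'b', 'c', 'a'])

uniques, codes = np.unique(a, return_inverse=True)

# Position in the sorted uniques array is the integer code for that label.
mapping = {str(label): int(code) for code, label in enumerate(uniques)}

print(mapping)
print(codes)
```

Here 'c' appears first in the array but still receives the largest code, because the codes index into the sorted unique values.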