numpy convert categorical string arrays to an integer array

40,982

Solution 1

Well, this is a hack, and it only works for single-character lowercase ASCII strings stored as bytes (dtype '|S1'), but does it help?

In [72]: c=(a.view(np.ubyte)-96).astype('int32')

In [73]: print(c,c.dtype)
(array([1, 2, 3, 1, 2, 3]), dtype('int32'))
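
For readers on Python 3, here is a minimal sketch of the same trick. It assumes every entry is a single lowercase ASCII letter and that the array is explicitly created with dtype `S1`; by default NumPy on Python 3 builds Unicode arrays (`<U1`), where this byte view would not yield one value per element.

```python
import numpy as np

# Assumption: single lowercase ASCII letters, stored as 1-byte strings.
a = np.array(['a', 'b', 'c', 'a', 'b', 'c'], dtype='S1')

# Reinterpret each byte as an unsigned integer (ord('a') == 97),
# widen to int32, then shift so that 'a' maps to 1.
c = a.view(np.uint8).astype(np.int32) - 96

print(c, c.dtype)
```

Anything outside 'a'..'z', or any string longer than one character, silently produces meaningless codes, so this is best kept for quick experiments.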

Solution 2

np.unique has some optional return values.

return_inverse gives the integer encoding, which I use very often:

>>> b, c = np.unique(a, return_inverse=True)
>>> b
array(['a', 'b', 'c'], 
      dtype='|S1')
>>> c
array([0, 1, 2, 0, 1, 2])
>>> c+1
array([1, 2, 3, 1, 2, 3])

It can also be used to recreate the original array from the unique values:

>>> b[c]
array(['a', 'b', 'c', 'a', 'b', 'c'], 
      dtype='|S1')
>>> (b[c] == a).all()
True
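
The same call works for strings of any length, and the array of unique values doubles as a lookup table from code to label. A small sketch, using made-up fruit names (not from the question) to show that the codes follow the sorted order of the unique values:

```python
import numpy as np

# Hypothetical labels, chosen so that input order differs from sorted order.
a = np.array(['pear', 'apple', 'fig', 'pear', 'apple'])

labels, codes = np.unique(a, return_inverse=True)

# np.unique returns the unique values sorted, so code i refers to labels[i].
mapping = {str(label): i for i, label in enumerate(labels)}

print(codes)
print(mapping)
```

Here 'apple' gets code 0 even though 'pear' appears first, because the codes index into the sorted `labels` array.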

Solution 3

... years later....

For completeness (because this isn't mentioned in the answers) and personal reasons (I always have pandas imported in my modules but not necessarily sklearn), this is also quite straightforward with pandas.get_dummies()

import numpy as np
import pandas

In [1]: a = np.array(['a', 'b', 'c', 'a', 'b', 'c'])

In [2]: b = pandas.get_dummies(a)

In [3]: b
Out[3]: 
      a  b  c
   0  1  0  0
   1  0  1  0
   2  0  0  1
   3  1  0  0
   4  0  1  0
   5  0  0  1

In [4]: b.values.argmax(1)
Out[4]: array([0, 1, 2, 0, 1, 2])
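
As a plain script, the same idea might look like the sketch below. The variable names are illustrative. Newer pandas releases may display the indicator columns as True/False rather than 1/0, but argmax still picks the column that is set in each row.

```python
import numpy as np
import pandas as pd

a = np.array(['a', 'b', 'c', 'a', 'b', 'c'])

# One indicator column per distinct value.
dummies = pd.get_dummies(a)

# The position of the set column in each row is the integer code.
codes = dummies.to_numpy().argmax(axis=1)

# Column labels give the code -> string lookup.
labels = list(dummies.columns)

print(codes)
print(labels)
```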

Solution 4

One way is to use the categorical function from scikits.statsmodels (the package is now distributed as statsmodels). For example:

In [60]: from scikits.statsmodels.tools import categorical

In [61]: a = np.array( ['a', 'b', 'c', 'a', 'b', 'c'])

In [62]: b = categorical(a, drop=True)

In [63]: b.argmax(1)
Out[63]: array([0, 1, 2, 0, 1, 2])

The return value from categorical (b) is actually a design matrix, hence the call to argmax above to get it close to your desired format.

In [64]: b
Out[64]:
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

Solution 5

Another option is to use a categorical pandas Series:

>>> import pandas as pd
>>> pd.Series(['a', 'b', 'c', 'a', 'b', 'c'], dtype="category").cat.codes.values

array([0, 1, 2, 0, 1, 2], dtype=int8)
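
The categorical Series also keeps the labels, so the code-to-string mapping can be read back from it. A short sketch (the names `s` and `lookup` are only for illustration):

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'a', 'b', 'c'], dtype="category")

# Integer codes, one per element.
codes = s.cat.codes.to_numpy()

# The categories are indexed by code: categories[0] is the label for code 0.
categories = list(s.cat.categories)
lookup = dict(enumerate(s.cat.categories))

print(codes)
print(lookup)
```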
wroscoe
Author by

wroscoe

Updated on July 09, 2022

Comments

  • wroscoe
    wroscoe almost 2 years

    I'm trying to convert a string array of categorical variables to an integer array of categorical variables.

    Ex.

    import numpy as np
    a = np.array( ['a', 'b', 'c', 'a', 'b', 'c'])
    print a.dtype
    >>> |S1
    
    b = np.unique(a)
    print b
    >>>  ['a' 'b' 'c']
    
    c = a.desired_function(b)
    print c, c.dtype
    >>> [1,2,3,1,2,3] int32
    

    I realize this can be done with a loop but I imagine there is an easier way. Thanks.

  • smci
    smci almost 11 years
    You seriously want to add the caveat that this only works for length-1 strings.
  • SeeTheC
    SeeTheC about 7 years
    Thanks. Finally found the answer I was looking for.
  • bib
    bib about 3 years
    How can we know that 'a' is 0 and so on? Is there any code that can return that mapping?
  • Tim Skov Jacobsen
    Tim Skov Jacobsen over 2 years
    @bib: The codes follow the sorted order of the unique values (np.unique returns them sorted), not the order in which strings are first encountered. So 'a' is 0 because it sorts first.