How to decode a numpy array of encoded literals/strings in Python3? AttributeError: 'numpy.ndarray' object has no attribute 'decode'

arrays python-3.x pandas numpy unicode

24,374

Solution 1

You have an array of bytestrings; dtype is S:

In [338]: arr=np.array((b'first_element', b'element'))
In [339]: arr
Out[339]: 
array([b'first_element', b'element'], 
      dtype='|S13')

astype easily converts them to unicode (dtype U), the default string type in Python 3:

In [340]: arr.astype('U13')
Out[340]: 
array(['first_element', 'element'], 
      dtype='<U13')

NumPy also has a library of string functions, np.char, each of which applies the corresponding str or bytes method to every element of a string array:

In [341]: np.char.decode(arr)
Out[341]: 
array(['first_element', 'element'], 
      dtype='<U13')

astype is faster, but np.char.decode lets you specify an encoding, which matters when the bytes are not plain ASCII.
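As a minimal sketch of the difference between the two approaches, the snippet below uses a made-up array containing one non-ASCII value. The sample data and variable names are illustrative assumptions, not part of the original question. It relies on astype('U') accepting a dtype without an explicit length, and on np.char.decode taking an encoding argument.

```python
import numpy as np

# ASCII-only bytes: astype('U') works, and no length has to be spelled out.
ascii_only = np.array([b"one"], dtype="S7").astype("U")
print(ascii_only)

# UTF-8 encoded bytes; the second element contains a non-ASCII character.
arr = np.array(["element".encode("utf-8"), "café".encode("utf-8")])

# astype('U') does not take an encoding, so it may fail on non-ASCII bytes.
try:
    via_astype = arr.astype("U")
except UnicodeDecodeError as exc:
    via_astype = None
    print("astype failed:", exc)

# np.char.decode accepts an explicit encoding and handles the same data.
via_decode = np.char.decode(arr, "utf-8")
print(via_decode)
```

If the data is known to be pure ASCII, astype('U') is the simpler choice; otherwise passing the encoding explicitly is the safer route.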

Solution 2

If you want the result to be a (Python) list of strings, you can use a list comprehension:

>>> l = [el.decode('UTF-8') for el in array1]
>>> print(l)
['element', 'element 2']
>>> print(type(l))
<class 'list'>

Alternatively, if you want to keep it as a NumPy array, you can use np.vectorize to make a vectorized decoder function:

>>> decoder = np.vectorize(lambda x: x.decode('UTF-8'))
>>> array2 = decoder(array1)
>>> print(array2)
['element' 'element 2']
>>> print(type(array2))
<class 'numpy.ndarray'>
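If the array might hold a mix of bytes and other objects, the decoder can be guarded with an isinstance check. This is only a sketch: the mixed object array below is invented for illustration, and otypes=[object] is passed so that np.vectorize does not try to infer a single string dtype for the output.

```python
import numpy as np

# Hypothetical object array mixing bytes with values that need no decoding.
mixed = np.array([b"element", "already str", 3], dtype=object)

# Decode only the elements that are bytes; pass everything else through.
safe_decoder = np.vectorize(
    lambda x: x.decode("UTF-8") if isinstance(x, bytes) else x,
    otypes=[object],
)

result = safe_decoder(mixed)
print(result.tolist())
```

Note that np.vectorize is a convenience wrapper around a Python-level loop, so it should not be expected to be faster than the list comprehension.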


Author by

ShanZhengYang

Updated on April 14, 2020

Comments

ShanZhengYang about 4 years
In Python 3, I have the following NumPy array of strings.

Each string in the NumPy array is in the form b'MD18EE' instead of MD18EE.

For example:
```
import numpy as np
print(array1)
(b'first_element', b'element',...)
```
Normally, one would use .decode('UTF-8') to decode these elements.

However, if I try:
```
array1 = array1.decode('UTF-8')
```
I get the following error:
```
AttributeError: 'numpy.ndarray' object has no attribute 'decode'
```
How do I decode these elements from a NumPy array? (That is, I don't want b'')

EDIT:

Let's say I was dealing with a Pandas DataFrame with only certain columns that were encoded in this manner. For example:
```
import pandas as pd
df = pd.DataFrame(...)

df
        COL1          ....
0   b'entry1'         ...
1   b'entry2'
2   b'entry3'
3   b'entry4'
4   b'entry5'
5   b'entry6'
```
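For the DataFrame case, one possible approach is to decode the affected columns with the pandas .str.decode accessor. The sketch below makes several assumptions: the sample frame, the COL2 column, and the rule "a column counts as encoded if every value is bytes" are all invented for illustration, and the real data may need a different test.

```python
import pandas as pd

# Small stand-in for the DataFrame in the question; COL2 is hypothetical.
df = pd.DataFrame({"COL1": [b"entry1", b"entry2"], "COL2": [1, 2]})

# Assumed rule: a column is "encoded" if every one of its values is bytes.
bytes_cols = [
    col for col in df.columns
    if df[col].map(lambda v: isinstance(v, bytes)).all()
]

# Decode only those columns, leaving the others untouched.
for col in bytes_cols:
    df[col] = df[col].str.decode("utf-8")

print(df)
```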
ShanZhengYang over 7 years

Thanks! I'm taking the numpy array and putting it into a pandas dataframe. Maybe there are quicker shortcuts? Convert by column?
Wander Nauta over 7 years

Do you mean quicker as in 'runs faster' or quicker as in 'less code'? Because both methods are oneliners, the print statements are just to show that they work :)
ShanZhengYang over 7 years

:) I was thinking run faster. However, I think this method works fine---this appears to be a Python2/Python3 side effect, so I suspect others have run into this issue.
ShanZhengYang over 7 years

In any case, using decoder gives me this error: AttributeError: 'numpy.void' object has no attribute 'decode'
Wander Nauta over 7 years

Hmm, in that case, it looks like your array is not an array of strings at all, but rather an array of strings and voids - but I'm sure you'll be able to modify the decoder to handle those as well. At any rate, I think the best (and probably fastest) way to approach this would be to make sure you use strings everywhere, rather than bytes. How you would do that depends on where your data is coming from and how you read it.
Wander Nauta over 7 years

You would only decode the things that are bytes, and ignore the voids: lambda x: x.decode('UTF-8') if isinstance(x, bytes) else x. However, like I said, it would be nicer to deal with this elsewhere.
hpaulj over 7 years

np.void is probably a record from a structured array, a compound dtype. What is the dtype of your array?
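To illustrate the structured-array case, here is a rough sketch that decodes only the bytes fields of a compound dtype. The field names and sample records are assumptions, since the actual dtype of the asker's array is not shown. It builds a parallel dtype in which each S field becomes a U field of the same length, then fills it field by field.

```python
import numpy as np

# Hypothetical structured array; each element is a np.void record.
records = np.array(
    [(b"first_element", 1), (b"element", 2)],
    dtype=[("name", "S13"), ("value", "i4")],
)

# Build a matching dtype where every bytes field (kind 'S') becomes unicode.
new_dtype = np.dtype([
    (name, f"U{records.dtype[name].itemsize}")
    if records.dtype[name].kind == "S"
    else (name, records.dtype[name])
    for name in records.dtype.names
])

# Copy field by field, decoding the bytes fields along the way.
decoded = np.empty(records.shape, dtype=new_dtype)
for name in records.dtype.names:
    field = records[name]
    if field.dtype.kind == "S":
        decoded[name] = np.char.decode(field, "utf-8")
    else:
        decoded[name] = field

print(decoded)
```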
John Jiang over 2 years

The astype method seems too specific with the byte length information. For instance what if my input dtype is '|S1' rather than '|S13'?
hpaulj over 2 years

@John, it looks like we don't have to specify the length: np.array('one', 'S7').astype('U')
John Jiang over 2 years

I tried astype('U') on some bytearray and got UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128). However np.char.decode(arr) worked alright.