How to decode a numpy array of encoded literals/strings in Python3? AttributeError: 'numpy.ndarray' object has no attribute 'decode'
Solution 1
You have an array of bytestrings; the dtype is S:
In [338]: arr=np.array((b'first_element', b'element'))
In [339]: arr
Out[339]:
array([b'first_element', b'element'],
dtype='|S13')
astype easily converts them to unicode, the default string type in Py3.
In [340]: arr.astype('U13')
Out[340]:
array(['first_element', 'element'],
dtype='<U13')
There is also a library of string functions, np.char, which applies the corresponding str or bytes method to each element of a string array:
In [341]: np.char.decode(arr)
Out[341]:
array(['first_element', 'element'],
dtype='<U13')
The astype is faster, but the decode lets you specify an encoding.
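To illustrate the encoding point, here is a minimal sketch with non-ASCII data. The sample bytes and variable names are made up for this example and do not come from the question.

```python
import numpy as np

# Made-up sample data: the same word encoded two different ways.
utf8_arr = np.array(['caf\u00e9'.encode('utf-8')])      # b'caf\xc3\xa9'
latin1_arr = np.array(['caf\u00e9'.encode('latin-1')])  # b'caf\xe9'

# np.char.decode takes the encoding as its second argument.
from_utf8 = np.char.decode(utf8_arr, 'utf-8')
from_latin1 = np.char.decode(latin1_arr, 'latin-1')

print(from_utf8, from_latin1)
```

Both calls should recover the same text, provided the encoding passed in matches the one used to produce the bytes.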
See also How to decode a numpy array of dtype=numpy.string_?
Solution 2
If you want the result to be a (Python) list of strings, you can use a list comprehension:
>>> l = [el.decode('UTF-8') for el in array1]
>>> print(l)
['element', 'element 2']
>>> print(type(l))
<class 'list'>
Alternatively, if you want to keep it as a NumPy array, you can use np.vectorize to make a vectorized decoder function:
>>> decoder = np.vectorize(lambda x: x.decode('UTF-8'))
>>> array2 = decoder(array1)
>>> print(array2)
['element' 'element 2']
>>> print(type(array2))
<class 'numpy.ndarray'>
ShanZhengYang
Updated on April 14, 2020
Comments
-
ShanZhengYang about 4 years
In Python 3, I have the following NumPy array of strings. Each string in the NumPy array is in the form b'MD18EE' instead of MD18EE. For example:
import numpy as np
print(array1)
(b'first_element', b'element',...)
Normally, one would use .decode('UTF-8') to decode these elements. However, if I try:
array1 = array1.decode('UTF-8')
I get the following error:
AttributeError: 'numpy.ndarray' object has no attribute 'decode'
How do I decode these elements from a NumPy array? (That is, I don't want the b''.)
EDIT:
Let's say I was dealing with a Pandas DataFrame with only certain columns that were encoded in this manner. For example:
import pandas as pd
df = pd.DataFrame(...)
df
   COL1 ....
0 b'entry1' ...
1 b'entry2'
2 b'entry3'
3 b'entry4'
4 b'entry5'
5 b'entry6'
-
ShanZhengYang over 7 years
Thanks! I'm taking the NumPy array and putting it into a pandas DataFrame. Maybe there are quicker shortcuts? Convert by column?
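For the column-wise case, one possible approach is pandas' Series.str.decode, sketched below. The frame contents and the COL2 column are hypothetical stand-ins for the elided DataFrame in the question, and this sketch makes no claim about relative speed.

```python
import pandas as pd

# Hypothetical frame: COL1 holds bytes, COL2 is already numeric.
df = pd.DataFrame({
    'COL1': [b'entry1', b'entry2', b'entry3'],
    'COL2': [1, 2, 3],
})

# Series.str.decode decodes each bytes element of the column.
df['COL1'] = df['COL1'].str.decode('utf-8')

print(df)
```

Only the columns that actually hold bytes need this treatment; the other columns can be left alone.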
-
Wander Nauta over 7 years
Do you mean quicker as in 'runs faster' or quicker as in 'less code'? Both methods are one-liners; the print statements are just to show that they work :)
-
ShanZhengYang over 7 years
:) I was thinking 'runs faster'. However, I think this method works fine. This appears to be a Python 2/Python 3 side effect, so I suspect others have run into this issue.
-
ShanZhengYang over 7 years
In any case, using decoder gives me this error:
AttributeError: 'numpy.void' object has no attribute 'decode'
-
Wander Nauta over 7 years
Hmm, in that case, it looks like your array is not an array of strings at all, but rather an array of strings and voids, but I'm sure you'll be able to modify the decoder to handle those as well. At any rate, I think the best (and probably fastest) way to approach this would be to make sure you use strings everywhere, rather than bytes. How you would do that depends on where your data is coming from and how you read it.
-
Wander Nauta over 7 years
You would only decode the things that are bytes, and ignore the voids:
lambda x: x.decode('UTF-8') if isinstance(x, bytes) else x
However, like I said, it would be nicer to deal with this elsewhere.
-
hpaulj over 7 years
np.void is probably a record from a structured array, a compound dtype. What is the dtype of your array?
-
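To make the structured-array point concrete, here is a small sketch. The field names and values are hypothetical, since the asker's actual dtype was never posted. It shows that elements of a structured array are numpy.void records, and that one way around the error is to decode the bytes field on its own.

```python
import numpy as np

# Hypothetical structured array: one bytes field and one integer field.
rec = np.array([(b'entry1', 1), (b'entry2', 2)],
               dtype=[('name', 'S6'), ('count', 'i4')])

# Each element is a numpy.void record, which has no .decode method.
print(type(rec[0]))

# Decode the bytes field by name instead of the whole record.
names = np.char.decode(rec['name'], 'utf-8')
print(names)
```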
John Jiang over 2 years
The astype method seems too specific with the byte length information. For instance, what if my input dtype is '|S1' rather than '|S13'?
-
hpaulj over 2 years
@John, it looks like we don't have to specify the length:
np.array('one', 'S7').astype('U')
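A quick check of that behaviour, as a sketch using the sample array from Solution 1 plus a made-up single-character array:

```python
import numpy as np

arr = np.array([b'first_element', b'element'])  # dtype 'S13'
short = np.array([b'a', b'b'])                  # dtype 'S1'

# With a bare 'U', NumPy takes the width from the source dtype.
print(arr.astype('U').dtype)
print(short.astype('U').dtype)
```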
-
John Jiang over 2 years
I tried astype('U') on some bytearray and got UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128). However, np.char.decode(arr) worked alright.
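One way to combine the two approaches is sketched below. The helper name to_unicode and its default encoding are assumptions made for illustration, not part of NumPy. It tries the astype cast first and falls back to np.char.decode with an explicit encoding when the cast fails with the UnicodeDecodeError reported above.

```python
import numpy as np

def to_unicode(arr, encoding='utf-8'):
    """Hypothetical helper: try the plain cast, fall back to an explicit decode."""
    try:
        return arr.astype('U')
    except UnicodeDecodeError:
        # The cast could not handle non-ASCII bytes, so decode explicitly.
        return np.char.decode(arr, encoding)

ascii_arr = np.array([b'first_element', b'element'])
accented = np.array(['\u00e9t\u00e9'.encode('utf-8')])  # made-up non-ASCII sample

print(to_unicode(ascii_arr))
print(to_unicode(accented))
```

If the encoding of the data is known up front, calling np.char.decode directly is simpler than relying on the fallback.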