How to decode a numpy array of encoded literals/strings in Python3? AttributeError: 'numpy.ndarray' object has no attribute 'decode'

Solution 1

You have an array of bytestrings; dtype is S:

In [338]: arr=np.array((b'first_element', b'element'))
In [339]: arr
Out[339]: 
array([b'first_element', b'element'], 
      dtype='|S13')

astype converts them to unicode (dtype U), the default string type in Python 3:

In [340]: arr.astype('U13')
Out[340]: 
array(['first_element', 'element'], 
      dtype='<U13')

There is also a library of string functions, np.char, which applies the corresponding str or bytes method to each element of a string array:

In [341]: np.char.decode(arr)
Out[341]: 
array(['first_element', 'element'], 
      dtype='<U13')

astype is faster, but np.char.decode lets you specify an encoding.
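To make the trade-off concrete, here is a minimal sketch of both approaches. The non-ASCII sample value is an assumption added for illustration; the failure of astype on such bytes is the behaviour reported in the comments and may depend on the NumPy version.

```python
import numpy as np

arr = np.array([b'first_element', b'element'])  # dtype 'S13'

# astype('U') works without spelling out the width; NumPy infers it
as_unicode = arr.astype('U')
print(as_unicode.dtype)

# Assumed sample: UTF-8 bytes that are not plain ASCII (an accented "cafe")
accented = np.array([b'caf\xc3\xa9'])

# np.char.decode accepts an explicit encoding
decoded = np.char.decode(accented, 'UTF-8')
print(decoded)

# astype has no encoding argument; the comments report a UnicodeDecodeError
# from the 'ascii' codec for bytes like these
try:
    accented.astype('U')
except UnicodeDecodeError as exc:
    print('astype failed:', exc)
```

If the data may contain anything other than ASCII, np.char.decode with an explicit encoding is probably the safer choice.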

See also How to decode a numpy array of dtype=numpy.string_?

Solution 2

If you want the result to be a (Python) list of strings, you can use a list comprehension:

>>> l = [el.decode('UTF-8') for el in array1]
>>> print(l)
['element', 'element 2']
>>> print(type(l))
<class 'list'>

Alternatively, if you want to keep it as a Numpy array, you can use np.vectorize to make a vectorized decoder function:

>>> decoder = np.vectorize(lambda x: x.decode('UTF-8'))
>>> array2 = decoder(array1)
>>> print(array2)
['element' 'element 2']
>>> print(type(array2))
<class 'numpy.ndarray'>
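For reference, both variants can be run as a self-contained script. The contents of array1 are an assumption, chosen to match the printed output above; the question does not show how the array was built.

```python
import numpy as np

# Assumed sample input matching the output shown above
array1 = np.array([b'element', b'element 2'])

# Variant 1: plain Python list of str
as_list = [el.decode('UTF-8') for el in array1]
print(as_list)

# Variant 2: stay in NumPy by wrapping the decoder with np.vectorize
decoder = np.vectorize(lambda x: x.decode('UTF-8'))
array2 = decoder(array1)
print(array2)
```

Note that np.vectorize is a convenience wrapper around a Python-level loop, so it should not be expected to run faster than the list comprehension.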
Author by

ShanZhengYang

Updated on April 14, 2020

Comments

  • ShanZhengYang
    ShanZhengYang about 4 years

    In Python 3, I have the following NumPy array of strings.

    Each string in the NumPy array is in the form b'MD18EE' instead of MD18EE.

    For example:

    import numpy as np
    print(array1)
    (b'first_element', b'element',...)
    

    Normally, one would use .decode('UTF-8') to decode these elements.

    However, if I try:

    array1 = array1.decode('UTF-8')
    

    I get the following error:

    AttributeError: 'numpy.ndarray' object has no attribute 'decode'
    

    How do I decode these elements from a NumPy array? (That is, I don't want b'')

    EDIT:

    Let's say I was dealing with a Pandas DataFrame with only certain columns that were encoded in this manner. For example:

    import pandas as pd
    df = pd.DataFrame(...)
    
    df
            COL1          ....
    0   b'entry1'         ...
    1   b'entry2'
    2   b'entry3'
    3   b'entry4'
    4   b'entry5'
    5   b'entry6'
    
  • ShanZhengYang
    ShanZhengYang over 7 years
    Thanks! I'm taking the numpy array and putting it into a pandas dataframe. Maybe there are quicker shortcuts? Convert by column?
  • Wander Nauta
    Wander Nauta over 7 years
    Do you mean quicker as in 'runs faster' or quicker as in 'less code'? Both methods are one-liners; the print statements are just there to show that they work :)
  • ShanZhengYang
    ShanZhengYang over 7 years
    :) I was thinking run faster. However, I think this method works fine---this appears to be a Python2/Python3 side effect, so I suspect others have run into this issue.
  • ShanZhengYang
    ShanZhengYang over 7 years
    In any case, using decoder gives me this error: AttributeError: 'numpy.void' object has no attribute 'decode'
  • Wander Nauta
    Wander Nauta over 7 years
    Hmm, in that case, it looks like your array is not an array of strings at all, but rather an array of strings and voids - but I'm sure you'll be able to modify the decoder to handle those as well. At any rate, I think the best (and probably fastest) way to approach this would be to make sure you use strings everywhere, rather than bytes. How you would do that depends on where your data is coming from and how you read it.
  • Wander Nauta
    Wander Nauta over 7 years
    You would only decode the things that are bytes, and ignore the voids: lambda x: x.decode('UTF-8') if isinstance(x, bytes) else x However, like I said, it would be nicer to deal with elsewhere.
  • hpaulj
    hpaulj over 7 years
    np.void is probably a record from a structured array, a compound dtype. What is the dtype of your array?
  • John Jiang
    John Jiang over 2 years
    The astype method seems too specific with the byte length information. For instance what if my input dtype is '|S1' rather than '|S13'?
  • hpaulj
    hpaulj over 2 years
    @John, it looks like we don't have to specify the length: np.array('one', 'S7').astype('U')
  • John Jiang
    John Jiang over 2 years
    I tried astype('U') on some bytearray and got UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128). However np.char.decode(arr) worked alright.
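
The numpy.void error discussed in the comments can be reproduced with a structured array. The following is a minimal sketch; the field names, dtypes and sample values are hypothetical, since the original array's dtype was never posted. It decodes one bytes field at a time, then shows the guarded decoder suggested above on an object array of mixed values.

```python
import numpy as np

# Hypothetical structured array (compound dtype); not from the original question
records = np.array(
    [(b'entry1', 1.0), (b'entry2', 2.0)],
    dtype=[('name', 'S6'), ('value', 'f8')],
)

# Each element is a numpy.void record, which has no .decode method
print(type(records[0]))

# Decode a single bytes field instead of the whole record
names = np.char.decode(records['name'], 'UTF-8')
print(names)

# Guarded decoder from the comments: decode bytes, pass everything else through.
# otypes=[object] keeps non-string values from being coerced to str.
decode_if_bytes = np.vectorize(
    lambda x: x.decode('UTF-8') if isinstance(x, bytes) else x,
    otypes=[object],
)
mixed = np.array([b'entry3', 42, b'entry4'], dtype=object)
print(decode_if_bytes(mixed))
```

Checking arr.dtype first, as asked in the comments, shows which of these cases applies.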