'Series' object has no attribute 'decode' in pandas

I could be wrong, but I would guess that what you have are actual byte strings (b"XXXXX") rather than strings containing the repr of a byte string ("b'XXXXX'"), as posted in your question. In that case you can do the following (you need to use the string accessor, .str):

preparedData['text'] = preparedData['text'].str.decode('utf8')
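As a minimal sketch of that case, assuming the column really holds bytes objects (the sample DataFrame below is made up for illustration and is not the asker's data):

```python
import pandas as pd

# Hypothetical sample: every value in the column is a real bytes object
df = pd.DataFrame({"text": [b"caf\xc3\xa9", b"\xf0\x9f\x8c\xb9 are red"]})

# A Series has no .decode method of its own; the .str accessor
# applies bytes.decode to each element instead
decoded = df["text"].str.decode("utf8")
print(decoded.tolist())
```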

Edit: Looks like my assumption was wrong, in which case you can do a pre-processing step:

import ast
preparedData['text'] = preparedData['text'].apply(ast.literal_eval).str.decode("utf-8")
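
To illustrate the pre-processing step, here is a small sketch using two values copied from the question's sample output. The DataFrame name `df` and the intermediate `as_bytes` are only for illustration. Note that `ast.literal_eval` raises an exception on any value that is not a valid Python literal, so this assumes every row has the "b'...'" form.

```python
import ast
import pandas as pd

# Each value is a str holding the *text* of a bytes literal, as in the question
df = pd.DataFrame({"text": [
    "b'\\xf0\\x9f\\x8e\\xb5Is it too late now to say sorry\\xf0\\x9f\\x8e\\xb5 #tartetalk #memes'",
    'b"@Danikins__ It\'s best applied with our buffer brush! \\xf0\\x9f\\x92\\x9c\\xc2\\xa0"',
]})

# Step 1: safely evaluate the literal text to get real bytes objects
as_bytes = df["text"].apply(ast.literal_eval)

# Step 2: decode the bytes element-wise through the .str accessor
decoded = as_bytes.str.decode("utf-8")
print(decoded.tolist())
```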

Author: Kabilesh

Updated on June 04, 2022

Comments

  • Kabilesh
    Kabilesh almost 2 years

    I am trying to decode UTF-8 encoded text in Python. The data is loaded into a pandas DataFrame and then I decode it. This produces an error: AttributeError: 'Series' object has no attribute 'decode'. How can I properly decode the text in a pandas column?

    >>> preparedData.head(5).to_dict()
    {'id': {0: 1042616899408945154, 1: 1042592536769044487, 2: 1042587702040903680, 3: 1042587263643930626, 4: 1042586780292276230}, 'date': {0: '2018-09-20', 1: '2018-09-20', 2: '2018-09-20', 3: '2018-09-20', 4: '2018-09-20'}, 'time': {0: '03:30:14', 1: '01:53:25', 2: '01:34:13', 3: '01:32:28', 4: '01:30:33'}, 'text': {0: "b'\\xf0\\x9f\\x8c\\xb9 are red, violets are blue, if you want to buy us \\xf0\\x9f\\x92\\x90, here is a CLUE \\xf0\\x9f\\x98\\x89 Our #flowerpowered eye & cheek palette is AL\\xe2\\x80\\xa6 '", 1: "b'\\xf0\\x9f\\x8e\\xb5Is it too late now to say sorry\\xf0\\x9f\\x8e\\xb5 #tartetalk #memes'", 2: "b'@JillianJChase Oh no! Please email your order # to [email protected] & we can help \\xf0\\x9f\\x92\\x95'", 3: 'b"@Danikins__ It\'s best applied with our buffer brush! \\xf0\\x9f\\x92\\x9c\\xc2\\xa0"', 4: "b'@AdelaineMorin DEAD \\xf0\\x9f\\xa4\\xa3\\xf0\\x9f\\xa4\\xa3\\xf0\\x9f\\xa4\\xa3'"}, 'hasMedia': {0: 0, 1: 1, 2: 0, 3: 0, 4: 0}, 'hasHashtag': {0: 1, 1: 1, 2: 0, 3: 0, 4: 0}, 'followers_count': {0: 801745, 1: 801745, 2: 801745, 3: 801745, 4: 801745}, 'retweet_count': {0: 17, 1: 94, 2: 0, 3: 0, 4: 0}, 'favourite_count': {0: 181, 1: 408, 2: 0, 3: 0, 4: 14}}
    

    My data looks like the above. I want to decode the 'text' column.

    ExampleText = b'\xf0\x9f\x8c\xb9 are red, violets are blue, if you want to buy us \xf0\x9f\x92\x90, here is a CLUE \xf0\x9f\x98\x89 Our #flowerpowered eye & cheek palette is AL\xe2\x80\xa6'

    I could decode the text above as

    ExampleText = ExampleText.decode('utf8')
    

    However, when I try to decode the text in a pandas DataFrame column, I get the error. I tried this:

    preparedData['text'] = preparedData['text'].decode('utf8')
    

    The error I get is:

    Traceback (most recent call last):
      File "F:/Level 4 Research Project/makeViral/main.py", line 23, in <module>
        main()
      File "F:/Level 4 Research Project/makeViral/main.py", line 19, in main
        preprocessedData = preprocessData(preparedData)
      File "F:\Level 4 Research Project\makeViral\preprocess.py", line 34, in preprocessData
        preparedData['text'] = preparedData['text'].decode('utf8')
      File "C:\Users\Kabilesh\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\generic.py", line 4376, in __getattr__
        return object.__getattribute__(self, name)
    AttributeError: 'Series' object has no attribute 'decode'
    

    I also tried

    preparedData['text'] = preparedData['text'].str.decode('utf8', errors='strict')
    

    This does not produce an error, but every value in the resulting 'text' column is NaN:

    'text': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan}
    
  • Kabilesh
    Kabilesh over 5 years
    I checked my data. Some values start with "b'@makeupbyalishan and some with b'Natural glam FTW!. I don't know how this happened. I took my data from Twitter with tweepy, and the data is UTF-8 encoded. What should I do?
  • Sven Harris
    Sven Harris over 5 years
    See my edit, which I think resolves the issue you're having (at least on the examples you gave). It looks to me like your byte strings have been turned into actual strings at some point in the process (maybe you saved the data to a file format and read it in again?). Without more digging I couldn't be sure.
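
If the column turns out to be mixed, as the comment above suggests, one possible approach is to normalize each value individually. This is only a sketch: `to_text` is a hypothetical helper, not part of pandas, and the sample Series is invented. It assumes a value is either real bytes, a str holding a bytes literal, or already plain text.

```python
import ast
import pandas as pd

def to_text(value, encoding="utf-8"):
    """Hypothetical helper: decode bytes or bytes-literal text, pass other values through."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, str) and value[:2] in ("b'", 'b"'):
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return value  # not a well-formed literal; leave it unchanged
        if isinstance(parsed, bytes):
            return parsed.decode(encoding)
    return value

# Invented sample covering the three assumed cases
mixed = pd.Series([b"caf\xc3\xa9", "b'caf\\xc3\\xa9'", "plain text"])
cleaned = mixed.apply(to_text)
print(cleaned.tolist())
```

Applied to the question's data, this would be used as preparedData['text'].apply(to_text). It is slower than the vectorized .str.decode call, but it does not fail when a row is already decoded.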