'Series' object has no attribute 'decode' in pandas
I could be wrong, but I would guess that what you have are byte strings, b"XXXXX", rather than strings that contain the text of a byte string, "b'XXXXX'", as you've posted in your question. In that case you can do the following (you need to use the string accessor, .str):
preparedData['text'] = preparedData['text'].str.decode('utf8')
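As a minimal sketch of the case this assumes, where the column holds real bytes objects (the two sample values below are made up for illustration, not taken from your data):

```python
import pandas as pd

# Hypothetical sample: the column holds real bytes objects, not text.
df = pd.DataFrame({"text": [b"\xf0\x9f\x8e\xb5 #tartetalk", b"plain ascii"]})

# A Series has no .decode method of its own; the .str accessor
# applies the decode to each element.
df["text"] = df["text"].str.decode("utf8")

print(df["text"].tolist())
```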
Edit: It looks like my assumption was wrong and the values are strings that contain byte-string literals. In that case you can parse them back into bytes as a pre-processing step:
import ast
preparedData['text'] = preparedData['text'].apply(ast.literal_eval).str.decode("utf-8")
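To make the two steps visible, here is the same idea on two of the values from the question. It is only a sketch: it assumes every value in the column is a complete bytes literal stored as text, and the intermediate name as_bytes is introduced here for illustration.

```python
import ast

import pandas as pd

# Two values copied from the question: str objects whose text is a bytes literal.
preparedData = pd.DataFrame({
    "text": [
        "b'\\xf0\\x9f\\x8e\\xb5Is it too late now to say sorry\\xf0\\x9f\\x8e\\xb5 #tartetalk #memes'",
        "b'@AdelaineMorin DEAD \\xf0\\x9f\\xa4\\xa3\\xf0\\x9f\\xa4\\xa3\\xf0\\x9f\\xa4\\xa3'",
    ]
})

# Step 1: parse each string as a Python literal, which yields real bytes.
as_bytes = preparedData["text"].apply(ast.literal_eval)

# Step 2: decode the bytes element-wise through the .str accessor.
preparedData["text"] = as_bytes.str.decode("utf-8")

print(preparedData["text"].tolist())
```

ast.literal_eval only evaluates literals, so it is a safer choice than eval for data that came from outside your program. It raises an exception on a value that is not a valid literal, so a column with mixed content would need extra handling.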
Kabilesh
Updated on June 04, 2022
Comments
-
Kabilesh almost 2 years
I am trying to decode UTF-8 encoded text in Python. The data is loaded into a pandas DataFrame and then I decode it. This produces an error: AttributeError: 'Series' object has no attribute 'decode'. How can I properly decode the text in a pandas column?
>> preparedData.head(5).to_dict( ) {'id': {0: 1042616899408945154, 1: 1042592536769044487, 2: 1042587702040903680, 3: 1042587263643930626, 4: 1042586780292276230}, 'date': {0: '2018-09-20', 1: '2018-09-20', 2: '2018-09-20', 3: '2018-09-20', 4: '2018-09-20'}, 'time': {0: '03:30:14', 1: '01:53:25', 2: '01:34:13', 3: '01:32:28', 4: '01:30:33'}, 'text': {0: "b'\\xf0\\x9f\\x8c\\xb9 are red, violets are blue, if you want to buy us \\xf0\\x9f\\x92\\x90, here is a CLUE \\xf0\\x9f\\x98\\x89 Our #flowerpowered eye & cheek palette is AL\\xe2\\x80\\xa6 '", 1: "b'\\xf0\\x9f\\x8e\\xb5Is it too late now to say sorry\\xf0\\x9f\\x8e\\xb5 #tartetalk #memes'", 2: "b'@JillianJChase Oh no! Please email your order # to [email protected] & we can help \\xf0\\x9f\\x92\\x95'", 3: 'b"@Danikins__ It\'s best applied with our buffer brush! \\xf0\\x9f\\x92\\x9c\\xc2\\xa0"', 4: "b'@AdelaineMorin DEAD \\xf0\\x9f\\xa4\\xa3\\xf0\\x9f\\xa4\\xa3\\xf0\\x9f\\xa4\\xa3'"}, 'hasMedia': {0: 0, 1: 1, 2: 0, 3: 0, 4: 0}, 'hasHashtag': {0: 1, 1: 1, 2: 0, 3: 0, 4: 0}, 'followers_count': {0: 801745, 1: 801745, 2: 801745, 3: 801745, 4: 801745}, 'retweet_count': {0: 17, 1: 94, 2: 0, 3: 0, 4: 0}, 'favourite_count': {0: 181, 1: 408, 2: 0, 3: 0, 4: 14}}
My data looks like the above. I want to decode the 'text' column.
ExampleText = b'\xf0\x9f\x8c\xb9 are red, violets are blue, if you want to buy us \xf0\x9f\x92\x90, here is a CLUE \xf0\x9f\x98\x89 Our #flowerpowered eye & cheek palette is AL\xe2\x80\xa6'
I could decode the text above as
ExampleText = ExampleText.decode('utf8')
However, when I try to decode the text in a pandas DataFrame column, I get the error. This is what I tried:
preparedData['text'] = preparedData['text'].decode('utf8')
Then the error I get is,
Traceback (most recent call last):
  File "F:/Level 4 Research Project/makeViral/main.py", line 23, in <module>
    main()
  File "F:/Level 4 Research Project/makeViral/main.py", line 19, in main
    preprocessedData = preprocessData(preparedData)
  File "F:\Level 4 Research Project\makeViral\preprocess.py", line 34, in preprocessData
    preparedData['text'] = preparedData['text'].decode('utf8')
  File "C:\Users\Kabilesh\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\generic.py", line 4376, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'decode'
I also tried
preparedData['text'] = preparedData['text'].str.decode('utf8', errors='strict')
This does not raise an error, but the resulting 'text' column contains only NaN:
'text': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan}
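A possible explanation for the NaN values, sketched under the assumption that the column holds text (str) rather than bytes, as the to_dict() output above suggests: on an object column, .str.decode appears to return NaN for any element it cannot decode instead of raising. The sample value below is a shortened, made-up stand-in for the real rows.

```python
import pandas as pd

# Assumption: each element is a str that merely looks like a bytes literal.
texts = pd.Series(["b'DEAD \\xf0\\x9f\\xa4\\xa3'"], dtype=object)

# The element type is str, so there is nothing to decode.
print(type(texts.iloc[0]))

decoded = texts.str.decode("utf8")
print(decoded.isna().all())
```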
-
Kabilesh over 5 years
I checked my data. Some values start with "b' (for example "b'@makeupbyalishan) and some with b' (for example b'Natural glam FTW!). I don't know how this happened. I took my data from Twitter with tweepy, and the data is UTF-8 encoded. What should I do?
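If the column really is mixed, one option is a small per-value function that handles each shape separately. This is a sketch only: to_text is a hypothetical helper, the sample values are made up, and it assumes the underlying bytes are valid UTF-8 (it does not catch UnicodeDecodeError).

```python
import ast


def to_text(value):
    """Hypothetical helper: return decoded text for the value shapes seen here."""
    if isinstance(value, bytes):
        # Real bytes: decode directly.
        return value.decode("utf-8")
    if isinstance(value, str) and value[:2] in ("b'", 'b"'):
        # Text that looks like a bytes literal: try to parse it back into bytes.
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return value  # not a complete literal; leave it unchanged
        if isinstance(parsed, bytes):
            return parsed.decode("utf-8")
    # Anything else is returned as-is.
    return value


samples = [
    "b'hello \\xf0\\x9f\\x92\\x95'",  # bytes literal stored as text
    b"\xf0\x9f\x92\x95",              # real bytes
    "b'unterminated",                 # truncated literal
    "already text",                   # ordinary string
]
results = [to_text(s) for s in samples]
print(results)
```

It could then be applied to the column with preparedData['text'].apply(to_text).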
-
Sven Harris over 5 years
See my edit, which I think resolves the issue you're having (at least on the examples you gave). It looks to me like your byte strings were turned into actual strings at some point in the process (maybe you saved the data to a file format and read it in again?). Without more digging I couldn't be sure.
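One way this kind of conversion can happen, shown with the standard csv module rather than your actual pipeline (which I haven't seen): writing a bytes object to a CSV stores its str() form, so reading the file back gives text that looks like a bytes literal.

```python
import ast
import csv
import io

original = "DEAD \U0001F923".encode("utf-8")  # real bytes

# csv.writer converts non-string values with str(), so the bytes
# are written as the text of their literal.
buffer = io.StringIO()
csv.writer(buffer).writerow([original])

# Reading the row back yields a str, not bytes.
buffer.seek(0)
(cell,) = next(csv.reader(buffer))
print(type(cell), cell)

# Parsing the literal recovers the original bytes.
recovered = ast.literal_eval(cell)
print(recovered.decode("utf-8"))
```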