'ascii' codec can't decode byte 0xe9

python unicode encoding utf-8 decode

37,997

Solution 1

You are trying to encode bytestrings:

>>> '<counter name="Entreé">'.encode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 20: ordinal not in range(128)

Python is trying to be helpful here. You can only encode a Unicode string to bytes, so when you call .encode() on a bytestring, Python 2 first implicitly decodes it using the default encoding (ASCII), and that implicit decode is what fails.

The solution is to not encode data that is already encoded, or, if the data was encoded with a different codec than the one you need, to first decode it using the matching codec before encoding again.
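The hidden decode step can be made visible with a short sketch. It is written for Python 3 (where bytes has no .encode() at all), so `py2_style_encode` is a hypothetical helper that only imitates what Python 2 appears to do, not code from the interpreter:

```python
# Python 3 sketch of the implicit step Python 2 performs when .encode()
# is called on a bytestring. py2_style_encode is a hypothetical helper.
raw = '<counter name="Entre\u00e9">'.encode("utf-8")  # bytes, already UTF-8


def py2_style_encode(data, codec):
    """Imitate Python 2: decode with the default codec, then encode."""
    text = data.decode("ascii")  # the hidden decode that raises the error
    return text.encode(codec)


error = None
try:
    py2_style_encode(raw, "utf-8")
except UnicodeDecodeError as exc:
    error = exc  # raised by the decode step, before any encoding happens

# The fix: decode explicitly with the codec the data is really in.
text = raw.decode("utf-8")
```

The exception comes from the decode, which is why the message names the 'ascii' codec even though the code asked for UTF-8.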

If you have a mix of unicode and bytestring values, decode just the bytestrings or encode just the unicode values; try to avoid mixing the types. The following decodes byte strings to unicode first:

def ensure_unicode(v):
    if isinstance(v, str):
        v = v.decode('utf8')
    return unicode(v)  # convert anything not a string to unicode too

output_string = u'\n'.join([ensure_unicode(line) for line in output_lines])
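For readers on Python 3, a rough equivalent of the helper above might look like the following. The name `ensure_text`, the `codec` parameter, and the sample list are assumptions for illustration; UTF-8 is only a guess at the real encoding of the data:

```python
def ensure_text(value, codec="utf-8"):
    """Return value as str, decoding bytes with the given codec."""
    if isinstance(value, bytes):
        return value.decode(codec)
    return str(value)  # convert anything that is not bytes to str too


# A mixed list: str, UTF-8 encoded bytes, and a non-string value.
output_lines = ["<menu>", b'<counter name="Entre\xc3\xa9">', 42]
output_string = "\n".join(ensure_text(line) for line in output_lines)
```

Only the bytes values are decoded, so values that are already text are never decoded or encoded a second time.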

Solution 2

A simple example of the problem is:

>>> '\xe9'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)

\xe9 isn't an ASCII character, which means that your string is already encoded. You need to decode it into Python's unicode type and then encode it again in the serialization format you want.

Since I don't know where your string came from, I just peeked at the python codecs, picked something from Western Europe and gave it a go:

>>> '\xe9'.decode('cp1252')
u'\xe9'
>>> u'\xe9'.encode('utf-8')
'\xc3\xa9'
>>>

You'll have the best luck if you know exactly which encoding the file came from.
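To see why guessing is risky, here is a small Python 3 sketch that decodes the same byte with several candidate codecs. The helper name `try_codecs` and the candidate list are assumptions chosen for illustration:

```python
data = b"\xe9"


def try_codecs(data, candidates):
    """Map each codec name to the decoded text, or None if decoding fails."""
    results = {}
    for codec in candidates:
        try:
            results[codec] = data.decode(codec)
        except UnicodeDecodeError:
            results[codec] = None
    return results


results = try_codecs(data, ["utf-8", "cp1252", "latin-1", "cp437"])

# utf-8 rejects the lone byte; cp1252 and latin-1 both give U+00E9;
# cp437 decodes without error but yields a different character.
for codec, text in results.items():
    print(codec, repr(text))
```

A codec that decodes without raising is not necessarily the right one, so a successful decode alone does not confirm the guess.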

Solution 3

encode = turn a unicode string into a bytestring

decode = turn a bytestring into unicode

Since you already have a bytestring, you need decode to turn it into a unicode instance (assuming that is actually what you are trying to do):

output_string = '\n'.join(output_lines)
print output_string.decode("latin1")  #now this returns unicode
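The same join-then-decode pattern, sketched for Python 3 where the two types are bytes and str. The sample lines are made up, and latin-1 is an assumption about how the data was encoded:

```python
# Bytestrings as they might arrive from a file or socket (latin-1 assumed).
output_lines = [b"<menu>", b'<counter name="Entre\xe9">', b"</menu>"]

output_bytes = b"\n".join(output_lines)         # still bytes
output_string = output_bytes.decode("latin-1")  # decode: bytes -> str
encoded = output_string.encode("utf-8")         # encode: str -> bytes

print(output_string)
```

Decoding once after the join keeps every line in a single type until the text is actually needed.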

Author by

iqueqiorio

Updated on March 10, 2020

Comments

iqueqiorio about 4 years
I have done some research and seen solutions but none have worked for me.

Python - 'ascii' codec can't decode byte

This didn't work for me. I know that 0xe9 is the é character, but I still can't figure out how to get this working. Here is my code:
```
output_lines = ['<menu>', '<day name="monday">', '<meal name="BREAKFAST">', '<counter name="Entreé">', '<dish>', '<name icon1="Vegan" icon2="Mindful Item">', 'Cream of Wheat (Farina)','</name>', '</dish>', '</counter >', '</meal >', '</day >', '</menu >']
output_string = '\n'.join([line.encode("utf-8") for line in output_lines])
```
And this gives me the error: 'ascii' codec can't decode byte 0xe9

I have tried decoding, and I have tried to replace the "é", but I can't seem to get that to work either.
Joran Beasley about 9 years

AFAIK this also indicates he is using Python 2.x ... since 3.x no longer tries to implicitly convert things and you get a much clearer error (+1 of course)
Joran Beasley about 9 years

or just "\n".join(output_lines)
iqueqiorio about 9 years

@JoranBeasley and Martijn when I change it to output_string = '\n'.join([line for line in output_lines]) I still get the same error?
Martijn Pieters about 9 years

@iqueqiorio: do you have a mix of Unicode and byte strings in your list?
Mazdak about 9 years

@JoranBeasley Yeah! Sorry, I missed your answer!
iqueqiorio about 9 years

@MartijnPieters I don't think so. It is a long list; is there a way to check with an if statement?
Joran Beasley about 9 years

then you need to post the actual input that is causing an error ... maybe put it in a dpaste ... but as it is we cannot replicate your issue ... and you should post a full traceback ...
Martijn Pieters about 9 years

@iqueqiorio: that's not a link to a gist; don't worry though, I have it covered.
Joran Beasley about 9 years

@MartijnPieters that's a good solution :) (one I have had to use before ... I still think it's better to have well-formed input)
iqueqiorio about 9 years

@MartijnPieters I got the same error and an error on the line v = v.decode("utf8")
Joran Beasley about 9 years

surely not UnicodeDecodeError: ascii codec cannot ...
iqueqiorio about 9 years

I get Unicode Decode: 'utf8' codec can't decode byte
Joran Beasley about 9 years

try v.decode("latin1") ... this is where it's really handy to know the encoding you are using ahead of time ;P ... just wait till you get JIS encodings
Martijn Pieters about 9 years

@iqueqiorio: right, because you never specified what codec your data is encoded in, and I picked a common default for XML data. Where did the data come from? Do you have any more context that would let you determine the correct codec?
Martijn Pieters about 9 years

@JoranBeasley: or cp1252; neither will fail but may not produce readable output if it is the wrong codec.
Joran Beasley about 9 years

"\xe9".decode("utf8") == ERROR, however in latin1 it is é (e with an acute accent). As noted by @MartijnPieters, it also works decoding with "cp1252" ... and if you pick the wrong one you will get problems.
Martijn Pieters about 9 years

@iqueqiorio: then the web server can have provided you with the codec, or the XML format itself could have included the codec in the metadata.
iqueqiorio about 9 years

@MartijnPieters Okay, where could I find that info about what codec they use?
Martijn Pieters about 9 years

@iqueqiorio: depends; see retrieve links from web page using python and BeautifulSoup for sample code that retrieves the codec if available in the headers. Note that BeautifulSoup will find codec info in the document itself as needed.
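
As a rough illustration of reading the codec from HTTP headers, the following Python 3 sketch parses a Content-Type value with the standard library. The helper name `charset_from_content_type`, the header value, and the body bytes are all made up for the example; a real response may omit the charset entirely:

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the lowercased charset parameter of a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(failobj=default)


# Hypothetical header and body, as a web server might send them.
header = "text/xml; charset=ISO-8859-1"
body = b'<counter name="Entre\xe9">'

charset = charset_from_content_type(header, default="utf-8")
text = body.decode(charset)
```

If the header carries no charset, the fallback is only a guess, and the encoding declared inside the XML document itself would be the next place to look.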