Python - 'ascii' codec can't decode byte

318,004

Solution 1

"你好".encode('utf-8')

encode converts a unicode object to a str (byte string) object. But here you have invoked it on a str object (because the literal has no u prefix), so Python first has to convert the str to a unicode object. It does the equivalent of

"你好".decode().encode('utf-8')

But the decode uses the default 'ascii' codec and fails because the bytes aren't valid ASCII. That's why you get a complaint about not being able to decode.
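The two-step behaviour can be reproduced explicitly. The following is only a sketch, written for Python 3 (where nothing is converted implicitly), and py2_style_encode is a hypothetical helper name invented for illustration, not a real API. It assumes the byte string holds the UTF-8 bytes of "你好", as it would in a UTF-8 terminal.

```python
# Python 3 sketch of what Python 2 did for "你好".encode('utf-8').
# py2_style_encode is a hypothetical helper, not a real API.

def py2_style_encode(byte_string, target_encoding):
    # Step 1 (implicit in Python 2): decode the bytes with the default 'ascii' codec.
    text = byte_string.decode("ascii")
    # Step 2 (what was actually requested): encode the resulting text.
    return text.encode(target_encoding)

# The bytes a Python 2 str literal "你好" holds when the source/terminal is UTF-8.
raw = b"\xe4\xbd\xa0\xe5\xa5\xbd"

# Pure-ASCII bytes survive the implicit step, which is why the problem
# often stays hidden until non-ASCII data shows up.
ascii_result = py2_style_encode(b"hello", "utf-8")

try:
    py2_style_encode(raw, "utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    # The failure comes from step 1, even though the caller asked to encode.
    error_message = str(exc)

print(error_message)
```

The exception is raised by the decode in step 1, which matches the "can't decode" message seen when calling encode.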

Solution 2

Always encode from unicode to bytes.
In this direction, you get to choose the encoding.

>>> u"你好".encode("utf8")
'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> print _
你好

The other way is to decode from bytes to unicode.
In this direction, you have to know what the encoding is.

>>> data = '\xe4\xbd\xa0\xe5\xa5\xbd'
>>> print data
你好
>>> data.decode('utf-8')
u'\u4f60\u597d'
>>> print _
你好

This point can't be stressed enough. If you want to avoid playing unicode "whack-a-mole", it's important to understand what's happening at the data level. Here it is explained another way:

  • A unicode object is already decoded; you never want to call decode on it.
  • A bytestring object is already encoded; you never want to call encode on it.

Now, on seeing .encode on a byte string, Python 2 first tries to implicitly convert it to text (a unicode object). Similarly, on seeing .decode on a unicode string, Python 2 implicitly tries to convert it to bytes (a str object).

These implicit conversions are why you can get a UnicodeDecodeError when you've called encode. Encoding normally operates on a unicode object; when it receives a str instead, Python 2 implicitly decodes it into a unicode object before re-encoding it with the requested encoding. This implicit conversion uses the default 'ascii' codec (or whatever codec sys.getdefaultencoding() reports; usually this is 'ascii'), which gives you a decoding error inside an encode call.

In fact, in Python 3 the methods str.decode and bytes.encode don't even exist. Their removal was a controversial attempt to avoid this common confusion.
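A quick way to check the Python 3 behaviour for yourself is the sketch below. It assumes a Python 3 interpreter; the variable names are illustrative.

```python
# Python 3 sketch: the "wrong direction" methods are simply absent.
text = "你好"
data = text.encode("utf-8")

str_has_decode = hasattr(text, "decode")    # text cannot be decoded again
bytes_has_encode = hasattr(data, "encode")  # bytes cannot be encoded again

try:
    data.encode("utf-8")
    attribute_error_raised = False
except AttributeError:
    # Instead of a confusing UnicodeDecodeError, the mistake fails immediately.
    attribute_error_raised = True

print(str_has_decode, bytes_has_encode, attribute_error_raised)
```

So the mistake from the question cannot produce a misleading codec error there; it fails with an AttributeError at the call site.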

Solution 3

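You can try this (Python 2 only; note that it changes the default encoding used for implicit conversions in the whole interpreter):

import sys
reload(sys)
sys.setdefaultencoding("utf-8")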

Or add the following line at the top of your .py file to declare the encoding of the source file (see PEP 263):

# -*- coding: utf-8 -*-

Solution 4

If you're using Python < 3, you'll need to tell the interpreter that your string literal is Unicode by prefixing it with a u:

Python 2.7.2 (default, Jan 14 2012, 23:14:09) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> "你好".encode("utf8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
>>> u"你好".encode("utf8")
'\xe4\xbd\xa0\xe5\xa5\xbd'

Further reading: Unicode HOWTO.

Solution 5

You use u"你好".encode('utf8') to encode a unicode string. But if you start from the byte string "你好" and want the unicode object, you should decode it instead, like this:

"你好".decode("utf8")

You will get what you want. It is worth reading up on the difference between encode and decode.
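If you only have an object and are not sure whether it is text or bytes, one possible approach is to normalise to text at the boundary and to change encodings by decoding first and encoding second. This is a Python 3 sketch; to_text and transcode are hypothetical helper names, and the utf-8 default is an assumption about your data.

```python
# Python 3 sketch. to_text and transcode are hypothetical helpers.

def to_text(value, encoding="utf-8"):
    """Return text, decoding only if the value is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value  # already text, so leave it alone


def transcode(data, source_encoding, target_encoding):
    """Change the encoding of bytes: decode to text, then encode again."""
    return data.decode(source_encoding).encode(target_encoding)


utf8_bytes = b"\xe4\xbd\xa0\xe5\xa5\xbd"

from_bytes = to_text(utf8_bytes)  # bytes are decoded
from_text = to_text("你好")        # text passes through unchanged

# Re-encode the same characters as UTF-16 (little endian, no BOM).
utf16_bytes = transcode(utf8_bytes, "utf-8", "utf-16-le")

print(from_bytes == from_text)
print(utf16_bytes.decode("utf-16-le") == from_text)
```

The point of the design is that decode is only ever called on bytes and encode only ever on text, so there is no implicit conversion left to go wrong.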

Author by

thoslin

Updated on November 17, 2020

Comments

  • thoslin
    thoslin over 3 years

    I'm really confused. I tried to encode but the error said can't decode....

    >>> "你好".encode("utf8")
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
    

    I know how to avoid the error with "u" prefix on the string. I'm just wondering why the error is "can't decode" when encode was called. What is Python doing under the hood?

  • thoslin
    thoslin about 12 years
    So do you mean that Python decodes the bytestring before encoding?
  • MxLDevs
    MxLDevs about 12 years
    If you're encoding a string, why does it throw a decode error?
  • Jon Tirsen
    Jon Tirsen about 11 years
    So what is the solution? Especially if I don't have a string literal, I just have a string object.
  • Winston Ewert
    Winston Ewert about 11 years
    @JonTirsen, you should not be encoding a string object. A string object is already encoded. If you need to change the encoding, you need to decode it into a unicode string and then encode it as the desired encoding.
  • deinonychusaur
    deinonychusaur almost 11 years
    So to state it clearly from above you can "你好".decode('utf-8').encode('utf-8')
  • deinonychusaur
    deinonychusaur almost 11 years
    @WinstonEwert I guess I was confused. The encoding business tends to leave me eternally confused. I guess my confusion came from my own problem of not knowing whether the input is a string or a unicode string, and what encoding it may have.
  • Winston Ewert
    Winston Ewert almost 11 years
    @deinonychusaur, yeah... I get that.
  • wim
    wim almost 10 years
    @thoslin exactly, I added more details.
  • SIslam
    SIslam over 8 years
    Fantastic aphorism mate!! all are lucid now.. I did not know that encode converts an unicode into STRING (i.e. without leading u)!! But this string is like \xe4\xbd\xa0\xe5\xa5\xbd byte but if you print them they print fine as @wim explained
  • NoBugs
    NoBugs over 6 years
    What is _, and why are your print statements missing parentheses?
  • wim
    wim over 6 years
    @NoBugs 1. in the REPL, _ refers to the previous value 2. because this is a python-2.x question.
  • shleimel
    shleimel almost 3 years
    @MxLDevs because you can't get a decode error on an encode action.
  • Alexey
    Alexey about 2 years
    This must be accepted answer!
  • Alexander Samoylov
    Alexander Samoylov almost 2 years
    Thanks for the hint. For my case the right solution was just .decode('utf-8'). I ran subprocess.Popen(...).communicate(), which returned bytes containing the German characters ä, ö, ü, and the plain .decode() (without the 'utf-8' parameter) failed. With the 'utf-8' parameter it works.
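The subprocess case from the last comment can be sketched as follows. This is a Python 3 illustration under stated assumptions: the child process is a stand-in that writes the UTF-8 bytes for "äöü" to stdout, and the parent has to know that the output is UTF-8. Passing 'ascii' explicitly reproduces what the Python 2 default did.

```python
# Python 3 sketch: output captured from a subprocess is bytes and must be
# decoded with the encoding the child actually used (assumed UTF-8 here).
import subprocess
import sys

# Stand-in child process: writes the UTF-8 bytes of "äöü" to stdout.
child_code = "import sys; sys.stdout.buffer.write(b'\\xc3\\xa4\\xc3\\xb6\\xc3\\xbc')"

proc = subprocess.Popen([sys.executable, "-c", child_code], stdout=subprocess.PIPE)
raw_output, _ = proc.communicate()

# Decoding as ASCII fails, just like the implicit default in Python 2.
try:
    raw_output.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Naming the real encoding works.
decoded = raw_output.decode("utf-8")

print(ascii_failed, decoded)
```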