I don't understand encode and decode in Python (2.7.3)

Solution 1

It's a little more complex in Python 2 than in Python 3, because Python 2 conflates the concepts of 'string' and 'bytestring' quite a bit; for background, see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets. Essentially, what you need to understand is that 'string' and 'character' are abstract concepts that a computer can't represent directly. A bytestring is a raw stream of bytes straight from disk (or one that can be written straight to disk). encode goes from abstract to concrete (you give it, preferably, a unicode string, and it gives you back a byte string); decode goes the opposite way.

The encoding is the rule that says, for example, that 'a' should be represented by the byte 0x61 and 'α' by the two-byte sequence 0xce 0xb1 (these are the UTF-8 values).
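As a small sketch of that rule, assuming UTF-8 as the encoding (the variable names are only illustrative, and the u'' prefix keeps the snippet runnable on both Python 2.7 and Python 3.3+):

```python
latin_a = u'a'            # U+0061 LATIN SMALL LETTER A
greek_alpha = u'\u03b1'   # U+03B1 GREEK SMALL LETTER ALPHA

# encode: abstract characters -> concrete bytes
a_bytes = latin_a.encode('utf-8')
alpha_bytes = greek_alpha.encode('utf-8')

# bytearray yields integer byte values on both Python 2 and Python 3
a_hex = [hex(b) for b in bytearray(a_bytes)]
alpha_hex = [hex(b) for b in bytearray(alpha_bytes)]

print(a_hex)      # ['0x61']
print(alpha_hex)  # ['0xce', '0xb1']
```

Decoding alpha_bytes with the same encoding gives back the original one-character string.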

Solution 2

My presentation from PyCon, Pragmatic Unicode, or, How Do I Stop The Pain covers all of these details.

Briefly, Unicode strings are sequences of integers called code points, and bytestrings are sequences of bytes. An encoding is a way to represent Unicode code points as a series of bytes. So unicode_string.encode(enc) will return the byte string of the Unicode string encoded with "enc", and byte_string.decode(enc) will return the Unicode string created by decoding the byte string with "enc".
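To make that concrete, here is a minimal sketch (not taken from the presentation; the sample text and the three encodings are arbitrary choices) showing that one Unicode string has a single sequence of code points but different byte representations depending on "enc". It should run on Python 2.7 and Python 3.3+:

```python
text = u'\u20ac5'   # EURO SIGN followed by the digit 5

# The Unicode string itself is a sequence of code points (integers)
code_points = [ord(ch) for ch in text]
print(code_points)  # [8364, 53]

# Each encoding maps those code points to a different series of bytes
utf8_bytes = text.encode('utf-8')
utf16_bytes = text.encode('utf-16-le')
cp1252_bytes = text.encode('cp1252')
print(len(utf8_bytes), len(utf16_bytes), len(cp1252_bytes))

# Decoding with the matching encoding recovers the same Unicode string
round_trips = [
    utf8_bytes.decode('utf-8'),
    utf16_bytes.decode('utf-16-le'),
    cp1252_bytes.decode('cp1252'),
]
print(all(s == text for s in round_trips))  # True
```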

Solution 3

Python 2.x has two types of strings:

  • str = "byte strings" = a sequence of octets. These are used for both "legacy" character encodings (such as windows-1252 or IBM437) and for raw binary data (such as struct.pack output).
  • unicode = "Unicode strings" = a sequence of UTF-16 or UTF-32 code units, depending on how Python is built.

This model was changed for Python 3.x:

  • 2.x unicode became 3.x str (and the u prefix was dropped from the literals).
  • A bytes type was introduced for representing binary data.
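For comparison, a short sketch of the Python 3 model (this snippet is Python 3 only; in Python 2 the text literal would need the u prefix, and the sample word is an arbitrary choice):

```python
text = 'na\u00efve'           # str: Unicode text, 5 characters
data = text.encode('utf-8')   # bytes: raw octets, 6 of them (the i-diaeresis takes 2)

print(type(text).__name__, len(text))  # str 5
print(type(data).__name__, len(data))  # bytes 6

# Python 3 refuses to mix the two types implicitly
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)  # False
```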

A character encoding is a mapping between Unicode strings and byte strings. To convert a Unicode string to a byte string, use the encode method:

>>> u'\u20AC'.encode('UTF-8')
'\xe2\x82\xac'

To convert the other way, use the decode method:

>>> '\xE2\x82\xAC'.decode('UTF-8')
u'\u20ac'
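The encoding argument matters because the same bytes mean different things under different encodings. The following sketch illustrates this, together with the optional errors argument; the choice of cp1252 and ascii as the "wrong" encodings is only for illustration. It should run on Python 2.7 and Python 3.3+:

```python
euro_utf8 = b'\xe2\x82\xac'   # the euro sign, encoded as UTF-8

# Right encoding: one character
as_utf8 = euro_utf8.decode('utf-8')
print(len(as_utf8))    # 1

# Wrong encoding: no error, but three unrelated characters (mojibake)
as_cp1252 = euro_utf8.decode('cp1252')
print(len(as_cp1252))  # 3

# An encoding that cannot represent these bytes raises by default ...
try:
    euro_utf8.decode('ascii')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
print(strict_failed)   # True

# ... unless the errors argument asks for a different policy
replaced = euro_utf8.decode('ascii', 'replace')
print(replaced == u'\ufffd\ufffd\ufffd')  # True
```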

Solution 4

Yes, a byte string is an octet string. Encoding and decoding happen when text is input or output (from/to the console, files, the network, ...). Your console may use UTF-8 internally, your web server may serve latin-1, and certain file formats need unusual encodings such as BibTeX's accent notation: fran\c{c}aise. You need to convert from/to them on input/output.
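Converting from one encoding to another follows the same pattern: decode the input bytes to Unicode, then encode the Unicode string for the output. A minimal sketch, assuming the input arrives as latin-1 and the output should be UTF-8 (both choices are only examples); it should run on Python 2.7 and Python 3.3+:

```python
latin1_bytes = b'fran\xe7aise'   # "française" as a latin-1 byte string

# Input side: bytes -> Unicode, using the encoding of the source
text = latin1_bytes.decode('latin-1')

# Output side: Unicode -> bytes, using the encoding of the destination
utf8_bytes = text.encode('utf-8')

print(len(latin1_bytes))  # 9  (one byte per character in latin-1)
print(len(text))          # 9  characters
print(len(utf8_bytes))    # 10 (the c-cedilla takes two bytes in UTF-8)
```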

The {en|de}code methods do this. They are often called behind the scenes (for example, print u"hello world" encodes the Unicode string to whatever encoding your terminal uses).

Author by

Narcisse Doudieu Siewe

Updated on July 17, 2020

Comments

  • Narcisse Doudieu Siewe almost 4 years

    I tried to understand encode and decode in Python by myself, but nothing is really clear to me.

    1. str.encode([encoding,[errors]])
    2. str.decode([encoding,[errors]])

    First, I don't understand the need for the "encoding" parameter in these two functions.

    What is the output of each function, its encoding? What is the use of the "encoding" parameter in each function? I don't really understand the definition of "bytes string".

    I have an important question: is there some way to go from one encoding to another? I have read some text on ASN.1 about "octet string", so I wondered whether it was the same as "bytes string".

    Thanks for your help.