What is the internal representation of a string in Python 3.x?

python string unicode python-3.x

15,992

Solution 1

There has been NO CHANGE in the internal Unicode representation between Python 2.x and Python 3.x, up to and including 3.2.

It's definitely NOT UTF-16. UTF-anything is a byte-oriented EXTERNAL representation.

Each code point (character, surrogate, etc.) has been assigned a number from range(0, 2 ** 21). This is called its "ordinal".
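
As a quick illustration of ordinals, here is a minimal sketch using only the built-ins ord() and chr(). The sample characters are arbitrary choices, and the single-item non-BMP literal assumes Python 3.3+ or a wide build.

```python
# ord() maps a one-character string to its ordinal; chr() is the inverse.
samples = ["A", "é", "€", "\U0001F600"]
ordinals = [ord(ch) for ch in samples]
print([hex(n) for n in ordinals])

# Every ordinal fits in 21 bits.
fits_in_21_bits = all(n < 2 ** 21 for n in ordinals)

# The round trip through chr() gives back the same characters.
round_trip = [chr(n) for n in ordinals]
```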

Really, the documentation you quoted says it all. Most Python binaries use 16-bit ordinals, which restricts you to the Basic Multilingual Plane ("BMP") unless you want to muck about with surrogates (handy if you can't find your hair shirt and your bed of nails is off being de-rusted). For working with the full Unicode repertoire, you'd prefer a "wide build" (32 bits wide).

Briefly, the internal representation in a unicode object is an array of 16-bit unsigned integers, or an array of 32-bit unsigned integers (using only 21 bits).
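
One way to check which kind of build you are running is sys.maxunicode, which the language reference quoted in the question mentions. The sketch below is only illustrative; the variable names are made up for the example.

```python
import sys

# 0xFFFF on a narrow (16-bit) build, 0x10FFFF on a wide (32-bit) build.
# On Python 3.3 and later the value is always 0x10FFFF.
is_wide = sys.maxunicode > 0xFFFF

# A character outside the BMP is one item on a wide build (and on 3.3+),
# but two surrogate items on a narrow build.
non_bmp = "\U0001F600"
item_count = len(non_bmp)

print(hex(sys.maxunicode), is_wide, item_count)
```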

Solution 2

The internal representation will change in Python 3.3, which implements PEP 393. The new representation picks the most compact fixed-width form that can hold every character in the string: Latin-1 (with a flag for pure ASCII), UCS-2, or UCS-4. A UTF-8 copy may additionally be cached on the object.

Implicit conversions into surrogate pairs will only be done when talking to legacy APIs (those only exist on Windows, where wchar_t is two bytes); the Python string will be preserved. Here are the release notes.

Solution 3

In Python 3.3 and above, the internal representation of the string will depend on the string, and can be any of latin-1, UCS-2 or UCS-4, as described in PEP 393.
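
You can observe the effect from Python code, at least on CPython 3.3+. The sketch below is an indirect measurement, not an official API for inspecting the storage kind: it compares sys.getsizeof() for two lengths of the same repeated character so that the fixed object header cancels out. The helper name bytes_per_char is made up for this example.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character by comparing two string lengths."""
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n

widths = {
    "ascii": bytes_per_char("a"),             # U+0061
    "latin-1": bytes_per_char("\xe9"),        # U+00E9
    "ucs-2": bytes_per_char("\u20ac"),        # U+20AC
    "ucs-4": bytes_per_char("\U0001F600"),    # U+1F600
}
print(widths)
```

Other Python implementations are free to store strings differently, so treat the numbers as a CPython detail.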

For previous Pythons, the internal representation depends on the build flags of Python. Python can be built with flag values --enable-unicode=ucs2 or --enable-unicode=ucs4. ucs2 builds do in fact use UTF-16 as their internal representation, and ucs4 builds use UCS-4 / UTF-32.
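
To see what a ucs2 build had to store for a character outside the BMP, you can encode it as UTF-16 on any Python 3 and look at the 16-bit code units. This is only a sketch of the encoding arithmetic; it does not inspect CPython's actual buffers.

```python
import struct

ch = "\U0001F600"  # a code point outside the BMP

# UTF-16 (little-endian, no BOM) needs two 16-bit code units for it.
utf16 = ch.encode("utf-16-le")
high, low = struct.unpack("<2H", utf16)
print(hex(high), hex(low))

# Reassemble the code point from the surrogate pair.
code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```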

Solution 4

Looking at the source code for CPython 3.1.5, in Include/unicodeobject.h:

/* --- Unicode Type ------------------------------------------------------- */

typedef struct {
    PyObject_HEAD
    Py_ssize_t length;          /* Length of raw Unicode data in buffer */
    Py_UNICODE *str;            /* Raw Unicode buffer */
    long hash;                  /* Hash value; -1 if not set */
    int state;                  /* != 0 if interned. In this case the two
                                 * references from the dictionary to this object
                                 * are *not* counted in ob_refcnt. */
    PyObject *defenc;           /* (Default) Encoded version as Python
                                   string, or NULL; this is used for
                                   implementing the buffer protocol */
} PyUnicodeObject;

The characters are stored as an array of Py_UNICODE. On most platforms, I believe Py_UNICODE is #defined as wchar_t.
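
If Py_UNICODE does follow wchar_t on your platform, its width can be checked from Python through ctypes. This is a rough sketch: it reports the size of the platform's wchar_t, not of Py_UNICODE itself, so it only tells you something under that assumption.

```python
import ctypes

# Size of the C wchar_t type on this platform, in bytes.
# Typically 2 on Windows and 4 on Linux and macOS.
wchar_bytes = ctypes.sizeof(ctypes.c_wchar)
print(wchar_bytes)
```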

Solution 5

It depends: see here. This is still true for Python 3 as far as internal representation goes.

Author by

thebat

Updated on June 11, 2022

Comments

thebat almost 2 years

In Python 3.x, a string consists of items of Unicode ordinal. (See the quotation from the language reference below.) What is the internal representation of Unicode string? Is it UTF-16?

The items of a string object are Unicode code units. A Unicode code unit is represented by a string object of one item and can hold either a 16-bit or 32-bit value representing a Unicode ordinal (the maximum value for the ordinal is given in sys.maxunicode, and depends on how Python is configured at compile time). Surrogate pairs may be present in the Unicode object, and will be reported as two separate items.
Joachim Sauer over 14 years

"Storing the unicode codepoints in 16 bit integers" is called "UCS-2". Doing the same thing with 32 bit integers is UCS-4.
rogerdpack over 12 years

So if you compile with UCS-4 mode on, then you won't have to worry about surrogate pairs at all?
John Machin over 12 years

@Joachim Sauer: the UCS-n are actually ancient encodings, not ways of storing the Unicode codepoints in (n*8)-bit integers. UCS-2 (which is limited to the BMP) was superseded by UTF-16. See e.g. en.wikipedia.org/wiki/UTF-16/UCS-2
John Machin over 12 years

@rogerdpack: That's the general idea. The preferred terminology is "narrow/wide Py_UNICODE", not "UCS-n".
Joachim Sauer over 12 years

@John: I don't understand what you're trying to say. What is an encoding if it's not a "way of storing Unicode codepoints" (or character information, more generally)? And yes, I'm well aware that UTF-16 is the modern alternative to UCS-2, but it's not the same as you say. UTF-16 supports all of Unicode, but UCS-2 only the BMP.
gwideman about 10 years

Looks to me like PEP 393 says the internal representation is the most compact (given a particular string) of ASCII, Latin-1 (UCS1), UCS2 or UCS4. I.e., specifically NOT utf-8/16/32. The reason: Python must be constant time to index into a string, hence characters must be uniform size, which is the case for UCS, but not for utf representations.
gwideman about 10 years

"There has been NO CHANGE...". Actually, see PEP 393 (Jan 2010), which spells out the change that subsequently came about, and another answer here "The internal representation will change..."
dotancohen over 9 years

This answer is wrong, see Tobu's answer below. PEP 393 was written one month after this answer was given.
Bjarke Ebert almost 9 years

"UTF-anything is a byte-oriented EXTERNAL representation". In some systems it is also a valid INTERNAL representation. For example, in many C++ based systems, UTF-8 is used internally, not only for I/O. And Go specifically uses UTF-8 as internal string representation.
user1601201 over 6 years

Latin-1 is a superset of ASCII, so there is no reason to include ASCII as one of the options. The options are (a) uniformly 8-bit, i.e. Latin-1, (b) uniformly 16-bit, i.e. UCS2, or (c) uniformly 32-bit, i.e. UCS4 (which is the same as UTF-32). Notably excluded are UTF-8 and UTF-16, which do not have a uniform number of bits per code point.
jcox almost 6 years

This answer seems to have confused Unicode with UTF-16, and code units with code points. See this Stack Overflow question for some clarification on the latter.
shripal mehta about 3 years

Can you please elaborate on what this piece of code does? Especially the part that is not in English: Привет мир!
scravy about 3 years

This is the correct answer and should be the accepted one.
Adam Jenča over 2 years

That Russian piece was used to describe what happens when you use characters outside ASCII. By the way, it's 'Hello world!' in Russian.