Is there a Python constant for Unicode whitespace?

Short answer: No. I have personally grepped for these characters (specifically, the numeric code points) in the Python code base, and such a constant is not there.

The sections below explain why it is not necessary and how whitespace detection is implemented without this information being available as a constant. Having such a constant would also be a really bad idea.

If the Unicode Consortium added another character/code-point that is semantically whitespace, the maintainers of Python would have a poor choice between continuing to support semantically incorrect code or changing the constant and possibly breaking pre-existing code that might (inadvisably) make assumptions about the constant not changing.

How could it add these code points? There are 1,111,998 possible characters in Unicode, but only 120,672 are occupied as of version 8. Each new version of Unicode may add characters, and one of those new characters might be a form of whitespace.
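One way to arrive at the 1,111,998 figure is to start from the full code space and subtract the code points that can never hold a character. This is a small illustration with made-up variable names, not something taken from the Python source; `unicodedata.unidata_version` reports which Unicode version the running interpreter was built against.

```python
import sys
import unicodedata

# Full Unicode code space: U+0000..U+10FFFF.
total_code_points = sys.maxunicode + 1            # 1,114,112 on Python 3.3+

# Surrogates (U+D800..U+DFFF) can never be assigned to characters.
surrogates = 0xDFFF - 0xD800 + 1                  # 2,048

# Noncharacters: U+FDD0..U+FDEF plus the last two code points
# of each of the 17 planes.
noncharacters = (0xFDEF - 0xFDD0 + 1) + 2 * 17    # 66

possible_characters = total_code_points - surrogates - noncharacters
print(possible_characters)

# The Unicode version this interpreter's character database came from.
print(unicodedata.unidata_version)
```

A newer interpreter will usually report a newer Unicode version, which is exactly why the set of whitespace characters should not be assumed fixed.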

The information is stored in a dynamically generated C function

The C function that determines what is whitespace in Unicode is produced by the following Python code, which writes the function out as generated C source.

# Generate code for _PyUnicode_IsWhitespace()
print("/* Returns 1 for Unicode characters having the bidirectional", file=fp)
print(" * type 'WS', 'B' or 'S' or the category 'Zs', 0 otherwise.", file=fp)
print(" */", file=fp)
print('int _PyUnicode_IsWhitespace(const Py_UCS4 ch)', file=fp)
print('{', file=fp)
print('    switch (ch) {', file=fp)
for codepoint in sorted(spaces):
    print('    case 0x%04X:' % (codepoint,), file=fp)
print('        return 1;', file=fp)
print('    }', file=fp)
print('    return 0;', file=fp)
print('}', file=fp)
print(file=fp)

The result is a switch statement, which is a constant block of code, but the information is not exposed as a module-level "constant" the way string.whitespace is in the string module. It is instead buried in a function compiled from C and is not directly accessible from Python.

This is likely because as more code points are added to Unicode, we would not be able to change constants for backwards compatibility reasons.
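The comment emitted by the generator states the rule: bidirectional type 'WS', 'B' or 'S', or category 'Zs'. As a rough sketch, the same rule can be reproduced from Python with the `unicodedata` module and compared against `str.isspace()`. This assumes that `unicodedata` and `str.isspace()` are built from the same Unicode data in your interpreter; the helper name `is_whitespace_by_rule` is made up for this illustration.

```python
import sys
import unicodedata

def is_whitespace_by_rule(ch):
    """Apply the rule from the generated comment: bidi WS/B/S or category Zs."""
    return (unicodedata.bidirectional(ch) in ("WS", "B", "S")
            or unicodedata.category(ch) == "Zs")

# Scan the whole code space twice: once with the rule, once with isspace().
by_rule = {c for c in range(sys.maxunicode + 1) if is_whitespace_by_rule(chr(c))}
by_isspace = {c for c in range(sys.maxunicode + 1) if chr(c).isspace()}

print(len(by_rule), len(by_isspace))
print(by_rule == by_isspace)
```

If the two sets ever differ on your interpreter, trust `str.isspace()`, since that is what the string methods themselves use.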

The Generated Code

Here's the generated code currently at the tip:

int _PyUnicode_IsWhitespace(const Py_UCS4 ch)
{
    switch (ch) {
    case 0x0009:
    case 0x000A:
    case 0x000B:
    case 0x000C:
    case 0x000D:
    case 0x001C:
    case 0x001D:
    case 0x001E:
    case 0x001F:
    case 0x0020:
    case 0x0085:
    case 0x00A0:
    case 0x1680:
    case 0x2000:
    case 0x2001:
    case 0x2002:
    case 0x2003:
    case 0x2004:
    case 0x2005:
    case 0x2006:
    case 0x2007:
    case 0x2008:
    case 0x2009:
    case 0x200A:
    case 0x2028:
    case 0x2029:
    case 0x202F:
    case 0x205F:
    case 0x3000:
        return 1;
    }
    return 0;
}
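Although this table is not exposed as a constant, the built-in string methods appear to consult it, which is why a constant is rarely needed in practice. A brief sketch; the sample strings are made up for illustration:

```python
# No-break space (U+00A0) and ideographic space (U+3000) around a word.
padded = "\u00a0hello\u3000"

# strip() with no argument removes the characters str.isspace() accepts.
stripped = padded.strip()
print(repr(stripped))

# split() with no argument splits on runs of the same characters.
parts = "a\u2003b\u205fc".split()
print(parts)

# isspace() answers the membership question directly.
print("\u1680".isspace(), "x".isspace())
```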

Making your own constant:

The following code (from my answer here), in Python 3, generates a constant of all whitespace:

import re
import sys

s = ''.join(chr(c) for c in range(sys.maxunicode+1))
ws = ''.join(re.findall(r'\s', s))

As an optimization, you could store this in a code base, instead of auto-generating it every new process, but I would caution against assuming that it would never change.

>>> ws
'\t\n\x0b\x0c\r\x1c\x1d\x1e\x1f \x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000'

(Other answers to the question linked show how to get that for Python 2.)

Remember that at one point, some people probably thought 256-character encodings were all that we'd ever need.

>>> import string
>>> string.whitespace
' \t\n\r\x0b\x0c'

If you're insisting on keeping a constant in your code base, just generate the constant for your version of Python, and store it as a literal:

unicode_whitespace = u'\t\n\x0b\x0c\r\x1c\x1d\x1e\x1f \x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000'

The u prefix makes it unicode in Python 2 (2.7 happens to recognize the entire string above as whitespace too), and in Python 3 it is ignored as string literals are unicode by default.
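If you do store the literal, it may be worth checking it against the running interpreter, for example in a unit test. The following is a minimal sketch under that assumption: it recomputes the set with `str.isspace()` rather than the regex used above, and the function names `current_whitespace` and `stored_constant_is_stale` are hypothetical.

```python
import sys

# Literal generated once and stored in the code base (same string as above).
unicode_whitespace = (
    '\t\n\x0b\x0c\r\x1c\x1d\x1e\x1f \x85\xa0\u1680\u2000\u2001\u2002\u2003'
    '\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000'
)

def current_whitespace():
    """Recompute the whitespace characters known to the running interpreter."""
    return ''.join(chr(c) for c in range(sys.maxunicode + 1)
                   if chr(c).isspace())

def stored_constant_is_stale(stored=unicode_whitespace):
    """Return True if the stored literal no longer matches this interpreter."""
    return set(stored) != set(current_whitespace())

if stored_constant_is_stale():
    print("unicode_whitespace is out of date; regenerate it")
```

Running the check in a test suite rather than at import time keeps the full scan of the code space out of every process start.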

Author by

Mark Ransom

I've been a software developer for a lot longer than I'm willing to admit. My current interests are C++ and Python on Windows, but I've been known to dabble in Linux and I try to be language agnostic when I can.

Updated on July 30, 2022

Comments

  • Mark Ransom
    Mark Ransom almost 2 years

    The string module contains a whitespace attribute, which is a string consisting of all the ASCII characters that are considered whitespace. Is there a corresponding constant that includes Unicode spaces too, such as the no-break space (U+00A0)? We can see from the question "strip() and strip(string.whitespace) give different results" that at least strip is aware of additional Unicode whitespace characters.

    This question was identified as a duplicate of "In Python, how to list all characters matched by POSIX extended regex [:space:]?", but the answers to that question identify ways of searching for whitespace characters to generate your own list. This is a time-consuming process. My question was specifically about a constant.

    • 一二三
      一二三 almost 8 years
      What do you mean by "whitespace"? All of the characters with the White_Space property, or just those that are separators?
    • Russia Must Remove Putin
      Russia Must Remove Putin almost 8 years
      By the way, Python also considers File Separator, Group Separator, Record Separator, and Unit Separator to be whitespace in addition to the White_Space property list, for a total of 29 characters (as of now.)
  • Mark Ransom
    Mark Ransom almost 8 years
    When you say "we" would not be able to change, are you speaking as an official maintainer of Python? Is this generated C code part of the source?
  • Russia Must Remove Putin
    Russia Must Remove Putin almost 8 years
    @MarkRansom I'm not an official maintainer. I have pasted in and linked to the generated code, FYI.
  • Mark Ransom
    Mark Ransom almost 8 years
    I like the idea of auto-generating the list in a file that you can simply import. Something like with open('ws.py','w') as f: f.write('whitespace = %s\n' % repr(ws))
  • Clément
    Clément almost 5 years
    Turns out Python's autogenerated list of Unicode whitespaces is incorrect. Note that U+200B ("zero width space") is missing. There's even a bug report for it. bugs.python.org/issue10567
  • Russia Must Remove Putin
    Russia Must Remove Putin almost 5 years
    @KiranJonnalagadda that bug report was closed because the Unicode consortium, as of Unicode 4.0.1, does not consider the zero width space to be whitespace anymore, see unicode.org/versions/Unicode4.0.1 - it is considered a format character...
  • Clément
    Clément almost 5 years
    Ah! I have users copy pasting U+200B characters into text fields, so I guess I have to maintain my own list of characters to strip.
  • Punit
    Punit about 4 years
    but import unicodedata as u; print(u.bidirectional("\u3000")) gives me "WS" as result
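Two points from the comments above can be checked directly: U+200B is a format character rather than whitespace, and U+3000 carries both the 'Zs' category and the 'WS' bidirectional type. The sketch below also shows one possible way to strip extra characters such as U+200B alongside whitespace; `EXTRA_STRIP_CHARS` and `strip_extended` are hypothetical names for a project-specific helper, not part of Python.

```python
import unicodedata

ZERO_WIDTH_SPACE = "\u200b"

# U+200B is a format character (category Cf), so isspace() rejects it.
print(unicodedata.category(ZERO_WIDTH_SPACE), ZERO_WIDTH_SPACE.isspace())

# U+3000 has both category Zs and bidirectional type WS.
print(unicodedata.category("\u3000"), unicodedata.bidirectional("\u3000"))

# Hypothetical project-specific extras to strip in addition to whitespace.
EXTRA_STRIP_CHARS = ZERO_WIDTH_SPACE

def strip_extended(text, extra=EXTRA_STRIP_CHARS):
    """Strip Unicode whitespace plus the extra characters from both ends."""
    previous = None
    # Repeat until nothing changes, since the two kinds can be interleaved.
    while text != previous:
        previous = text
        text = text.strip().strip(extra)
    return text

print(repr(strip_extended("\u200b \u200bhello \u200b\u00a0")))
```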