How to print utf-8 to console with Python 3.4 (Windows 8)?

48,984

Solution 1

What I'm trying to do is print utf-8 card symbols (♠,♥,♦,♣) from a python module to a windows console

UTF-8 is a byte encoding of Unicode characters. ♠♥♦♣ are Unicode characters which can be reproduced in a variety of encodings and UTF-8 is one of those encodings—as a UTF, UTF-8 can reproduce any Unicode character. But there is nothing specifically “UTF-8” about those characters.

Other encodings that can reproduce the characters ♠♥♦♣ are Windows code page 850 and 437, which your console is likely to be using under a Western European install of Windows. You can print ♠ in these encodings but you are not using UTF-8 to do so, and you won't be able to use other Unicode characters that are available in UTF-8 but outside the scope of these code pages.

print(u'♠')
UnicodeEncodeError: 'charmap' codec can't encode character '\u2660'

In Python 3 this is the same as the print('♠') test you did above, so there is something different about how you are invoking the script containing this print, compared to your py -3.4. What does sys.stdout.encoding give you from the script?

To get print working correctly you would have to make sure Python picks up the right encoding. If it is not doing that adequately from the terminal settings you would indeed have to set PYTHONIOENCODING to cp437.

>>> text = '♠'
>>> print(text.encode('utf-8'))
b'\xe2\x99\xa0'

print can only print Unicode strings. For other types including the bytes string that results from the encode() method, it gets the literal representation (repr) of the object. b'\xe2\x99\xa0' is how you would write a Python 3 bytes literal containing a UTF-8 encoded ♠.

If what you want to do is bypass print's implicit encoding to PYTHONIOENCODING and substitute your own, you can do that explicitly:

>>> import sys
>>> sys.stdout.buffer.write('♠'.encode('cp437'))

This will of course generate wrong output for any consoles not running code page 437 (eg non-Western-European installs). Generally, for apps using the C stdio, like Python does, getting non-ASCII characters to the Windows console is just too unreliable to bother with.

Solution 2

Since Python 3.7.x, You can reconfigure stdout :

import sys
sys.stdout.reconfigure(encoding='utf-8')

Solution 3

Do not encode to utf-8; print Unicode directly instead:

print(u'♠')

See how to print Unicode to Windows console.

Share:
48,984
Austin A
Author by

Austin A

I <3 DATA!

Updated on December 14, 2021

Comments

  • Austin A
    Austin A over 2 years

    I'm trying to print utf-8 card symbols (♠,♥,♦︎︎,♣) from a python module to a windows console. The console that I'm using is git bash and I'm using console2 as a front-end. I've tried/read a number of approaches below and nothing has worked so far.

    • Made sure the console can handle utf-8 characters. These two tests make me believe that the console isn't the problem.

      enter image description here

    • Attempt the same thing from the python module.
      When I execute the .py, this is the result.

       print(u'♠')
       UnicodeEncodeError: 'charmap' codec can't encode character '\u2660' in position 0: character maps to <undefined>
      
    • Attempt to encode ♠. This gives me back the unicode set encoded in utf-8, but still no spade symbol.

       text = '♠'
       print(text.encode('utf-8'))
       b'\xe2\x99\xa0'
      

    I feel like I'm missing a step or not understanding the whole encode/decode process. I've read this, this, and this. The last of the pages suggests wrapping the sys.stdout into the code but this article says using stdout is unnecessary and points to another page using the codecs module.

  • jfs
    jfs almost 9 years
    It is not how it works. As win_unicode_console demonstrates you can write any Unicode character (though only BMP characters will be display by Windows console).
  • user87690
    user87690 almost 9 years
    @J.F.Sebastian: What do you mean? With what I say you don't agree?
  • jfs
    jfs almost 9 years
    your claim is essentially that all OS I/O interfaces are bytes based. WriteConsoleW() (used by win_unicode_console) is a counter-example
  • user87690
    user87690 almost 9 years
    @J.F.Sebastian: I view WriteConsoleW as accepting a UTF-16-LE encoded bytes, not a string.
  • jfs
    jfs almost 9 years
    it is incorrect. You should separate the abstraction e.g., unicode type in Python 2 and its implementation UCS-2, UCS-4 (narrow, wide builds). The implementation can be improved e.g,. Python 3 uses flexible string representation but the abstraction stays the same. In particular, WriteConsoleW() might have started as UCS-2 but it is utf-16le now. Windows console itself is still UCS-2 i.e., you can write (and copy/paste) utf-16le but only BMP characters can be displayed (even the font support corresponding astral characters)
  • user87690
    user87690 almost 9 years
    @J.F.Sebastian: That's what I'm doing – saying that a string is an abstraction, represented by unicode type in Python 2, while when communicating with the OS environment, you cannot use this abstraction directly, you have to encode it somehow. Of course, the implementation of the unicode also uses come internal encoding.
  • jfs
    jfs almost 9 years
    Again, don't mix the abstraction and how it happens to be implemented. You can use Unicode string type directly. Compare how os.listdir(unicode_string) works on Unix (OS uses bytes interfaces) and on Windows (Unicode API). One of the improvements of Python 3 is that it uses Unicode API more on Windows.
  • user87690
    user87690 almost 9 years
    @J.F.Sebastian: By “you cannot use string directly” I meant the fact that when you want to communicate with the OS environment, you are leaving the Python realm, so you cannot use Python abstractions. So some kind of encoding is needed under the hoods. Using this obvious truth I wanted to explain what is the encoding process good for. By no means I meant that you should encode before calling print or os.listdir. In fact, I meant the opposite. As I said “the encoding is done under the hoods”, “you should just call print(a_string)”, which is exactly what you suggest.
  • jfs
    jfs almost 9 years
    wrong. Unicode is not Python abstraction. Maybe an example would help: let's take the number 1.5. We can store it in memory and on disk using binary32 format: b'\x00\x00\xc0\x3f'. If we apply your logic then there are no numeric interfaces there are only bytes interfaces. Integers are encoded as bytes too. Do you consider pid_t getpid() to be bytes interface? Byte itself is an abstraction though it is so successful and common that is indistinguishable from reality for some people.
  • user87690
    user87690 almost 9 years
    @J.F.Sebastian: I don't claim that Unicode is a Python abstraction. I also don't claim that there are only bytes interfaces. Well, it actually depeneds on what we exactly mean. Yes, integers are encoded as bytes too. And calling a C function e.g. via ctypes needs encoding of Python integers, which are a base for abstract intergers, into bytes (of particular width and endianess), which are a level lower.
  • user87690
    user87690 almost 9 years
    @J.F.Sebastian: Note that none of your objections is against anything I recommend to actually do in my answer. I think there is just some misunderstanding between us. Maybe we should contiue with the discussion somewhere else.
  • jfs
    jfs almost 9 years
    these messages are for your benefit. If you don't want to learn. It is fine by me.
  • user87690
    user87690 almost 9 years
    @J.F.Sebastian: I was addressing the fact that extended discussions should be avoided in comments. And suggesting to move our discussion somewhere else.
  • JinSnow
    JinSnow over 7 years
    Python 3.6 print('♠') (strings are UTF8 by default )
  • jfs
    jfs over 7 years
    @Guillaume wrong. Do not confuse text represented as Unicode strings and text represented as bytes using utf-8 encoding. Pep 528 and 529 has nothing to do with it.
  • JinSnow
    JinSnow over 7 years
    thanks for your correction! Do you mean that the default mode of python 3.6 is unicode (and not UTF8) right?
  • jfs
    jfs over 7 years
    @Guillaume u'♠' == '♠' in Python 3.3+ or if from __future__ import unicode_literals is used.
  • JinSnow
    JinSnow over 7 years
    for the (other) rookies: PEP 528 -- Change Windows console encoding to UTF-8 // PEP 529 -- Change Windows filesystem encoding to UTF-8
  • jfs
    jfs over 7 years
    @Guillaume: to be crystal clear: utf-8 is NOT synonym for Unicode. It is just one of many character encodings. The fact that utf-8 is mentioned in both the question and the peps is a coincidence e.g., UnicodeEncodeError may be fixed by installing win-unicode-console package that does not use utf-8 anywhere (it works with Unicode strings directly). Follow the link in the answer.
  • JinSnow
    JinSnow over 7 years
    @sebastian thanks to your patience I finally understood that point. I still don't understand why I can't print "ç" (french letter) in the console (U+00E7) but If I understood well, it can't be fixed (without risking to get some weird bugs). Do you confirm? (I'm running python 3.6)
  • jfs
    jfs over 7 years
    @Guillaume no. You should be able to print any Unicode character (and even to display it correctly if you've configured the font (BMP-only)) Have you read the linked answer? If it doesn't help; create a minimal code example such as print('\xe7') and post it as a new Stack Overflow question with the full traceback.
  • JinSnow
    JinSnow over 7 years
    Python 3.6: "the default console on Windows accept all Unicode characters with that version" (well, most of it for me) BUT you need to configure the console: right click on the top of the windows (of the cmd or the python IDLE), in default/font choose the "Lucida console".
  • jfs
    jfs over 7 years
    @Guillaume: if you click the only link in the answer then you should see the more detail answer. The same comment applies (to address your comment).
  • michelek
    michelek about 7 years
    Thank you for the final line: "Generally, for apps using the C stdio, like Python does, getting non-ASCII characters to the Windows console is just too unreliable to bother with."
  • michelek
    michelek about 7 years
    If you ever meet in person - wish I could be present :)