Chinese and Japanese character support in Python

49,048

Solution 1

Please do read the Python Unicode HOWTO; it explains how to process and include non-ASCII text in your Python code.

If you want to include Japanese text literals in your code, you have several options:

  • Use unicode literals (which create unicode objects instead of byte strings) and write every non-ASCII codepoint as a Unicode escape sequence. These take the form \uabcd: a backslash, a u and 4 hexadecimal digits:

    ru = u'\u30EB'
    

    would be one character, the katakana 'ru' codepoint ('ル').

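  • Use unicode literals, but include the characters directly, in the encoding of the source file. Your text editor saves files in a given encoding (say, UTF-8); you need to declare that encoding at the top of the source file. The encoding must be ASCII-compatible so that Python can read the declaration itself, which rules out UTF-16:

    # encoding: utf-8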
    
    ru = u'ル'
    

    where 'ル' is included without using an escape. The default encoding for Python 2 files is ASCII, so by declaring an encoding you make it possible to use Japanese directly.

  • Use byte string literals, ready encoded. Encode the codepoints by some other means and include them in your byte string literals. If all you are going to do with them is use them in encoded form anyway, this should be fine:

    ru = '\xeb\x30'  # ru encoded to UTF16 little-endian
    

    I encoded 'ル' to UTF-16 little-endian because that's the default Windows NTFS filename encoding.
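As a rough illustration of the three options above, the sketch below spells the same katakana character each way and checks that they agree. It assumes the file is saved as UTF-8 and that the interpreter accepts the b'' prefix (Python 2.6+ or 3.3+, so not 2.5 itself); the variable names are only for illustration.

```python
# -*- coding: utf-8 -*-
# Sketch only: the b'' prefix assumes Python 2.6+ or Python 3.3+.

ru_escaped = u'\u30EB'     # option 1: Unicode escape sequence
ru_literal = u'ル'         # option 2: literal character, relies on the coding line above
ru_bytes = b'\xeb\x30'     # option 3: bytes already encoded as UTF-16 little-endian

# All three spellings describe the same single codepoint.
same_character = (ru_escaped == ru_literal == ru_bytes.decode('utf-16-le'))

# The same character takes three bytes when encoded as UTF-8.
utf8_bytes = ru_escaped.encode('utf-8')
```

The byte string form only matches the other two once it is decoded with the encoding it was written in, which is why the third option is best kept for data that stays in encoded form.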

The next problem will be your terminal: the Windows console is notorious for not supporting many character sets out of the box. You probably want to configure it to handle UTF-8 instead. See this question for some details, but you need to run the following command in the console:

chcp 65001

to switch to UTF-8, and you may need to switch to a console font that can handle your codepoints (Lucida perhaps?).
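A string of question marks is what you would expect to see when text is encoded for a code page that lacks the characters. The sketch below reproduces that effect; encode_for_console is a hypothetical helper, and cp437 is only an assumed example of a legacy console code page, not necessarily the one in use on your machine.

```python
import sys


def encode_for_console(text, encoding=None):
    """Encode text for a console, substituting '?' for unencodable characters."""
    if encoding is None:
        # Fall back to ASCII when stdout does not report an encoding.
        encoding = getattr(sys.stdout, 'encoding', None) or 'ascii'
    return text.encode(encoding, 'replace')


# Three Japanese characters after an ASCII prefix.
path = u'E:\\Test\\\u306f\u6700\u9ad8'

legacy = encode_for_console(path, 'cp437')   # code page without Japanese
utf8 = encode_for_console(path, 'utf-8')     # lossless
```

Here legacy ends in one '?' per Japanese character, while utf8 decodes back to the original path, which is the difference switching the console to code page 65001 is meant to make.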

Solution 2

There are two independent issues:

  1. Specify the Python source encoding if you use non-ASCII characters, and use Unicode literals for data that represents text, e.g.:

    # -*- coding: utf-8 -*-
    path = ur"E:\Test\は最高のプログラマ"
    
  2. Printing Unicode to the Windows console is complicated, but if you set a font that contains the characters, then just:

    print path
    

    might work.

Regardless of whether your console can display the path, it should be fine to pass the Unicode path to filesystem functions, e.g.:

import os
entries = os.listdir(path)
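A self-contained way to try this without touching E:\Test is sketched below. It uses a temporary directory as a stand-in, and assumes the filesystem can store these names and that the temporary path itself is plain ASCII; the directory and file names are made up for the demonstration.

```python
import os
import shutil
import tempfile

base = tempfile.mkdtemp()    # scratch directory standing in for E:\Test
try:
    # Build a Unicode path and create a directory with a Japanese name.
    target = os.path.join(base, u'\u30eb')
    os.mkdir(target)

    # Put one file with a Japanese name inside it.
    with open(os.path.join(target, u'\u6700\u9ad8.txt'), 'w') as handle:
        handle.write('demo')

    # A Unicode path goes in, Unicode names come back out.
    entries = os.listdir(target)
finally:
    shutil.rmtree(base)      # clean up the scratch directory
```

Nothing is printed here on purpose: the listing works whether or not the console could display the names.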

Don't call .encode(char_enc) on bytestrings, call it on Unicode strings instead.
Don't call .decode(char_enc) on Unicode strings, call it on bytestrings instead.
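One way to follow both rules is to decode at the point where bytes enter the program and never decode twice. The helper below, to_text, is a hypothetical name for that guard; it assumes Python 2.6+ or 3, where the name bytes exists.

```python
def to_text(value, encoding='utf-8'):
    """Return value as a Unicode string, decoding only if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)    # bytes -> Unicode
    return value                         # already Unicode, leave untouched


raw = b'\xe3\x83\xab'                    # UTF-8 bytes, e.g. as read from a file
text = to_text(raw)                      # decoded once
again = to_text(text)                    # second call is a no-op
encoded = text.encode('utf-8')           # Unicode -> bytes
```

In Python 2, calling .encode() on a byte string first decodes it implicitly as ASCII, which is what fails on non-ASCII data; checking the type before converting avoids that path.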

Solution 3

You should force the string to be a unicode object, like this:

path = ur"E:\Test\は最高のプログラマ"

Docs on string literals relevant to 2.5 are located here

Edit: I'm not positive that the object is a unicode in 2.5, but the docs do state that \uXXXX[XXXX] will be processed and that the string will be "a Unicode string".

Author: user2030113

Updated on February 04, 2020

Comments

  • user2030113 over 4 years

    How do I correctly read Japanese and Chinese characters? I'm using Python 2.5. The output is displayed as "E:\Test\?????????"

    path = r"E:\Test\は最高のプログラマ"
    t = path.encode()
    print t
    u = path.decode()
    print u
    t = path.encode("utf-8")
    print t
    t = path.decode("utf-8")
    print t
    
  • Martijn Pieters over 11 years
    Python 2.5 supports that fine, but unicode objects are only a small part of the picture.
  • jfs over 11 years
  • Martijn Pieters over 11 years
    @J.F.Sebastian: Yeah, I have seen hints about the problem here and there, including the SO question I linked to. Thanks for that bug link, that's good to have.