How to find Chinese or Japanese characters in a string in Python?

python string unicode utf-8 character-encoding

19,784

Solution 1

As a start, you can check whether the character falls in one of the following Unicode blocks:

Unicode Block 'CJK Unified Ideographs' - U+4E00 to U+9FFF
Unicode Block 'CJK Unified Ideographs Extension A' - U+3400 to U+4DBF
Unicode Block 'CJK Unified Ideographs Extension B' - U+20000 to U+2A6DF
Unicode Block 'CJK Unified Ideographs Extension C' - U+2A700 to U+2B73F
Unicode Block 'CJK Unified Ideographs Extension D' - U+2B740 to U+2B81F

After that, iterate through the string, check whether each character is Chinese, Japanese or Korean (CJK), and collect the consecutive CJK runs:

# -*- coding:utf-8 -*-
ranges = [
  {"from": ord(u"\u3300"), "to": ord(u"\u33ff")},         # CJK Compatibility
  {"from": ord(u"\ufe30"), "to": ord(u"\ufe4f")},         # CJK Compatibility Forms
  {"from": ord(u"\uf900"), "to": ord(u"\ufaff")},         # CJK Compatibility Ideographs
  {"from": ord(u"\U0002F800"), "to": ord(u"\U0002fa1f")}, # CJK Compatibility Ideographs Supplement
  {"from": ord(u"\u3040"), "to": ord(u"\u309f")},         # Japanese Hiragana
  {"from": ord(u"\u30a0"), "to": ord(u"\u30ff")},         # Japanese Katakana
  {"from": ord(u"\u2e80"), "to": ord(u"\u2eff")},         # CJK Radicals Supplement
  {"from": ord(u"\u4e00"), "to": ord(u"\u9fff")},         # CJK Unified Ideographs
  {"from": ord(u"\u3400"), "to": ord(u"\u4dbf")},         # Extension A
  {"from": ord(u"\U00020000"), "to": ord(u"\U0002a6df")}, # Extension B
  {"from": ord(u"\U0002a700"), "to": ord(u"\U0002b73f")}, # Extension C
  {"from": ord(u"\U0002b740"), "to": ord(u"\U0002b81f")}, # Extension D
  {"from": ord(u"\U0002b820"), "to": ord(u"\U0002ceaf")}  # Extension E, included as of Unicode 8.0
]

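def is_cjk(char):
  code = ord(char)
  return any(r["from"] <= code <= r["to"] for r in ranges)

def cjk_substrings(string):
  i = 0
  while i < len(string):
    if is_cjk(string[i]):
      start = i
      # The bounds check keeps a trailing CJK run from raising IndexError.
      while i < len(string) and is_cjk(string[i]): i += 1
      yield string[start:i]
    i += 1

string = "sdf344asfasf天地方益3権sdfsdf".decode("utf-8")
for sub in cjk_substrings(string):
  # Note: replace() rewrites every occurrence of sub, so a run that
  # appears more than once in the input is wrapped more than once.
  string = string.replace(sub, "(" + sub + ")")
print string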

The above prints

sdf344asfasf(天地方益)3(権)sdfsdf

CJK Unified Ideographs Extension E (U+2B820 to U+2CEAF) was added in Unicode 8.0, released in June 2015, and is included in the ranges above. To stay future-proof, keep a lookout for further extensions in later Unicode versions.

[EDIT]

Added CJK compatibility ideographs, Japanese Kana and CJK radicals.
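
Because str.replace() rewrites every occurrence of a substring, one alternative is to build the output in a single pass over the string. The following is a minimal sketch, assuming Python 3 (where a str is a sequence of code points and no .decode() call is needed). The names CJK_RANGES and wrap_cjk are illustrative and not part of the answer above; the ranges themselves are copied from it.

```python
from itertools import groupby

# Inclusive code point ranges copied from the answer above.
CJK_RANGES = [
    (0x3300, 0x33FF), (0xFE30, 0xFE4F), (0xF900, 0xFAFF),
    (0x2F800, 0x2FA1F), (0x3040, 0x309F), (0x30A0, 0x30FF),
    (0x2E80, 0x2EFF), (0x4E00, 0x9FFF), (0x3400, 0x4DBF),
    (0x20000, 0x2A6DF), (0x2A700, 0x2B73F), (0x2B740, 0x2B81F),
    (0x2B820, 0x2CEAF),
]

def is_cjk(char):
    """Return True if the character falls in one of CJK_RANGES."""
    code = ord(char)
    return any(lo <= code <= hi for lo, hi in CJK_RANGES)

def wrap_cjk(text, left="(", right=")"):
    """Wrap each maximal run of CJK characters exactly once."""
    parts = []
    # groupby splits the text into alternating CJK and non-CJK runs.
    for cjk, group in groupby(text, key=is_cjk):
        run = "".join(group)
        parts.append(left + run + right if cjk else run)
    return "".join(parts)

print(wrap_cjk("sdf344asfasf天地方益3権sdfsdf"))  # sdf344asfasf(天地方益)3(権)sdfsdf
print(wrap_cjk("天a天"))                          # (天)a(天)
```

Since each run is visited once, a run that repeats later in the input is not wrapped a second time.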

Solution 2

You can do the edit using the regex package, which supports checking the Unicode "Script" property of each character and is a drop-in replacement for the re package:

import regex as re

pattern = re.compile(r'([\p{IsHan}\p{IsBopo}\p{IsHira}\p{IsKatakana}]+)', re.UNICODE)

input = u'sdf344asfasf天地方益3権sdfsdf'
output = pattern.sub(r'(\1)', input)
print output  # Prints: sdf344asfasf(天地方益)3(権)sdfsdf

Adjust the \p{Is...} sequences to match the scripts or blocks that you consider to be "Chinese or Japanese".

Solution 3

From a bleeding-edge branch of NLTK, inspired by the Moses Machine Translation Toolkit:

def is_cjk(character):
    """
    Checks whether character is CJK.

        >>> is_cjk(u'\u33fe')
        True
        >>> is_cjk(u'\uFE5F')
        False

    :param character: The character that needs to be checked.
    :type character: char
    :return: bool
    """
    return any([start <= ord(character) <= end for start, end in 
                [(4352, 4607), (11904, 42191), (43072, 43135), (44032, 55215), 
                 (63744, 64255), (65072, 65103), (65381, 65500), 
                 (131072, 196607)]
                ])

For the specifics of the ord() numbers:

class CJKChars(object):
    """
    An object that enumerates the code points of the CJK characters as listed on
    http://en.wikipedia.org/wiki/Basic_Multilingual_Plane#Basic_Multilingual_Plane

    This is a Python port of the CJK code point enumerations of Moses tokenizer:
    https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/detokenizer.perl#L309
    """
    # Hangul Jamo (1100–11FF)
    Hangul_Jamo = (4352, 4607) # (ord(u"\u1100"), ord(u"\u11ff"))

    # CJK Radicals Supplement (2E80–2EFF)
    # Kangxi Radicals (2F00–2FDF)
    # Ideographic Description Characters (2FF0–2FFF)
    # CJK Symbols and Punctuation (3000–303F)
    # Hiragana (3040–309F)
    # Katakana (30A0–30FF)
    # Bopomofo (3100–312F)
    # Hangul Compatibility Jamo (3130–318F)
    # Kanbun (3190–319F)
    # Bopomofo Extended (31A0–31BF)
    # CJK Strokes (31C0–31EF)
    # Katakana Phonetic Extensions (31F0–31FF)
    # Enclosed CJK Letters and Months (3200–32FF)
    # CJK Compatibility (3300–33FF)
    # CJK Unified Ideographs Extension A (3400–4DBF)
    # Yijing Hexagram Symbols (4DC0–4DFF)
    # CJK Unified Ideographs (4E00–9FFF)
    # Yi Syllables (A000–A48F)
    # Yi Radicals (A490–A4CF)
    CJK_Radicals = (11904, 42191) # (ord(u"\u2e80"), ord(u"\ua4cf"))

    # Phags-pa (A840–A87F)
    Phags_Pa = (43072, 43135) # (ord(u"\ua840"), ord(u"\ua87f"))

    # Hangul Syllables (AC00–D7AF)
    Hangul_Syllables = (44032, 55215) # (ord(u"\uAC00"), ord(u"\uD7AF"))

    # CJK Compatibility Ideographs (F900–FAFF)
    CJK_Compatibility_Ideographs = (63744, 64255) # (ord(u"\uF900"), ord(u"\uFAFF"))

    # CJK Compatibility Forms (FE30–FE4F)
    CJK_Compatibility_Forms = (65072, 65103) # (ord(u"\uFE30"), ord(u"\uFE4F"))

    # Range U+FF65–FFDC encodes halfwidth forms, of Katakana and Hangul characters
    Katakana_Hangul_Halfwidth = (65381, 65500) # (ord(u"\uFF65"), ord(u"\uFFDC"))

    # Supplementary Ideographic Plane 20000–2FFFF
    Supplementary_Ideographic_Plane = (131072, 196607) # (ord(u"\U00020000"), ord(u"\U0002FFFF"))

    ranges = [Hangul_Jamo, CJK_Radicals, Phags_Pa, Hangul_Syllables, 
              CJK_Compatibility_Ideographs, CJK_Compatibility_Forms, 
              Katakana_Hangul_Halfwidth, Supplementary_Ideographic_Plane]
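
To see which of these ranges a given character actually falls in, the same numbers can be written as hexadecimal code points and looked up by name. This is only a sketch, assuming Python 3.7+ (for ordered dicts); NAMED_RANGES and cjk_range_name are hypothetical helpers, not part of NLTK.

```python
# Same ranges as in the class above, written as hexadecimal code points.
NAMED_RANGES = {
    "Hangul_Jamo": (0x1100, 0x11FF),
    "CJK_Radicals": (0x2E80, 0xA4CF),
    "Phags_Pa": (0xA840, 0xA87F),
    "Hangul_Syllables": (0xAC00, 0xD7AF),
    "CJK_Compatibility_Ideographs": (0xF900, 0xFAFF),
    "CJK_Compatibility_Forms": (0xFE30, 0xFE4F),
    "Katakana_Hangul_Halfwidth": (0xFF65, 0xFFDC),
    "Supplementary_Ideographic_Plane": (0x20000, 0x2FFFF),
}

def cjk_range_name(character):
    """Return the name of the first range containing character, or None."""
    code = ord(character)
    for name, (start, end) in NAMED_RANGES.items():
        if start <= code <= end:
            return name
    return None

def is_cjk(character):
    """Equivalent to the range check above, expressed via the lookup."""
    return cjk_range_name(character) is not None

print(cjk_range_name("天"))  # CJK_Radicals
print(cjk_range_name("한"))  # Hangul_Syllables
print(cjk_range_name("s"))   # None
```

The range labelled CJK_Radicals is deliberately broad: as the comments in the class show, it spans everything from U+2E80 to U+A4CF, including Hangul Compatibility Jamo and the Yi blocks, so it accepts more than Chinese and Japanese characters.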

Combining the is_cjk() in this answer with cjk_substrings() from @EvenLisle's answer:

>>> from nltk.tokenize.util import is_cjk
>>> text = u'sdf344asfasf天地方益3権sdfsdf'
>>> [1 if is_cjk(ch) else 0 for ch in text]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
>>> def cjk_substrings(string):
...     i = 0
...     while i<len(string):
...         if is_cjk(string[i]):
...             start = i
...             while i < len(string) and is_cjk(string[i]): i += 1
...             yield string[start:i]
...         i += 1
... 
>>> string = "sdf344asfasf天地方益3権sdfsdf".decode("utf-8")
>>> for sub in cjk_substrings(string):
...     string = string.replace(sub, "(" + sub + ")")
... 
>>> string
u'sdf344asfasf(\u5929\u5730\u65b9\u76ca)3(\u6a29)sdfsdf'
>>> print string
sdf344asfasf(天地方益)3(権)sdfsdf

Solution 4

If you can't use the regex module, which provides access to the IsKatakana and IsHan properties as shown in @一二三's answer, you could use the character ranges from @EvenLisle's answer with the stdlib's re module:

>>> import re
>>> print(re.sub(u"([\u3300-\u33ff\ufe30-\ufe4f\uf900-\ufaff\U0002f800-\U0002fa1f\u3040-\u309f\u30a0-\u30ff\u2e80-\u2eff\u4e00-\u9fff\u3400-\u4dbf\U00020000-\U0002a6df\U0002a700-\U0002b73f\U0002b740-\U0002b81f\U0002b820-\U0002ceaf]+)", r"(\1)", u'sdf344asfasf天地方益3権sdfsdf'))
sdf344asfasf(天地方益)3(権)sdfsdf

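Beware of known issues. For example, on narrow Python 2 builds, characters outside the Basic Multilingual Plane are stored as surrogate pairs, so the \U000xxxxx ranges in the character class do not behave as intended.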
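
Rather than typing the character class by hand, it can be generated from a list of ranges. The following is a rough sketch, assuming Python 3 (where every build handles code points above U+FFFF uniformly). CJK_RANGES and build_cjk_pattern are illustrative names, and the ranges are the ones from @EvenLisle's answer.

```python
import re

# Inclusive code point ranges from @EvenLisle's answer.
CJK_RANGES = [
    (0x3300, 0x33FF), (0xFE30, 0xFE4F), (0xF900, 0xFAFF),
    (0x2F800, 0x2FA1F), (0x3040, 0x309F), (0x30A0, 0x30FF),
    (0x2E80, 0x2EFF), (0x4E00, 0x9FFF), (0x3400, 0x4DBF),
    (0x20000, 0x2A6DF), (0x2A700, 0x2B73F), (0x2B740, 0x2B81F),
    (0x2B820, 0x2CEAF),
]

def build_cjk_pattern(ranges):
    """Compile a pattern that captures one or more characters in the ranges."""
    # re.escape guards against endpoints that are special inside [...].
    char_class = "".join(
        "{}-{}".format(re.escape(chr(lo)), re.escape(chr(hi)))
        for lo, hi in ranges
    )
    return re.compile("([" + char_class + "]+)")

CJK_RUN = build_cjk_pattern(CJK_RANGES)
print(CJK_RUN.sub(r"(\1)", "sdf344asfasf天地方益3権sdfsdf"))
# sdf344asfasf(天地方益)3(権)sdfsdf
```

Keeping the ranges in a list makes it easier to add or drop a block without editing a long pattern string.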

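You could also check the Unicode general category, although 'Lo' (Letter, other) also covers many non-CJK scripts, so it is only a coarse filter: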

>>> import unicodedata
>>> unicodedata.category(u'天')
'Lo'
>>> unicodedata.category(u's')
'Ll'
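
The general category alone cannot single out CJK text, because letters of many other scripts are also 'Lo'. One possible refinement, sketched here under the assumption of Python 3, is to look at the Unicode character name instead. The prefix list and the function name is_chinese_or_japanese are illustrative choices, not an established API, and the result depends on the Unicode database version bundled with the interpreter.

```python
import unicodedata

# Name prefixes treated as "Chinese or Japanese" in this sketch; adjust
# the tuple to whatever your application considers relevant.
CJ_NAME_PREFIXES = (
    "CJK UNIFIED IDEOGRAPH",
    "CJK COMPATIBILITY IDEOGRAPH",
    "HIRAGANA",
    "KATAKANA",
)

def is_chinese_or_japanese(char):
    """Classify a character by the prefix of its Unicode character name."""
    name = unicodedata.name(char, "")  # "" for code points without a name
    return name.startswith(CJ_NAME_PREFIXES)

# Both the ideograph and the Hebrew letter alef are category 'Lo'.
print(unicodedata.category("天"), unicodedata.category("\u05d0"))

# Only the ideograph and the kana pass the name-based check.
print([c for c in "sdf天あア\u05d0" if is_chinese_or_japanese(c)])
```

This avoids hard-coding code point ranges, at the cost of one name lookup per character.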


Author by

Sam

Updated on June 15, 2022

Comments

Sam almost 2 years

Such as:

str = 'sdf344asfasf天地方益3権sdfsdf'

Wrap the Chinese and Japanese characters in ():

strAfterConvert = 'sdf344asfasf(天地方益)3(権)sdfsdf'
