Java regex with Unicode support?

Solution 1

What you are looking for are Unicode properties.

For example, \p{L} matches any kind of letter from any language.

So a regex to match such a Chinese word could be something like

\p{L}+

There are many such properties. For more details, see regular-expressions.info.
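As a minimal sketch of the \p{L}+ idea, the class below checks whether a whole string consists of letters. The class and method names are made up for illustration, and the Chinese sample from the question is written with \u escapes so the source file encoding does not matter.

```java
import java.util.regex.Pattern;

class UnicodeLetters {
    // \p{L} matches any code point whose Unicode general category is "letter"
    static final Pattern LETTERS = Pattern.compile("\\p{L}+");

    // Hypothetical helper: true if the whole input is one or more letters
    static boolean isAllLetters(String s) {
        return LETTERS.matcher(s).matches();
    }

    public static void main(String[] args) {
        String chinese = "\u73AF\u4FDD\u90E8"; // the three characters from the question
        System.out.println(isAllLetters(chinese));          // true
        System.out.println(isAllLetters("abc"));            // true
        System.out.println(isAllLetters(chinese + "123"));  // false, digits are not letters
    }
}
```

Note that matches() tests the entire input; use find() instead to locate letter runs inside longer text.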

Another option is the Pattern.UNICODE_CHARACTER_CLASS flag.

Java 7 introduced this flag, which enables the Unicode versions of the predefined character classes. See my answer here for some more details and links.

You could do something like this

Pattern p = Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);

With this flag, \w matches letters and digits from any language, along with combining marks and connector punctuation such as _.
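To make the effect of the flag concrete, here is a small comparison of \w+ compiled with and without Pattern.UNICODE_CHARACTER_CLASS. The class name is invented for the example; the flag itself requires Java 7 or later.

```java
import java.util.regex.Pattern;

class UnicodeWordDemo {
    // Default: \w is limited to [a-zA-Z_0-9]
    static final Pattern ASCII_WORD = Pattern.compile("\\w+");
    // With the flag, \w follows the Unicode definition of a word character
    static final Pattern UNICODE_WORD =
            Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);

    public static void main(String[] args) {
        String chinese = "\u73AF\u4FDD\u90E8"; // the three characters from the question
        System.out.println(ASCII_WORD.matcher(chinese).matches());   // false
        System.out.println(UNICODE_WORD.matcher(chinese).matches()); // true
    }
}
```

The same behaviour can be switched on inside the expression with the embedded flag (?U), for example "(?U)\\w+".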

Solution 2

To add NLS (national language support) while still rejecting ASCII special characters, we can use the pattern below:

[a-zA-Z0-9 \u0080-\u9fff]*+

For a Unicode code point reference, see http://www.utf8-chartable.de/unicode-utf8-table.pl

Code snippet:

    String vowels = "అఆఇఈఉఊఋఌఎఏఐఒఓఔౠౡ";
    String consonants = "కఖగఘఙచఛజఝఞటఠడఢణతథదధనపఫబభమయరఱలళవశషసహ";
    String signsAndPunctuations = "కఁకంకఃకాకికీకుకూకృకౄకెకేకైకొకోకౌక్కౕకౖ";
    String symbolsAndNumerals = "౦౧౨౩౪౫౬౭౮౯";
    String engChinesStr = "ABC導字會";


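    // Requires: import java.util.regex.Pattern;

    // ASCII letters, digits, space, and the Telugu block (U+0C00 to U+0C7F)
    Pattern ALPHANUMERIC_AND_SPACE_PATTERN_TELUGU = Pattern
            .compile("[a-zA-Z0-9 \\u0c00-\\u0c7f]*+");
    System.out.println(ALPHANUMERIC_AND_SPACE_PATTERN_TELUGU.matcher(vowels)
            .matches());

    // ASCII letters, digits, space, and CJK Unified Ideographs (U+4E00 to U+9FFF)
    Pattern ALPHANUMERIC_AND_SPACE_PATTERN_CHINESE = Pattern
            .compile("[a-zA-Z0-9 \\u4e00-\\u9fff]*+");

    // ASCII letters, digits, space, and every code point from U+0080 to U+9FFF
    Pattern ENGLISH_ALPHANUMERIC_SPACE_AND_NLS_PATTERN = Pattern
            .compile("[a-zA-Z0-9 \\u0080-\\u9fff]*+");

    System.out.println(ENGLISH_ALPHANUMERIC_SPACE_AND_NLS_PATTERN.matcher(engChinesStr)
            .matches());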
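One caveat about the broad range: U+0080 to U+9FFF also contains many symbols and punctuation marks, not only letters. The sketch below contrasts the range-based pattern from the snippet above with a property-based alternative; that alternative and the class name are my own assumptions, not part of the original answer.

```java
import java.util.regex.Pattern;

class RangeVersusProperty {
    // The range-based pattern from the snippet above
    static final Pattern RANGE =
            Pattern.compile("[a-zA-Z0-9 \\u0080-\\u9fff]*+");
    // Assumed alternative: letters, decimal digits, and space in any script
    static final Pattern PROPERTY =
            Pattern.compile("[\\p{L}\\p{Nd} ]*+");

    public static void main(String[] args) {
        String mixed = "ABC\u5C0E\u5B57\u6703"; // same text as engChinesStr
        String withSymbol = "2 \u00D7 3";       // contains the multiplication sign U+00D7

        System.out.println(RANGE.matcher(mixed).matches());         // true
        System.out.println(PROPERTY.matcher(mixed).matches());      // true
        System.out.println(RANGE.matcher(withSymbol).matches());    // true, U+00D7 is inside the range
        System.out.println(PROPERTY.matcher(withSymbol).matches()); // false, U+00D7 is a math symbol
    }
}
```

Whether the stricter behaviour is wanted depends on what counts as a "special character" for the application.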

Solution 3

To match individual characters, you can simply include them in a character class, either as literals or via the \u03FB syntax.

Obviously, you often cannot list every allowed character, especially for ideographic scripts. To make the regex treat Unicode characters according to their type or code block, use the other escapes defined in the java.util.regex.Pattern documentation. Look at the section "Unicode support", particularly the references to the Character class and to the Unicode Standard itself.
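A rough illustration of both approaches: an explicit character class using the \u03FB escape, and a script-based escape. \p{IsHan} is the script syntax available since Java 7; the class and field names are invented for this sketch.

```java
import java.util.regex.Pattern;

class ScriptAndEscapeDemo {
    // Explicit list: Greek small letter san (U+03FB) plus ASCII lowercase letters
    static final Pattern LISTED = Pattern.compile("[\\u03FBa-z]+");
    // Script-based: any run of characters belonging to the Han script
    static final Pattern HAN = Pattern.compile("\\p{IsHan}+");

    public static void main(String[] args) {
        System.out.println(LISTED.matcher("a\u03FBb").matches());          // true
        System.out.println(HAN.matcher("\u73AF\u4FDD\u90E8").matches());   // true
        System.out.println(HAN.matcher("abc").matches());                  // false
    }
}
```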

Solution 4

  • the Java regular expression API works on the char type
  • the char type is implicitly UTF-16
  • if you have UTF-8 data you will need to transcode it to UTF-16 on input if this is not already being done
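The transcoding step in the last bullet can be sketched as follows, assuming the input arrives as raw UTF-8 bytes. The helper name lettersOnly is hypothetical; the String constructor taking a Charset does the decoding to UTF-16.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

class Utf8InputDemo {
    // Hypothetical helper: decode UTF-8 bytes, then test the resulting String
    static boolean lettersOnly(byte[] utf8Bytes) {
        String text = new String(utf8Bytes, StandardCharsets.UTF_8);
        return Pattern.matches("\\p{L}+", text);
    }

    public static void main(String[] args) {
        // UTF-8 encoding of the three Chinese characters from the question
        byte[] bytes = {
            (byte) 0xE7, (byte) 0x8E, (byte) 0xAF,
            (byte) 0xE4, (byte) 0xBF, (byte) 0x9D,
            (byte) 0xE9, (byte) 0x83, (byte) 0xA8
        };
        System.out.println(lettersOnly(bytes)); // true
    }
}
```

If the text already comes from a Reader, a Scanner, or a servlet parameter with the correct encoding configured, the decoding has typically been done already and no extra step is needed.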

Unicode is the universal set of characters and UTF-8 can describe all of it (including control characters, punctuation, symbols, letters, etc.). You will have to be more specific about what you want to include and what you want to exclude. Java regular expressions use the \p{category} syntax to match code points by category. See the Unicode standard for the list of categories.
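As an illustration of choosing what to include and exclude, the sketch below sorts single characters into a few general categories with \p{L}, \p{N}, and \p{P}. The describe helper is invented for this example and only covers three of the available categories.

```java
class CategoryDemo {
    // Hypothetical helper: names the general category of a one-character string
    static String describe(String ch) {
        if (ch.matches("\\p{L}")) return "letter";
        if (ch.matches("\\p{N}")) return "number";
        if (ch.matches("\\p{P}")) return "punctuation";
        return "other";
    }

    public static void main(String[] args) {
        System.out.println(describe("\u73AF")); // letter (a Han ideograph)
        System.out.println(describe("\u0C66")); // number (Telugu digit zero)
        System.out.println(describe("\u3002")); // punctuation (ideographic full stop)
        System.out.println(describe(" "));      // other (space is a separator)
    }
}
```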

If you want to identify and separate words in a sequence of ideographs, you will need to look at a more sophisticated API. I would start with the BreakIterator type.
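A starting point with BreakIterator might look like the sketch below. The segments helper is hypothetical, and how closely the reported boundaries match real words in ideographic text depends on the rules the JDK ships for the chosen locale, so treat the output as something to inspect rather than rely on.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordBreakDemo {
    // Hypothetical helper: splits text at every word boundary the iterator reports
    static List<String> segments(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "\u73AF\u4FDD\u90E8 hello world";
        // Segments include whitespace and punctuation pieces as well as words
        for (String part : segments(text, Locale.CHINESE)) {
            System.out.println("[" + part + "]");
        }
    }
}
```

Because the iterator also reports boundaries around spaces and punctuation, callers usually filter the segments, for example by keeping only those that contain a letter.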

Author by

cometta

Updated on March 09, 2020

Comments

  • cometta
    cometta about 4 years

    To match A to Z, we will use regex:

    [A-Za-z]

    How can a regex match UTF-8 characters entered by the user, for example Chinese words like 环保部?

  • cometta
    cometta almost 12 years
    How do I match multiple UTF-8 characters entered by the user, for example 环保部? The user will enter an arbitrary number of characters.
  • Kilian Foth
    Kilian Foth almost 12 years
    It's just like matching multiple Latin characters: [a-z]+ or [a-z]{3} or even [a-z]{2,10}. The only thing different is what you allow in the character class that the quantifier applies to.
  • Dave Jarvis
    Dave Jarvis over 3 years
    To match words like Da̱nx̱a̱laga̱litła̱n, do we need to instruct the pattern matcher to combine the diacritics?