How can I detect Japanese text in a Java string?

java unicode character-encoding

11,223

Solution 1

I use the following Java method, though it might not completely address your requirement.

/**
 * Returns whether a character is one of the Chinese-Japanese-Korean (CJK) characters.
 *
 * @param c
 *            the character to be tested
 * @return true if CJK, false otherwise
 */
private boolean isCharCJK(final char c) {
    // Look up the block once instead of once per comparison.
    final Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
    return block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
            // Extension B starts at U+20000, outside the BMP, so a single char never matches it.
            || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B
            || block == Character.UnicodeBlock.CJK_COMPATIBILITY_FORMS
            || block == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS
            || block == Character.UnicodeBlock.CJK_RADICALS_SUPPLEMENT
            || block == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
            || block == Character.UnicodeBlock.ENCLOSED_CJK_LETTERS_AND_MONTHS;
}

Furthermore, these should work for Hiragana and Katakana characters:

private boolean isHiragana(final char c) {
    return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.HIRAGANA;
}

private boolean isKatakana(final char c) {
    return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.KATAKANA;
}
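The per-character checks above can be applied to a whole string. The following is a minimal sketch, not part of the original answer: the class name `JapaneseBlockScan` and both method names are hypothetical, and the choice of which blocks count as "Japanese" is an assumption. It iterates over code points rather than `char` values, so ideographs outside the BMP (Extension B) can also be matched.

```java
final class JapaneseBlockScan {

    /** Hypothetical helper: true if the code point is in a kana or CJK ideograph block. */
    static boolean isJapaneseBlock(final int codePoint) {
        final Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return block == Character.UnicodeBlock.HIRAGANA
                || block == Character.UnicodeBlock.KATAKANA
                || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
                || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B;
    }

    /** True if at least one code point of the text falls in one of the blocks above. */
    static boolean containsJapaneseBlock(final String text) {
        // codePoints() (Java 8+) joins surrogate pairs into single code points.
        return text.codePoints().anyMatch(JapaneseBlockScan::isJapaneseBlock);
    }
}
```

Note that the ideograph blocks are shared with Chinese, so a hit on those blocks alone does not prove the text is Japanese.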

Solution 2

According to regular-expressions.info, Japanese isn't written in a single script: "There is no Japanese Unicode script. Instead, Unicode offers the Hiragana, Katakana, Han and Latin scripts that Japanese documents are usually composed of."

In that case, this regex should do the trick (Java's regex syntax needs the Is prefix on script names, supported since Java 7):

yourString.matches("[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}\\p{IsLatin}]*+")
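As a hedged illustration of how script-based matching behaves, here is a small sketch assuming Java 7 or later, where scripts are written as `\p{IsHiragana}` and so on. The class name `JapaneseRegex` and its method names are hypothetical. It contrasts a full match against the four scripts with a looser check that only looks for at least one kana character.

```java
import java.util.regex.Pattern;

final class JapaneseRegex {

    // Every character must be Hiragana, Katakana, Han, or Latin.
    private static final Pattern ONLY_LISTED_SCRIPTS =
            Pattern.compile("[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}\\p{IsLatin}]*+");

    // At least one Hiragana or Katakana character somewhere in the text.
    private static final Pattern ANY_KANA =
            Pattern.compile("[\\p{IsHiragana}\\p{IsKatakana}]");

    static boolean usesOnlyListedScripts(final String text) {
        return ONLY_LISTED_SCRIPTS.matcher(text).matches();
    }

    static boolean containsKana(final String text) {
        return ANY_KANA.matcher(text).find();
    }
}
```

Because Latin is in the character class, `usesOnlyListedScripts("hello")` is true, while `"hello world"` fails because the space belongs to the Common script. Han characters alone are also used in Chinese, so `containsKana` is the stricter signal that a string is Japanese, at the cost of missing text written purely in kanji.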

Author by

David G

Updated on June 07, 2022

Comments

David G almost 2 years

I need to be able to detect Japanese characters in a Java string.

Currently I'm getting the UnicodeBlock and checking to see if it's equal to Character.UnicodeBlock.KATAKANA or Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS, but I'm not 100% sure that's going to cover everything.

Any suggestions?

David G over 14 years

Sorry, I wasn't precise enough ... I want to detect Japanese CHARACTERS in a string, not the character set name.

Kathy Van Stone over 14 years

Including Latin will match most European languages as well, which I don't think is what the OP wants to check for (although Japanese is sometimes written with Roman characters as well).

Kathy Van Stone over 14 years

Han are Chinese characters as well, but I believe you do want to add Hiragana.

Igor Mironenko over 13 years

That's right, there's no way to really know. This character in a string, 本, could be part of Chinese or Japanese text. And it's neither hiragana nor katakana of any width.

Jiechao Wang almost 6 years

This seems to fail to detect some Japanese and Korean characters. I ended up combining this with gist.github.com/TheFinestArtist/2fd1b4aa1d4824fcbaef
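
One reason block-based checks can miss or over-match characters is that half-width katakana sit in the HALFWIDTH_AND_FULLWIDTH_FORMS block, which also holds full-width Latin letters. The sketch below assumes Java 7 or later and uses `Character.UnicodeScript` instead of blocks. The class name `KanaScriptCheck` and the method `isKana` are hypothetical, not from the thread or the linked gist.

```java
final class KanaScriptCheck {

    /** True for hiragana and for katakana of any width, based on the Unicode script. */
    static boolean isKana(final int codePoint) {
        final Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
        return script == Character.UnicodeScript.HIRAGANA
                || script == Character.UnicodeScript.KATAKANA;
    }

    /** True if the code point lies in the mixed half-width/full-width block. */
    static boolean isInWidthFormsBlock(final int codePoint) {
        return Character.UnicodeBlock.of(codePoint)
                == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS;
    }
}
```

For example, U+FF76 (half-width katakana KA) and U+FF21 (full-width Latin A) are both in the width-forms block, but only the first is reported as kana by the script check.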