How to convert UTF-8 to US-Ascii in Java


Solution 1

The uni2ascii program is written in C, but you could probably convert it to Java with little effort. It contains a large table of approximations (implicitly, in the switch-case statements).

Be aware that there are no universally accepted approximations: Germans want you to replace Ä with AE, whereas Finns and Swedes prefer just A. Your example of Å isn't obvious either: Swedes would probably just drop the ring and use A, but Danes and Norwegians might prefer the historically more correct AA.
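To make the language dependence concrete, here is a minimal sketch of a per-language override table. The class name LocaleAwareFolding, the method fold, and the table entries are all hypothetical and far from complete; the '?' fallback for unmapped characters simply mirrors the approach described in the question.

```java
import java.util.Locale;
import java.util.Map;

final class LocaleAwareFolding {
    // Illustrative overrides only, keyed by ISO 639 language code.
    // Characters are written as Unicode escapes so the source stays pure ASCII.
    private static final Map<String, Map<Character, String>> OVERRIDES = Map.of(
        // German: A/O/U with diaeresis become AE/OE/UE, sharp s becomes ss
        "de", Map.of('\u00C4', "AE", '\u00D6', "OE", '\u00DC', "UE",
                     '\u00E4', "ae", '\u00F6', "oe", '\u00FC', "ue", '\u00DF', "ss"),
        // Danish: A with ring becomes AA
        "da", Map.of('\u00C5', "AA", '\u00E5', "aa"),
        // Swedish: just drop the ring or the dots
        "sv", Map.of('\u00C5', "A", '\u00E5', "a", '\u00C4', "A",
                     '\u00E4', "a", '\u00D6', "O", '\u00F6', "o")
    );

    // Replaces each non-ASCII char using the table for the given language,
    // falling back to "?" when the table has no entry for it.
    static String fold(String text, Locale locale) {
        Map<Character, String> table =
            OVERRIDES.getOrDefault(locale.getLanguage(), Map.of());
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                out.append(c);                          // already 7-bit ASCII
            } else {
                out.append(table.getOrDefault(c, "?")); // override or fallback
            }
        }
        return out.toString();
    }
}
```

With this table, the same Å comes out as "AA" for Locale.forLanguageTag("da") and as "A" for Locale.forLanguageTag("sv"), which is exactly the ambiguity described above.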

Solution 2

You can do this with the following (from the NFD example in this Core Java Technology Tech Tip):

// Decompose to NFD, then remove the combining diacritical marks left behind.
public static String decompose(String s) {
    return java.text.Normalizer.normalize(s, java.text.Normalizer.Form.NFD)
            .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}

Solution 3

Instead of creating your own table, you could convert the text to normalization form D, where characters are represented as a base character plus diacritics (for instance, "á" will be replaced by "a" followed by a combining acute accent). You can then strip everything that is not an ASCII character.

The tables still exist, but are now the ones from the Unicode standard.

You could also try NFKD instead of NFD, to catch even more cases.
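A minimal sketch of that idea, using only java.text.Normalizer and regular expressions from the JDK. The class name AsciiApproximation and the method toAscii are hypothetical. Replacing leftover characters with '?' rather than silently dropping them is an assumption borrowed from the question's fallback.

```java
import java.text.Normalizer;

final class AsciiApproximation {
    static String toAscii(String s) {
        // NFKD also unfolds compatibility characters, e.g. the "fi" ligature.
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
        // Remove the combining marks (Unicode category M) that decomposition exposed.
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // Anything still outside ASCII has no decomposition; mark it with '?'.
        return stripped.replaceAll("[^\\p{ASCII}]", "?");
    }
}
```

Note that letters such as ø, ß, æ and Ł have no Unicode decomposition, so normalization alone will not map them to ASCII; they end up as '?' here and would still need a small hand-written table.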


Solution 4

In response to the answer given by Joe Liversedge: the referenced Lucene ISOLatin1AccentFilter no longer exists.

It has been replaced by org.apache.lucene.analysis.ASCIIFoldingFilter:

This class converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists. Characters from the following Unicode blocks are converted; however, only those characters with reasonable ASCII alternatives are converted.


Solution 5

This is typically useful in search applications. See the corresponding Lucene ISOLatin1AccentFilter implementation. This isn't really designed for plugging into a random local implementation, but does the trick.

Author: Ulf Lindback

Updated on December 29, 2020

Comments

  • Ulf Lindback over 3 years

    We have a system where customers, mainly European, enter texts (in UTF-8) that have to be distributed to different systems. Most of them accept UTF-8, but now we must also distribute the texts to a US system that only accepts 7-bit US-ASCII.

    So now we need to translate all European characters to the nearest US-ASCII equivalent. Are there any Java libraries to help with this task?

    Right now we've just started adding to a translation table, where Å (Swedish AA) -> A and so on. Where we don't find a match for an entered character, we log it, replace it with a question mark, and try to fix that for the next release. This seems very inefficient, though, and somebody else must have done something similar before.