Replacing Unicode punctuation with ASCII approximations
Solution 1
Each Unicode character is assigned a category. There are two separate categories for quotes:
- Punctuation, Final quote (may behave like Ps or Pe depending on usage)
- Punctuation, Initial quote (may behave like Ps or Pe depending on usage)
With these lists, you should be able to handle all quotes appropriately, if you would like to code the regex manually.
Java's Character.getType gives you the category of a character, for example FINAL_QUOTE_PUNCTUATION.
Now you can get the category of each (punctuation) character and replace it with an appropriate ASCII substitute.
You can use the other punctuation categories accordingly. In 'Punctuation, Other' there are some characters, for example PRIME (′), which you may also want to substitute with an apostrophe.
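A minimal Java sketch of the category-based approach described above. The class and method names are hypothetical, and deciding between a single and a double quote by looking for "DOUBLE" in the character's Unicode name is an assumption of this sketch, not something the answer prescribes. Low-9 quotes such as U+201A fall into the "Punctuation, Open" category and are not handled here.

```java
final class QuoteCategories {

    /** Replaces non-ASCII quotes, dashes and PRIME with ASCII, chosen by Unicode category. */
    static String toAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                out.append(c); // already ASCII
                continue;
            }
            switch (Character.getType(c)) {
                case Character.INITIAL_QUOTE_PUNCTUATION:
                case Character.FINAL_QUOTE_PUNCTUATION:
                    out.append(isDoubleQuote(c) ? '"' : '\'');
                    break;
                case Character.DASH_PUNCTUATION:
                    out.append('-');
                    break;
                case Character.OTHER_PUNCTUATION:
                    // Assumption: only PRIME is mapped here; extend as needed.
                    out.append(c == '\u2032' ? '\'' : c);
                    break;
                default:
                    out.append(c); // left untouched for a later pass
            }
        }
        return out.toString();
    }

    /** Assumption: a quote counts as double when its Unicode name says so. */
    private static boolean isDoubleQuote(char c) {
        String name = Character.getName(c);
        return name != null && name.contains("DOUBLE");
    }

    public static void main(String[] args) {
        System.out.println(toAscii("\u201Cgirl\u2019s\u201D \u2013 5\u2032"));
    }
}
```

Characters outside the handled categories pass through unchanged, so this can run before a final step that strips whatever non-ASCII text remains.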
Solution 2
I found a pretty extensive table that maps Unicode punctuation to their closest ASCII equivalents.
Here's more info: Map Symbols & Punctuation to ASCII.
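As a rough illustration of how such a table could be applied in Java, here is a sketch that looks each character up in a map. The class name is hypothetical, and the handful of entries is only a small excerpt (the same code points appear in the longer list in Solution 3), not the full table.

```java
import java.util.HashMap;
import java.util.Map;

final class PunctuationTable {

    // Small excerpt of a Unicode-to-ASCII punctuation table (assumed subset).
    private static final Map<Character, Character> TABLE = new HashMap<>();
    static {
        TABLE.put('\u2018', '\''); // LEFT SINGLE QUOTATION MARK
        TABLE.put('\u2019', '\''); // RIGHT SINGLE QUOTATION MARK
        TABLE.put('\u201C', '"');  // LEFT DOUBLE QUOTATION MARK
        TABLE.put('\u201D', '"');  // RIGHT DOUBLE QUOTATION MARK
        TABLE.put('\u2013', '-');  // EN DASH
        TABLE.put('\u2014', '-');  // EM DASH
        TABLE.put('\u2032', '\''); // PRIME
        TABLE.put('\u2212', '-');  // MINUS SIGN
    }

    /** Replaces every character found in the table; all others are kept. */
    static String toAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            char mapped = TABLE.getOrDefault(c, c);
            out.append(mapped);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toAscii("\u201Cgirl\u2019s\u201D \u2014 ok"));
    }
}
```

Because every entry maps one character to one character, the string length is preserved.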
Solution 3
I followed @marek-stoj's link and created a Scala application that cleans Unicode out of strings while maintaining the string length. It removes diacritics (accents) and uses the map suggested by @marek-stoj to convert non-ASCII Unicode characters to their ASCII approximations.
import java.text.Normalizer
object Asciifier {
def apply(string: String): String = {
  var cleaned = string
  for ((unicode, ascii) <- substitutions) {
    // replace() is literal; replaceAll() would parse "\\" in the
    // replacement as a regex escape and throw on a match
    cleaned = cleaned.replace(unicode, ascii)
  }
  // convert diacritics to a two-character form (NFD)
  // http://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html
  cleaned = Normalizer.normalize(cleaned, Normalizer.Form.NFD)
  // remove all characters that combine with the previous character
  // to form a diacritic. Also remove control characters.
  // http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html
  // Strings are immutable, so the result must be assigned back.
  cleaned = cleaned.replaceAll("[\\p{InCombiningDiacriticalMarks}\\p{Cntrl}]", "")
  // size must not change; note that input containing control
  // characters (such as a newline) violates this
  require(cleaned.size == string.size)
  cleaned
}
val substitutions = Set(
(0x00AB, '"'),
(0x00AD, '-'),
(0x00B4, '\''),
(0x00BB, '"'),
(0x00F7, '/'),
(0x01C0, '|'),
(0x01C3, '!'),
(0x02B9, '\''),
(0x02BA, '"'),
(0x02BC, '\''),
(0x02C4, '^'),
(0x02C6, '^'),
(0x02C8, '\''),
(0x02CB, '`'),
(0x02CD, '_'),
(0x02DC, '~'),
(0x0300, '`'),
(0x0301, '\''),
(0x0302, '^'),
(0x0303, '~'),
(0x030B, '"'),
(0x030E, '"'),
(0x0331, '_'),
(0x0332, '_'),
(0x0338, '/'),
(0x0589, ':'),
(0x05C0, '|'),
(0x05C3, ':'),
(0x066A, '%'),
(0x066D, '*'),
(0x200B, ' '),
(0x2010, '-'),
(0x2011, '-'),
(0x2012, '-'),
(0x2013, '-'),
(0x2014, '-'),
(0x2015, '-'),
(0x2016, '|'),
(0x2017, '_'),
(0x2018, '\''),
(0x2019, '\''),
(0x201A, ','),
(0x201B, '\''),
(0x201C, '"'),
(0x201D, '"'),
(0x201E, '"'),
(0x201F, '"'),
(0x2032, '\''),
(0x2033, '"'),
(0x2034, '\''),
(0x2035, '`'),
(0x2036, '"'),
(0x2037, '\''),
(0x2038, '^'),
(0x2039, '<'),
(0x203A, '>'),
(0x203D, '?'),
(0x2044, '/'),
(0x204E, '*'),
(0x2052, '%'),
(0x2053, '~'),
(0x2060, ' '),
(0x20E5, '\\'),
(0x2212, '-'),
(0x2215, '/'),
(0x2216, '\\'),
(0x2217, '*'),
(0x2223, '|'),
(0x2236, ':'),
(0x223C, '~'),
(0x2264, '<'),
(0x2265, '>'),
(0x2266, '<'),
(0x2267, '>'),
(0x2303, '^'),
(0x2329, '<'),
(0x232A, '>'),
(0x266F, '#'),
(0x2731, '*'),
(0x2758, '|'),
(0x2762, '!'),
(0x27E6, '['),
(0x27E8, '<'),
(0x27E9, '>'),
(0x2983, '{'),
(0x2984, '}'),
(0x3003, '"'),
(0x3008, '<'),
(0x3009, '>'),
(0x301B, ']'),
(0x301C, '~'),
(0x301D, '"'),
(0x301E, '"'),
(0xFEFF, ' ')).map { case (unicode, ascii) => (unicode.toChar.toString, ascii.toString) }
}
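Since the question asks about Java, the diacritic-removal step used above (NFD normalization followed by deleting combining marks) might look like this in Java. The class and method names are hypothetical; Normalizer and the InCombiningDiacriticalMarks block are standard JDK features.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class DiacriticStripper {

    // Matches runs of characters from the Combining Diacritical Marks block.
    private static final Pattern MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    /** Decomposes accented letters (NFD) and drops the combining marks. */
    static String strip(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return MARKS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("na\u00EFve caf\u00E9"));
    }
}
```

This only covers marks in that one Unicode block, so letters without a canonical decomposition (for example ø) are left as they are.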
Solution 4
While this does not exactly answer your question, you can convert your Unicode text to US-ASCII, replacing non-ASCII characters with '?' symbols.
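import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

String input = "aáeéiíoóuú"; // 10 chars.
Charset ch = Charset.forName("US-ASCII");
CharsetEncoder enc = ch.newEncoder();
// Substitute '?' for every character that has no US-ASCII encoding.
enc.onUnmappableCharacter(CodingErrorAction.REPLACE);
enc.replaceWith(new byte[]{'?'});
ByteBuffer out = null;
try {
    out = enc.encode(CharBuffer.wrap(input));
} catch (CharacterCodingException e) {
    /* ignored, shouldn't happen */
}
String outStr = ch.decode(out).toString();
// Prints "a?e?i?o?u?"
System.out.println(outStr);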
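A shorter variant of the same idea, offered as a sketch: String.getBytes(Charset) is documented to replace unmappable characters with the charset's default replacement, which for US-ASCII is '?'. The class and method names here are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class AsciiFallback {

    /** Round-trips through US-ASCII; unmappable characters become '?'. */
    static String toAsciiOrQuestionMark(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(toAsciiOrQuestionMark("a\u00E1e\u00E9"));
    }
}
```

The explicit CharsetEncoder version remains useful when you want a different replacement or want unmappable input reported as an error.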
Solution 5
Here's a Python package that does a good job. It's based on a Perl module Text::Unidecode. I assume this could be ported to Java.
http://www.tablix.org/~avian/blog/archives/2009/01/unicode_transliteration_in_python/
Comments
-
schmmd almost 2 years
I am reading some text files in a Java program and would like to replace some Unicode characters with ASCII approximations. These files will eventually be broken into sentences that are fed to OpenNLP. OpenNLP does not recognize Unicode characters and gives improper results on a number of symbols (it tokenizes "girl's" as "girl" and "'s", but if the apostrophe is a Unicode quote the word is treated as a single token).
For example, the source sentence may contain the Unicode directional quotation mark U+2018 (‘), and I would like to convert that to U+0027 ('). Eventually I will be stripping the remaining Unicode.
I understand that I am losing information, and I know that I could write regular expressions to convert each of these symbols, but I am asking if there is code I can reuse to convert some of these symbols.
This is what I could come up with, but I'm sure I will make mistakes/miss things/etc.:
// double quotation (")
replacements.add(new Replacement(Pattern.compile("[\u201c\u201d\u201e\u201f\u275d\u275e]"), "\""));
// single quotation (')
replacements.add(new Replacement(Pattern.compile("[\u2018\u2019\u201a\u201b\u275b\u275c]"), "'"));
replacements is a collection of instances of a custom Replacement class that I later iterate over to apply the replacements.
for (Replacement replacement : replacements) {
    text = replacement.pattern.matcher(text).replaceAll(replacement.replacement);
}
As you can see, I had to find:
- LEFT SINGLE QUOTATION MARK
- RIGHT SINGLE QUOTATION MARK
- SINGLE LOW-9 QUOTATION MARK (what is this/should I replace this?)
- SINGLE HIGH-REVERSED-9 QUOTATION MARK (what is this/should I replace this?)
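For reference, a self-contained version of the snippets above could look like the following sketch. The question does not show the Replacement class, so its shape here (a pattern plus a replacement string) is an assumption, as is the QuoteReplacer wrapper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Assumed shape of the custom class: a compiled pattern and its replacement.
final class Replacement {
    final Pattern pattern;
    final String replacement;

    Replacement(Pattern pattern, String replacement) {
        this.pattern = pattern;
        this.replacement = replacement;
    }
}

final class QuoteReplacer {

    private static final List<Replacement> REPLACEMENTS = new ArrayList<>();
    static {
        // double quotation (")
        REPLACEMENTS.add(new Replacement(
                Pattern.compile("[\u201c\u201d\u201e\u201f\u275d\u275e]"), "\""));
        // single quotation (')
        REPLACEMENTS.add(new Replacement(
                Pattern.compile("[\u2018\u2019\u201a\u201b\u275b\u275c]"), "'"));
    }

    /** Applies every replacement in order and returns the result. */
    static String apply(String text) {
        for (Replacement r : REPLACEMENTS) {
            text = r.pattern.matcher(text).replaceAll(r.replacement);
        }
        return text;
    }

    public static void main(String[] args) {
        System.out.println(apply("\u2018a\u2019 \u201Cb\u201D"));
    }
}
```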
-
Mu Mind over 13 years
Are you looking for a library and/or example code in a particular language? Or are you looking for a pre-existing mapping of Unicode characters onto ASCII approximations? I'm not sure what the difference is between a regex and code you can reuse.
-
schmmd over 13 years
I am looking for a Java library. I can write regular expressions, but I'm sure I will miss something in the process. I was wondering if someone else has already made decisions for me. Have you been reading GEB, Mu Mind?
-
user833970 about 10 years
those unicode links are dead
-
schmmd over 13 years
I remove diacritics with Normalizer.normalize(text, Normalizer.Form.NFD) followed by a replace with Pattern.compile("\\p{InCombiningDiacriticalMarks}+").
-
Triynko about 13 years
I'm resorting to just using a custom map, with as many characters as I can define, because the Unicode categories assigned to basic characters seem inadequate. For example, the basic single and double quote characters (the ones you type into notepad using your keyboard for example) are categorized as "Punctuation Other", rather than the Punctuation Initial and Punctuation Final categories that you'd expect them to be categorized under.
-
Triynko about 13 years
With this solution, basic punctuation marks like quotes that ought to be mapped are not mapped to the ASCII quote. Many other Unicode characters that you would say "this is basically the same thing as this ASCII character" will not get mapped properly. Therefore, I think that using a custom map with all reasonable replacements would achieve better results.
-
Stephen P about 13 years
@Triynko - the problem there is: there is only one "normal" (ASCII) single quote and one double quote, so marking it as either INITIAL or FINAL quote punctuation would also be wrong.
-
Dirk Groeneveld over 9 years
I translated that list to Scala and put it here: gist.github.com/dirkgr/6349f379740880209475
-
Dirk Groeneveld over 9 years
@schmmd has a more comprehensive version below.
-
slawek almost 9 years
You have a bug: replaceAll doesn't mutate string. You need to assign result of replaceAll back to cleaned.