Replacing unicode punctuation with ASCII approximations


Solution 1

Each Unicode character is assigned a general category. There exist two separate categories for quotes: Punctuation, Initial quote (Pi) and Punctuation, Final quote (Pf).

With these lists, you should be able to handle all quotes appropriately, if you would like to code the regex manually.

Java's Character.getType gives you the category of a character, for example Character.FINAL_QUOTE_PUNCTUATION.

Now you can get the category of each punctuation character and replace it with an appropriate ASCII substitute.

You can handle the other punctuation categories in the same way. 'Punctuation, Other' contains some characters, for example PRIME (U+2032), which you may also want to replace with an apostrophe.
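
A minimal sketch of the category-based approach, assuming Java 8 or later. The class name QuoteCategories, the method asciiQuotes, and the helper isDoubleQuote are hypothetical names chosen for illustration. The category alone does not say whether a quote is single or double, so the sketch assumes a small hand-picked list of double-quote code points and treats every other initial or final quote as a single quote; that choice may be too coarse for some inputs.

```java
final class QuoteCategories {
    // Hypothetical helper: replaces characters in the initial/final quote
    // categories, plus PRIME, with ASCII quotes. Everything else is kept.
    static String asciiQuotes(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            int type = Character.getType(cp);
            if (type == Character.INITIAL_QUOTE_PUNCTUATION
                    || type == Character.FINAL_QUOTE_PUNCTUATION) {
                sb.append(isDoubleQuote(cp) ? '"' : '\'');
            } else if (cp == 0x2032) {
                // PRIME is in 'Punctuation, Other', so it needs its own rule
                sb.append('\'');
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    // Assumption: only these initial/final quotes are treated as double quotes.
    private static boolean isDoubleQuote(int cp) {
        switch (cp) {
            case 0x00AB: case 0x00BB: case 0x201C: case 0x201D: case 0x201F:
                return true;
            default:
                return false;
        }
    }
}
```

The plain ASCII quotes are in 'Punctuation, Other', so this sketch leaves them untouched.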

Solution 2

I found a pretty extensive table that maps Unicode punctuation characters to their closest ASCII equivalents.

Here's more info: Map Symbols & Punctuation to ASCII.
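
A table like this could be applied in Java roughly as follows. This is only a sketch: the class PunctuationMap and the method toAscii are hypothetical names, and the handful of entries shown are taken from the list in Solution 3 on this page, not from the linked table itself.

```java
import java.util.HashMap;
import java.util.Map;

final class PunctuationMap {
    // Illustrative subset of a code point -> ASCII table; a real table
    // would contain many more entries.
    private static final Map<Integer, Character> TABLE = new HashMap<>();
    static {
        TABLE.put(0x00AB, '"');   // left-pointing double angle quotation mark
        TABLE.put(0x00BB, '"');   // right-pointing double angle quotation mark
        TABLE.put(0x2013, '-');   // en dash
        TABLE.put(0x2014, '-');   // em dash
        TABLE.put(0x2018, '\'');  // left single quotation mark
        TABLE.put(0x2019, '\'');  // right single quotation mark
        TABLE.put(0x201C, '"');   // left double quotation mark
        TABLE.put(0x201D, '"');   // right double quotation mark
        TABLE.put(0x2044, '/');   // fraction slash
        TABLE.put(0x2212, '-');   // minus sign
    }

    // Replaces every code point found in the table; all others pass through.
    static String toAscii(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            Character ascii = TABLE.get(cp);
            if (ascii != null) {
                sb.append(ascii.charValue());
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }
}
```

Because each entry maps one character to one character, the string length stays the same for text in the Basic Multilingual Plane.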

Solution 3

I followed @marek-stoj's link and created a Scala application that cleans Unicode out of strings while maintaining the string length. It removes diacritics (accents) and uses the map suggested by @marek-stoj to convert non-ASCII Unicode characters to their ASCII approximations.

import java.text.Normalizer

object Asciifier {
  def apply(string: String): String = {
    var cleaned = string
    // String.replace substitutes literally; replaceAll would interpret the
    // backslash replacement as a regex escape.
    for ((unicode, ascii) <- substitutions) {
      cleaned = cleaned.replace(unicode, ascii)
    }

    // convert diacritics to a two-character form (NFD)
    // http://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html
    cleaned = Normalizer.normalize(cleaned, Normalizer.Form.NFD)

    // remove all characters that combine with the previous character
    // to form a diacritic.  Also remove control characters.
    // http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html
    // Strings are immutable, so the result must be assigned back.
    cleaned = cleaned.replaceAll("[\\p{InCombiningDiacriticalMarks}\\p{Cntrl}]", "")

    // size must not change
    require(cleaned.size == string.size)

    cleaned
  }

  val substitutions = Set(
      (0x00AB, '"'),
      (0x00AD, '-'),
      (0x00B4, '\''),
      (0x00BB, '"'),
      (0x00F7, '/'),
      (0x01C0, '|'),
      (0x01C3, '!'),
      (0x02B9, '\''),
      (0x02BA, '"'),
      (0x02BC, '\''),
      (0x02C4, '^'),
      (0x02C6, '^'),
      (0x02C8, '\''),
      (0x02CB, '`'),
      (0x02CD, '_'),
      (0x02DC, '~'),
      (0x0300, '`'),
      (0x0301, '\''),
      (0x0302, '^'),
      (0x0303, '~'),
      (0x030B, '"'),
      (0x030E, '"'),
      (0x0331, '_'),
      (0x0332, '_'),
      (0x0338, '/'),
      (0x0589, ':'),
      (0x05C0, '|'),
      (0x05C3, ':'),
      (0x066A, '%'),
      (0x066D, '*'),
      (0x200B, ' '),
      (0x2010, '-'),
      (0x2011, '-'),
      (0x2012, '-'),
      (0x2013, '-'),
      (0x2014, '-'),
      (0x2015, '-'),
      (0x2016, '|'),
      (0x2017, '_'),
      (0x2018, '\''),
      (0x2019, '\''),
      (0x201A, ','),
      (0x201B, '\''),
      (0x201C, '"'),
      (0x201D, '"'),
      (0x201E, '"'),
      (0x201F, '"'),
      (0x2032, '\''),
      (0x2033, '"'),
      (0x2034, '\''),
      (0x2035, '`'),
      (0x2036, '"'),
      (0x2037, '\''),
      (0x2038, '^'),
      (0x2039, '<'),
      (0x203A, '>'),
      (0x203D, '?'),
      (0x2044, '/'),
      (0x204E, '*'),
      (0x2052, '%'),
      (0x2053, '~'),
      (0x2060, ' '),
      (0x20E5, '\\'),
      (0x2212, '-'),
      (0x2215, '/'),
      (0x2216, '\\'),
      (0x2217, '*'),
      (0x2223, '|'),
      (0x2236, ':'),
      (0x223C, '~'),
      (0x2264, '<'),
      (0x2265, '>'),
      (0x2266, '<'),
      (0x2267, '>'),
      (0x2303, '^'),
      (0x2329, '<'),
      (0x232A, '>'),
      (0x266F, '#'),
      (0x2731, '*'),
      (0x2758, '|'),
      (0x2762, '!'),
      (0x27E6, '['),
      (0x27E8, '<'),
      (0x27E9, '>'),
      (0x2983, '{'),
      (0x2984, '}'),
      (0x3003, '"'),
      (0x3008, '<'),
      (0x3009, '>'),
      (0x301B, ']'),
      (0x301C, '~'),
      (0x301D, '"'),
      (0x301E, '"'),
      (0xFEFF, ' ')).map { case (unicode, ascii) => (unicode.toChar.toString, ascii.toString) }
}
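
Since the question asks about Java, the diacritic-stripping step used above (decompose with NFD, then delete the combining marks) might look like this in Java. The class Diacritics and the method strip are hypothetical names; the sketch covers only that one step and leaves out the substitution table and the control-character removal.

```java
import java.text.Normalizer;

final class Diacritics {
    // Decomposes accented letters into base letter + combining mark (NFD),
    // then removes the combining marks. The result is assigned and returned
    // because Java strings are immutable.
    static String strip(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }
}
```

This only affects characters that decompose into a base letter plus a mark from the Combining Diacritical Marks block; letters such as ø or ß have no such decomposition and pass through unchanged.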

Solution 4

While this does not exactly answer your question, you can convert your Unicode text to US-ASCII, replacing each non-ASCII character with a '?' symbol.

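import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

String input = "aáeéiíoóuú"; // 10 chars.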

Charset ch = Charset.forName("US-ASCII");
CharsetEncoder enc = ch.newEncoder();
enc.onUnmappableCharacter(CodingErrorAction.REPLACE);
enc.replaceWith(new byte[]{'?'});

ByteBuffer out = null;

try {
    out = enc.encode(CharBuffer.wrap(input));
} catch (CharacterCodingException e) { 
    /* ignored, shouldn't happen */ 
}

String outStr = ch.decode(out).toString();

// Prints "a?e?i?o?u?"
System.out.println(outStr);

Solution 5

Here's a Python package that does a good job. It's based on a Perl module Text::Unidecode. I assume this could be ported to Java.

http://www.tablix.org/~avian/blog/archives/2009/01/unicode_transliteration_in_python/

http://pypi.python.org/pypi/Unidecode


Author: schmmd

Scala cowboy.

Updated on May 17, 2022

Comments

  • schmmd
    schmmd almost 2 years

    I am reading some text files in a Java program and would like to replace some Unicode characters with ASCII approximations. These files will eventually be broken into sentences that are fed to OpenNLP. OpenNLP does not recognize Unicode characters and gives improper results on a number of symbols (it tokenizes "girl's" as "girl" and "'s", but if the apostrophe is a Unicode quote, the whole word is treated as a single token).

    For example, the source sentence may contain the Unicode directional quotation mark U+2018 (‘) and I would like to convert that to U+0027 ('). Eventually I will be stripping the remaining Unicode.

    I understand that I am losing information, and I know that I could write regular expressions to convert each of these symbols, but I am asking if there is code I can reuse to convert some of these symbols.

    This is what I could come up with, but I'm sure I will make mistakes/miss things/etc.:

        // double quotation (")
        replacements.add(new Replacement(Pattern.compile("[\u201c\u201d\u201e\u201f\u275d\u275e]"), "\""));
    
        // single quotation (')
        replacements.add(new Replacement(Pattern.compile("[\u2018\u2019\u201a\u201b\u275b\u275c]"), "'"));
    

    replacements is a collection of instances of a custom Replacement class that I later iterate over to apply the replacements.

        for (Replacement replacement : replacements) {
             text = replacement.pattern.matcher(text).replaceAll(replacement.replacement);
        }
    

    As you can see, I had to find:

    • LEFT SINGLE QUOTATION MARK
    • RIGHT SINGLE QUOTATION MARK
    • SINGLE LOW-9 QUOTATION MARK (what is this/should I replace this?)
    • SINGLE HIGH-REVERSED-9 QUOTATION MARK (what is this/should I replace this?)
    • Mu Mind
      Mu Mind over 13 years
      Are you looking for a library and/or example code in a particular language? Or are you looking for a pre-existing mapping of Unicode characters onto ASCII approximations? I'm not sure what the difference is between a regex and code you can reuse.
    • schmmd
      schmmd over 13 years
      I am looking for a Java library. I can write regular expressions, but I'm sure I will miss something in the process. I was wondering if someone else has already made decisions for me. Have you been reading GEB, Mu Mind?
    • user833970
      user833970 about 10 years
      those unicode links are dead
  • schmmd
    schmmd over 13 years
    I remove diacritics with Normalizer.normalize(text, Normalizer.Form.NFD) followed by a replace with Pattern.compile("\\p{InCombiningDiacriticalMarks}+").
  • Triynko
    Triynko about 13 years
    I'm resorting to just using a custom map, with as many characters as I can define, because the Unicode categories assigned to basic characters seem inadequate. For example, the basic single and double quote characters (the ones you type into notepad using your keyboard for example) are categorized as "Punctuation Other", rather than the Punctuation Initial and Punctuation Final categories that you'd expect them to be categorized under.
  • Triynko
    Triynko about 13 years
    With this solution, basic punctuation marks like quotes that ought to be mapped are not mapped to the ASCII quote. Many other Unicode characters that you would say "this is basically the same thing as this ASCII character" will not get mapped properly. Therefore, I think that using a custom map with all reasonable replacements would achieve better results.
  • Stephen P
    Stephen P about 13 years
    @Triynko - the problem there is: there is only one "normal" (ASCII) single quote and one double quote, so marking it as either INITIAL or FINAL quote punctuation would also be wrong.
  • Dirk Groeneveld
    Dirk Groeneveld over 9 years
    I translated that list to Scala and put it here: gist.github.com/dirkgr/6349f379740880209475
  • Dirk Groeneveld
    Dirk Groeneveld over 9 years
    @schmmd has a more comprehensive version below.
  • slawek
    slawek almost 9 years
    You have a bug: replaceAll doesn't mutate string. You need to assign result of replaceAll back to cleaned.