Python and character normalization

11,896

Solution 1

I recommend using the Unidecode module:

>>> from unidecode import unidecode
>>> unidecode(u'ıöüç')
'iouc'

Note that you feed it a unicode string and, on Python 2, it returns a byte string (on Python 3 it returns an ordinary str). Either way, the output is guaranteed to be ASCII.

Solution 2

It all depends on how far you want to go in transliterating the result. If you want to convert everything all the way to ASCII (αβγ to abg) then unidecode is the way to go.

If you just want to remove accents from accented letters, you could decompose your string using normalization form NFKD (this converts the accented letter á to a plain letter a followed by U+0301 COMBINING ACUTE ACCENT) and then discard the accents, which belong to the Unicode general category Mn ("Mark, nonspacing").

import unicodedata

def remove_nonspacing_marks(s):
    "Decompose the unicode string s and remove non-spacing marks."
    return ''.join(c for c in unicodedata.normalize('NFKD', s)
                   if unicodedata.category(c) != 'Mn')
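
As a rough usage sketch of the function above (Python 3 assumed, and the sample strings are only illustrations): letters that decompose into a base letter plus a combining mark lose the mark, while a letter with no decomposition, such as the dotless ı (U+0131), passes through unchanged. That is also why printing the NFKD-normalized string alone looks identical to the input.

```python
import unicodedata

def remove_nonspacing_marks(s):
    """Decompose s with NFKD and drop non-spacing marks (category Mn)."""
    decomposed = unicodedata.normalize('NFKD', s)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

# ö, ü and ç each decompose into a base letter plus a combining mark.
print(remove_nonspacing_marks('öüç'))   # ouc

# The dotless ı has no decomposition, so it is left as-is.
print(remove_nonspacing_marks('ıöüç'))  # ıouc
```

If ı must become i as well, this approach is not enough on its own; an explicit replacement or a transliteration library would be needed.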

Solution 3

The simplest way I found:

unicodedata.normalize('NFKD', s).encode("ascii", "ignore")
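
A small sketch of how this one-liner behaves, assuming Python 3 (where encode returns bytes, so the result is decoded back to str here). The helper name to_ascii_lossy is made up for illustration. Note that "ignore" silently drops every character that is still non-ASCII after decomposition, rather than transliterating it.

```python
import unicodedata

def to_ascii_lossy(s):
    """NFKD-decompose s, then drop anything that is not ASCII."""
    decomposed = unicodedata.normalize('NFKD', s)
    # encode() returns bytes on Python 3; decode to get a str back.
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(to_ascii_lossy('öüç'))   # ouc
print(to_ascii_lossy('ıöüç'))  # ouc  (the dotless ı is dropped, not mapped to i)
print(to_ascii_lossy('αβγ'))   # prints an empty line: nothing survives
```

So this is simple, but lossy for any letter that has no ASCII base form.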


Author by

Hellnar

Updated on April 13, 2020

Comments

  • Hellnar
    Hellnar about 4 years

    Hello, I retrieve text-based UTF-8 data from an external source, and it contains special characters such as u"ıöüç". I want to normalize them to English, e.g. "ıöüç" -> "iouc". What would be the best way to achieve this?

    • soulseekah
      soulseekah over 13 years
      Search and replace, maybe. You'll need a full mapping of special to normal characters, though.
  • Hellnar
    Hellnar over 13 years
    I am a bit confused: I have tried print unicodedata.normalize('NFKD', u"ıöüç") and the output is again: ıöüç
  • soulseekah
    soulseekah over 13 years
    Try the other valid values ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’.
  • drxzcl
    drxzcl over 13 years
    I'm fairly certain this is a different operation from what the asker is looking for. But yes, this is Unicode normalization. What he is asking for is not usually referred to as such.
  • soulseekah
    soulseekah over 13 years
    Right. It's referred to as transliteration, I think.