Convert between Unicode Normalization Forms on the unix command-line


Solution 1

You can use the uconv utility from ICU. Normalization is achieved through transliteration (-x).

$ uconv -x any-nfd <<<ä | hd
00000000  61 cc 88 0a                                       |a...|
00000004
$ uconv -x any-nfc <<<ä | hd
00000000  c3 a4 0a                                          |...|
00000003

On Debian, Ubuntu and other derivatives, uconv is in the icu-devtools package (older releases shipped it in libicu-dev). On Fedora, Red Hat and other derivatives, and in BSD ports, it's in the icu package.
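
The byte sequences in the dumps above can be cross-checked without ICU. This is only a minimal sketch, assuming Python 3 is installed; the variable names are illustrative and not part of any tool mentioned here.

```python
import unicodedata

composed = "\u00e4"  # "ä" as a single code point (U+00E4)

# NFD splits it into "a" (U+0061) followed by COMBINING DIAERESIS (U+0308).
nfd = unicodedata.normalize("NFD", composed)
# NFC recombines the pair into the single code point.
nfc = unicodedata.normalize("NFC", nfd)

print(nfd.encode("utf-8").hex())  # 61cc88
print(nfc.encode("utf-8").hex())  # c3a4
```

The hex strings match the uconv dumps, minus the trailing newline (0a) that the here-string adds.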

Solution 2

Python has the unicodedata module in its standard library, which converts between Unicode normalization forms through the unicodedata.normalize() function:

import unicodedata

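s1 = 'Spicy Jalape\u00f1o'   # precomposed n with tilde (U+00F1)
s2 = 'Spicy Jalapen\u0303o'  # n followed by COMBINING TILDE (U+0303)

# NFC composes: both strings end up in the precomposed form.
t1 = unicodedata.normalize('NFC', s1)
t2 = unicodedata.normalize('NFC', s2)
print(t1 == t2)
print(ascii(t1))

# NFD decomposes: both strings end up as base letter plus combining mark.
t3 = unicodedata.normalize('NFD', s1)
t4 = unicodedata.normalize('NFD', s2)
print(t3 == t4)
print(ascii(t3))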

Running with Python 3.x:

$ python3 test.py
True
'Spicy Jalape\xf1o'
True
'Spicy Jalapen\u0303o'
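
The same call can be wrapped so that it behaves like a filter. The sketch below is one possible shape, not an established tool: normalize_stream is a hypothetical helper name, and it assumes the input is valid UTF-8 and small enough to read in one go.

```python
import io
import unicodedata


def normalize_stream(src, dst, form="NFC"):
    """Read UTF-8 bytes from src, normalize them, and write UTF-8 bytes to dst."""
    text = src.read().decode("utf-8")
    dst.write(unicodedata.normalize(form, text).encode("utf-8"))


# Demonstration on in-memory streams: a decomposed "ä" plus a newline goes in.
source = io.BytesIO(b"a\xcc\x88\n")
sink = io.BytesIO()
normalize_stream(source, sink, "NFC")
print(sink.getvalue())  # b'\xc3\xa4\n'
```

Calling normalize_stream(sys.stdin.buffer, sys.stdout.buffer, form) from a script would turn this into a stdin-to-stdout filter. Reading the whole input at once keeps a base letter and its combining marks from being split across chunks.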

Python isn't well suited to shell one-liners, but it can be done if you don't want to create a separate script:

$ python3 -c $'import unicodedata\nprint(unicodedata.normalize("NFC", "ääääää"))'
ääääää

For Python 2.x you have to add an encoding line (# -*- coding: utf-8 -*-) and mark strings as Unicode with the u prefix:

$ python -c $'# -*- coding: utf-8 -*-\nimport unicodedata\nprint(unicodedata.normalize("NFC", u"ääääää"))'
ääääää

Solution 3

For completeness, with perl:

$ perl -CSA -MUnicode::Normalize=NFD -e 'print NFD($_) for @ARGV' $'\ue1' | uconv -x name
\N{LATIN SMALL LETTER A}\N{COMBINING ACUTE ACCENT}
$ perl -CSA -MUnicode::Normalize=NFC -e 'print NFC($_) for @ARGV' $'a\u301' | uconv -x name
\N{LATIN SMALL LETTER A WITH ACUTE}
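
When uconv is not installed, the code point names can also be listed with Python's unicodedata module. This is a rough counterpart to piping through uconv -x name, not a replacement for it; code_point_names is a hypothetical helper, and unnamed code points fall back to a U+XXXX label.

```python
import unicodedata


def code_point_names(text):
    """Return the Unicode name of every code point in text."""
    return [unicodedata.name(ch, "U+%04X" % ord(ch)) for ch in text]


print(code_point_names(unicodedata.normalize("NFD", "\u00e1")))
# ['LATIN SMALL LETTER A', 'COMBINING ACUTE ACCENT']
print(code_point_names(unicodedata.normalize("NFC", "a\u0301")))
# ['LATIN SMALL LETTER A WITH ACUTE']
```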

Solution 4

Check the bytes of the decomposed character with hexdump:

echo  -e "ä\c" |hexdump -C 

00000000  61 cc 88                                          |a..|
00000003  

Convert with iconv and check again with hexdump (the UTF-8-MAC encoding exists only in macOS's iconv, so this does not work on Linux or FreeBSD):

echo -e "ä\c" | iconv -f UTF-8-MAC -t UTF-8 |hexdump -C

00000000  c3 a4                                             |..|
00000002

printf '\xc3\xa4'
ä
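
The same check, convert, check again sequence can be sketched portably in Python. Two assumptions apply: unicodedata.is_normalized() requires Python 3.8 or later, and this applies standard NFC, which is not identical to the macOS-specific UTF-8-MAC conversion.

```python
import unicodedata

decomposed = "a\u0308"  # "a" followed by COMBINING DIAERESIS

# Check first: is the string already in NFC?
print(unicodedata.is_normalized("NFC", decomposed))  # False

# Convert, then check again.
composed = unicodedata.normalize("NFC", decomposed)
print(unicodedata.is_normalized("NFC", composed))  # True
print(composed.encode("utf-8"))  # b'\xc3\xa4'
```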

Solution 5

There's a Perl utility called Charlint, available from

https://www.w3.org/International/charlint/

which does what you want. You'll also have to download a file from

ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt

After the first run you'll see Charlint complaining about incompatible entries in that file, so you'll have to delete those lines from UnicodeData.txt.


Author: glts

Updated on September 18, 2022

Comments

  • glts over 1 year

    In Unicode, some character combinations have more than one representation.

    For example, the character ä can be represented as

    • "ä", that is the codepoint U+00E4 (two bytes c3 a4 in UTF-8 encoding), or as
    • "ä", that is the two codepoints U+0061 U+0308 (three bytes 61 cc 88 in UTF-8).

    According to the Unicode standard, the two representations are equivalent but in different "normalization forms", see UAX #15: Unicode Normalization Forms.

    The unix toolbox has all kinds of text transformation tools, sed, tr, iconv, Perl come to mind. How can I do quick and easy NF conversion on the command-line?

  • glts over 10 years
    This works, thanks. You have to install a 30M dev library alongside it though. What's worse, I haven't been able to find proper documentation for uconv itself: where did you find any-nfd? It looks like development of this tool has been abandoned, last update was in 2005.
  • Gilles 'SO- stop being evil' over 10 years
    @glts I found any-nfd by browsing through the list displayed by uconv -L.
  • antonone over 7 years
    This only works on macOS. There is no 'utf-8-mac' on Linux, on FreeBSDs, etc. Also, decomposition by using this encoding does not follow the specification (it does follow the macOS filesystem normalization algorithm though). More info: search.cpan.org/~tomita/Encode-UTF8Mac-0.04/lib/Encode/…
  • roaima over 6 years
    @antonone to be fair though there was no OS specified in the question.
  • antonone over 6 years
    @roaima Yes, that's why I've assumed that the answer should work on all systems that are based on Unix/Linux. The answer above works only on macOS. If one's looking for a macOS-specific answer, then it'll work, in part. I just wanted to point that out, because the other day I've lost some time wondering why I have no utf-8-mac on Linux and if this is normal.
  • Peter Krauss over 5 years
    On Ubuntu using sudo apt install icu-devtools to run uconv -x any-nfc, but not solve the simplest problem, e.g. a bugText.txt file with "Iglésias, Bad-á, Good-á" converted by uconv -x any-nfc bugText.txt > goodText.txt stay the same text.