Convert between Unicode Normalization Forms on the unix command-line

command-line text-processing conversion unicode

8,831

Solution 1

You can use the uconv utility from ICU. Normalization is achieved through transliteration (-x).

$ uconv -x any-nfd <<<ä | hd
00000000  61 cc 88 0a                                       |a...|
00000004
$ uconv -x any-nfc <<<ä | hd
00000000  c3 a4 0a                                          |...|
00000003

On Debian, Ubuntu and other derivatives, uconv is in the icu-devtools package (older releases shipped it in libicu-dev). On Fedora, Red Hat and other derivatives, and in BSD ports, it's in the icu package.
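
If uconv isn't installed, the byte sequences in the hex dumps above can be cross-checked with a few lines of Python. This is only a sketch: the helper name utf8_hex is made up for illustration, and it relies on nothing but the standard unicodedata module.

```python
import unicodedata

def utf8_hex(text, form):
    """Normalize text to the given form and return its UTF-8 bytes as hex."""
    data = unicodedata.normalize(form, text).encode("utf-8")
    return " ".join(f"{byte:02x}" for byte in data)

# U+00E4 is the precomposed form of the character used in the uconv example.
print(utf8_hex("\u00e4", "NFD"))  # 61 cc 88
print(utf8_hex("\u00e4", "NFC"))  # c3 a4
```

Unlike the uconv output, there is no trailing 0a here because no newline is added to the input.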

Solution 2

Python has the unicodedata module in its standard library, which lets you convert between Unicode normalization forms through the unicodedata.normalize() function:

import unicodedata

s1 = 'Spicy Jalape\u00f1o'
s2 = 'Spicy Jalapen\u0303o'

t1 = unicodedata.normalize('NFC', s1)
t2 = unicodedata.normalize('NFC', s2)
print(t1 == t2) 
print(ascii(t1)) 

t3 = unicodedata.normalize('NFD', s1)
t4 = unicodedata.normalize('NFD', s2)
print(t3 == t4)
print(ascii(t3))

Running with Python 3.x:

$ python3 test.py
True
'Spicy Jalape\xf1o'
True
'Spicy Jalapen\u0303o'
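
The same normalize() call can be wrapped into a small filter, which is closer to what the question asks for. The following is a minimal sketch, not a polished tool: normalize_stream is a hypothetical name, and the demonstration uses in-memory streams so it runs anywhere. In a real script you would pass sys.stdin and sys.stdout instead, assuming the locale encoding is UTF-8.

```python
import io
import unicodedata

def normalize_stream(src, dst, form="NFC"):
    """Copy src to dst line by line, normalizing each line to the given form."""
    # Working per line is safe here: a newline never combines with its neighbours.
    for line in src:
        dst.write(unicodedata.normalize(form, line))

# Demonstration on in-memory text streams (stand-ins for stdin and stdout).
src = io.StringIO("Jalapen\u0303o\n")
dst = io.StringIO()
normalize_stream(src, dst, "NFC")
print(ascii(dst.getvalue()))  # 'Jalape\xf1o\n'
```

The form argument accepts the four names that unicodedata.normalize() understands: NFC, NFD, NFKC and NFKD.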

Python isn't well suited to shell one-liners, but it can be done if you don't want to create a separate script:

$ python3 -c $'import unicodedata\nprint(unicodedata.normalize("NFC", "ääääää"))'
ääääää

For Python 2.x you have to add an encoding line (# -*- coding: utf-8 -*-) and mark strings as Unicode with the u prefix:

$ python -c $'# -*- coding: utf-8 -*-\nimport unicodedata\nprint(unicodedata.normalize("NFC", u"ääääää"))'
ääääää

Solution 3

For completeness, with perl:

$ perl -CSA -MUnicode::Normalize=NFD -e 'print NFD($_) for @ARGV' $'\ue1' | uconv -x name
\N{LATIN SMALL LETTER A}\N{COMBINING ACUTE ACCENT}
$ perl -CSA -MUnicode::Normalize=NFC -e 'print NFC($_) for @ARGV' $'a\u301' | uconv -x name
\N{LATIN SMALL LETTER A WITH ACUTE}
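
The uconv -x name step is only used to make the result visible. If uconv is not at hand, a rough Python equivalent could look like this. It is a sketch under the assumption that a per-code-point listing is good enough, and code_point_names is an illustrative name, not part of any library.

```python
import unicodedata

def code_point_names(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]

# Same input as the perl example: U+00E1, shown in both forms.
for form in ("NFD", "NFC"):
    print(form, code_point_names(unicodedata.normalize(form, "\u00e1")))
```

The second argument to unicodedata.name() is a fallback for code points without a name, such as control characters.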

Solution 4

Check the bytes of the input with hexdump (here ä is in decomposed form):

echo  -e "ä\c" |hexdump -C 

00000000  61 cc 88                                          |a..|
00000003

Convert with iconv and check again with hexdump. Note that the UTF-8-MAC encoding is only available in the iconv shipped with macOS:

echo -e "ä\c" | iconv -f UTF-8-MAC -t UTF-8 |hexdump -C

00000000  c3 a4                                             |..|
00000002

printf '\xc3\xa4'
ä

Solution 5

There's a perl utility called Charlint available from

https://www.w3.org/International/charlint/

which does what you want. You'll also have to download a file from

ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt

After the first run you'll see Charlint complaining about incompatible entries in that file so you'll have to delete those lines from UnicodeData.txt.

glts

Updated on September 18, 2022

Comments

glts over 1 year
In Unicode, some character combinations have more than one representation.

For example, the character ä can be represented as
- "ä", that is the codepoint U+00E4 (two bytes c3 a4 in UTF-8 encoding), or as
- "ä", that is the two codepoints U+0061 U+0308 (three bytes 61 cc 88 in UTF-8).
According to the Unicode standard, the two representations are equivalent but in different "normalization forms", see UAX #15: Unicode Normalization Forms.
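
To make the equivalence concrete, here is a short Python sketch of the two representations described above. The variable names are arbitrary; only the standard unicodedata module is assumed.

```python
import unicodedata

composed = "\u00e4"     # U+00E4
decomposed = "a\u0308"  # U+0061 U+0308

# As raw code point sequences the two strings differ...
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 1 2

# ...but each normalization form maps both to a single representation.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```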

The unix toolbox has all kinds of text transformation tools, sed, tr, iconv, Perl come to mind. How can I do quick and easy NF conversion on the command-line?
- goldilocks over 10 years
  
  Looks like there is a "Unicode::Normalization" module for perl which should do this kind of thing: search.cpan.org/~sadahiro/Unicode-Normalize-1.16/Normalize.pm
- mirabilos over 7 years
  
  @goldilocks if it had a CLI… I mean, I do perl -MUnicode::Normalization -e 'print NFC(… er what comes here now…
glts over 10 years

This works, thanks. You have to install a 30M dev library alongside it though. What's worse, I haven't been able to find proper documentation for uconv itself: where did you find any-nfd? It looks like development of this tool has been abandoned, last update was in 2005.
Gilles 'SO- stop being evil' over 10 years

@glts I found any-nfd by browsing through the list displayed by uconv -L.
antonone over 7 years

This only works on macOS. There is no 'utf-8-mac' on Linux, on FreeBSDs, etc. Also, decomposition by using this encoding does not follow the specification (it does follow the macOS filesystem normalization algorithm though). More info: search.cpan.org/~tomita/Encode-UTF8Mac-0.04/lib/Encode/…
roaima over 6 years

@antonone to be fair though there was no OS specified in the question.
antonone over 6 years

@roaima Yes, that's why I've assumed that the answer should work on all systems that are based on Unix/Linux. The answer above works only on macOS. If one's looking for a macOS-specific answer, then it'll work, in part. I just wanted to point that out, because the other day I've lost some time wondering why I have no utf-8-mac on Linux and if this is normal.
Peter Krauss over 5 years

On Ubuntu I used sudo apt install icu-devtools to get uconv and ran uconv -x any-nfc, but it did not solve even the simplest problem: a bugText.txt file containing "Iglésias, Bad-á, Good-á", converted with uconv -x any-nfc bugText.txt > goodText.txt, stays the same text.
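
One possible explanation for a file that "stays the same" is that it was already in NFC, so there was nothing to convert. That is a guess, not a diagnosis of the file above. A quick way to check is sketched below; nfc_report is a hypothetical helper and the sample strings are made up for illustration.

```python
import unicodedata

def nfc_report(text):
    """Return (already_nfc, marks): whether text is unchanged by NFC,
    and the combining marks it contains as 'U+XXXX' strings."""
    already_nfc = unicodedata.normalize("NFC", text) == text
    # combining() returns a non-zero class for combining marks.
    marks = [f"U+{ord(ch):04X}" for ch in text if unicodedata.combining(ch)]
    return already_nfc, marks

print(nfc_report("Bad-a\u0301"))  # (False, ['U+0301'])
print(nfc_report("Good-\u00e1"))  # (True, [])
```

If the first value is True for the whole file, uconv -x any-nfc is expected to leave it unchanged.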