Convert between Unicode Normalization Forms on the unix command-line
Solution 1
You can use the uconv utility from ICU. Normalization is achieved through transliteration (-x).
$ uconv -x any-nfd <<<ä | hd
00000000 61 cc 88 0a |a...|
00000004
$ uconv -x any-nfc <<<ä | hd
00000000 c3 a4 0a |...|
00000003
On Debian, Ubuntu and other derivatives, uconv is in the libicu-dev package. On Fedora, Red Hat and other derivatives, and in BSD ports, it's in the icu package.
Solution 2
Python has the unicodedata module in its standard library, which lets you convert between Unicode normalization forms through the unicodedata.normalize() function:
import unicodedata
s1 = 'Spicy Jalape\u00f1o'
s2 = 'Spicy Jalapen\u0303o'
t1 = unicodedata.normalize('NFC', s1)
t2 = unicodedata.normalize('NFC', s2)
print(t1 == t2)
print(ascii(t1))
t3 = unicodedata.normalize('NFD', s1)
t4 = unicodedata.normalize('NFD', s2)
print(t3 == t4)
print(ascii(t3))
Running with Python 3.x:
$ python3 test.py
True
'Spicy Jalape\xf1o'
True
'Spicy Jalapen\u0303o'
Python isn't well suited to shell one-liners, but it can be done if you don't want to create an external script:
$ python3 -c $'import unicodedata\nprint(unicodedata.normalize("NFC", "ääääää"))'
ääääää
For Python 2.x you have to add an encoding line (# -*- coding: utf-8 -*-) and mark strings as Unicode with the u prefix:
$ python -c $'# -*- coding: utf-8 -*-\nimport unicodedata\nprint(unicodedata.normalize("NFC", u"ääääää"))'
ääääää
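If the one-liner gets unwieldy, the same unicodedata.normalize() call can be wrapped in a small stdin-to-stdout filter. The following is a minimal sketch, not a polished tool: the script name nfconv.py and the helper normalize_text are made up for illustration, and it assumes the input is UTF-8 and small enough to read into memory at once.

```python
#!/usr/bin/env python3
"""Sketch of a filter: read UTF-8 from stdin, write it back in a chosen normalization form."""
import sys
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalize_text(text, form="NFC"):
    """Return text converted to the given normalization form (hypothetical helper)."""
    if form not in FORMS:
        raise ValueError("unknown normalization form: " + form)
    return unicodedata.normalize(form, text)


def main(form):
    # Work on raw bytes so the locale's default encoding does not interfere;
    # the input is assumed to be UTF-8.
    data = sys.stdin.buffer.read().decode("utf-8")
    sys.stdout.buffer.write(normalize_text(data, form.upper()).encode("utf-8"))


if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        print("usage: nfconv.py NFC|NFD|NFKC|NFKD < input > output", file=sys.stderr)
```

Saved as nfconv.py, it would be used like `python3 nfconv.py nfd < input.txt | hexdump -C`.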
Solution 3
For completeness, with perl:
$ perl -CSA -MUnicode::Normalize=NFD -e 'print NFD($_) for @ARGV' $'\ue1' | uconv -x name
\N{LATIN SMALL LETTER A}\N{COMBINING ACUTE ACCENT}
$ perl -CSA -MUnicode::Normalize=NFC -e 'print NFC($_) for @ARGV' $'a\u301' | uconv -x name
\N{LATIN SMALL LETTER A WITH ACUTE}
Solution 4
Check the input with hexdump:
echo -e "ä\c" |hexdump -C
00000000 61 cc 88 |a..|
00000003
Convert with iconv and check again with hexdump:
echo -e "ä\c" | iconv -f UTF-8-MAC -t UTF-8 |hexdump -C
00000000 c3 a4 |..|
00000002
printf '\xc3\xa4'
ä
Solution 5
There's a perl utility called Charlint available from
https://www.w3.org/International/charlint/
which does what you want. You'll also have to download a file from
ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
After the first run, you'll see Charlint complaining about incompatible entries in that file, so you'll have to delete those lines from UnicodeData.txt.
glts
Updated on September 18, 2022
Comments
-
glts over 1 year
In Unicode, some character combinations have more than one representation.
For example, the character ä can be represented as
- "ä", that is the codepoint U+00E4 (two bytes c3 a4 in UTF-8 encoding), or as
- "ä", that is the two codepoints U+0061 U+0308 (three bytes 61 cc 88 in UTF-8).
According to the Unicode standard, the two representations are equivalent but in different "normalization forms", see UAX #15: Unicode Normalization Forms.
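To make the two representations concrete, here is a small Python sketch (the variable names are illustrative only). It shows that the strings differ as code point and byte sequences but compare equal once both are brought to the same normalization form.

```python
import unicodedata

composed = "\u00e4"      # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # U+0061 followed by U+0308 COMBINING DIAERESIS


def hex_bytes(text):
    """Return the UTF-8 bytes of text as space-separated hex pairs."""
    return " ".join("%02x" % b for b in text.encode("utf-8"))


# The raw strings are different sequences of code points and bytes...
print(composed == decomposed)   # False
print(hex_bytes(composed))      # c3 a4
print(hex_bytes(decomposed))    # 61 cc 88

# ...but each one normalizes to the other.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
```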
The unix toolbox has all kinds of text transformation tools, sed, tr, iconv, Perl come to mind. How can I do quick and easy NF conversion on the command-line?
-
goldilocks over 10 years
Looks like there is a "Unicode::Normalization" module for perl which should do this kind of thing: search.cpan.org/~sadahiro/Unicode-Normalize-1.16/Normalize.pm
-
mirabilos over 7 years
@goldilocks if it had a CLI… I mean, I do perl -MUnicode::Normalization -e 'print NFC(… er what comes here now…
-
glts over 10 years
This works, thanks. You have to install a 30M dev library alongside it though. What's worse, I haven't been able to find proper documentation for uconv itself: where did you find any-nfd? It looks like development of this tool has been abandoned, last update was in 2005.
-
Gilles 'SO- stop being evil' over 10 years
@glts I found any-nfd by browsing through the list displayed by uconv -L.
-
antonone over 7 years
This only works on macOS. There is no 'utf-8-mac' on Linux, FreeBSD, etc. Also, decomposition using this encoding does not follow the specification (it does follow the macOS filesystem normalization algorithm though). More info: search.cpan.org/~tomita/Encode-UTF8Mac-0.04/lib/Encode/…
-
roaima over 6 years
@antonone to be fair though there was no OS specified in the question.
-
antonone over 6 years
@roaima Yes, that's why I assumed that the answer should work on all systems that are based on Unix/Linux. The answer above works only on macOS. If one's looking for a macOS-specific answer, then it'll work, in part. I just wanted to point that out, because the other day I lost some time wondering why I have no utf-8-mac on Linux and whether this is normal.
-
Peter Krauss over 5 years
On Ubuntu I used sudo apt install icu-devtools to run uconv -x any-nfc, but it does not solve the simplest problem: e.g. a bugText.txt file with "Iglésias, Bad-á, Good-á" converted by uconv -x any-nfc bugText.txt > goodText.txt stays the same text.
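When a conversion appears to leave the text unchanged, listing the code points of the input can show what it actually contains. One possible explanation, which is only an assumption here, is that the file was already in the target form. The sketch below uses a made-up sample string, not the commenter's real file, and the helper names describe and already_normalized are hypothetical.

```python
import unicodedata


def describe(text):
    """Return one 'U+XXXX NAME' entry per code point in text."""
    return ["U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>")) for ch in text]


def already_normalized(text, form="NFC"):
    """True if normalizing text to the given form would leave it unchanged."""
    return unicodedata.normalize(form, text) == text


# Hypothetical sample: first a decomposed "á" (a + U+0301), then a precomposed one (U+00E1).
sample = "Bad-a\u0301, Good-\u00e1"

for entry in describe(sample):
    print(entry)

print(already_normalized(sample))                                # False
print(already_normalized(unicodedata.normalize("NFC", sample)))  # True
```

If already_normalized() returns True for the real input, an NFC conversion is expected to leave it unchanged.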