How to replace Unicode characters with ASCII
Solution 1
It is possible to use hex values in "sed".
echo "Ã" | hexdump -C
00000000 c3 83 0a |...|
00000003
OK, that character is the two-byte sequence "c3 83" (its UTF-8 encoding). Let's replace it with the single byte "A":
echo "Ã" | sed 's/\xc3\x83/A/g'
A
Explanation: \x tells "sed" that a two-digit hex code follows. This escape is a GNU sed extension, so it may not work with other sed implementations.
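The same approach extends to the © character from the question, whose UTF-8 encoding is the two bytes "c2 a9". The following is a minimal sketch, assuming GNU sed; the function name replace_bytes is made up for illustration, and LC_ALL=C is set so that sed treats its input as raw bytes.

```shell
# Hypothetical helper (not from the original answer): reads stdin and
# replaces the UTF-8 bytes of Ã (c3 83) with "A" and of © (c2 a9) with a space.
# Assumes GNU sed, since \xHH escapes are a GNU extension.
replace_bytes() {
    LC_ALL=C sed -e 's/\xc3\x83/A/g' -e 's/\xc2\xa9/ /g'
}

# printf with octal escapes emits the exact bytes, independent of the
# encoding used by the terminal or the script file.
printf '\303\203 and \302\251 2022\n' | replace_bytes
```

To edit a file in place, as in the question, the same two expressions can be passed to sed -i followed by the file name.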
Solution 2
You can use iconv:
iconv -f utf-8 -t ascii//translit
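As a rough usage sketch: iconv reads standard input when no file is given, so it can sit in a pipeline. The wrapper name to_ascii is hypothetical. The exact replacement characters depend on the iconv implementation and the current locale (some setups emit "?" where no transliteration is known), so no expected output is shown.

```shell
# Hypothetical wrapper around the command above; reads UTF-8 on stdin
# and writes an ASCII approximation to stdout.
to_ascii() {
    iconv -f utf-8 -t ascii//TRANSLIT
}

# The input bytes are "café À la carte" in UTF-8, written as octal escapes.
# iconv may exit non-zero when it had to substitute a character, so report that.
printf 'caf\303\251 \303\200 la carte\n' | to_ascii \
    || echo "iconv reported characters it could not convert cleanly" >&2
```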
Solution 3
There is also uconv, from ICU.
Examples:
- uconv -x "::NFD; [:Nonspacing Mark:] > ; ::NFC;" : to remove accents
- uconv -x "::Latin; ::Latin-ASCII;" : for a Latin/ASCII transliteration
- uconv -x "::Latin; ::Latin-ASCII; ([^\x00-\x7F]) > ;" : for a Latin/ASCII transliteration and removal of remaining code points > 0x7F
- ...
For instance:
echo "À l'école ☠" | uconv -x "::Latin; ::Latin-ASCII; ([^\x00-\x7F]) > ;"
gives: A l'ecole
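The accent-removal rule from the list can be wrapped in a small script. This is only a sketch: the function name strip_accents is hypothetical, the -f and -t options are added here to pin the input and output encodings to UTF-8 rather than relying on the locale, and the script checks that uconv is installed before calling it.

```shell
# Hypothetical wrapper: decompose (NFD), drop combining marks, recompose (NFC).
# -f/-t fix the encodings so the result does not depend on the locale.
strip_accents() {
    uconv -f utf-8 -t utf-8 -x '::NFD; [:Nonspacing Mark:] > ; ::NFC;'
}

if command -v uconv >/dev/null 2>&1; then
    # The input bytes are "À l'école" in UTF-8, written as octal escapes.
    printf "\303\200 l'\303\251cole\n" | strip_accents
else
    echo "uconv not found; it ships with the ICU command-line tools" >&2
fi
```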
Author by
Sandeep Johal
Updated on June 18, 2022
Comments
- Sandeep Johal almost 2 years
I have the following command to replace Unicode characters with ASCII ones.
sed -i 's/Ã/A/g'
The problem is that Ã isn't recognized by the sed command in my Unix environment, so I assume I need to replace it with its hexadecimal value. What would the syntax look like if I were to use C3 instead?
I'm using this command as a template for other characters I'd like to replace with blank spaces, such as:
sed -i 's/©/ /g'