Replace accented or special characters using sed or tr unix command using Unicode Code or Hex


Your script

…works fine for me. Every substitution is performed as expected, except for one:

-e "s/\'//g" \

should be

-e "s/'//g" \

(There's no need to escape the single quote, your expression is between double quotes.)
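As a quick check of the point above, here is a minimal sketch (the helper name `strip_single_quotes` is only illustrative and is not part of the original script). It assumes a POSIX shell and any common sed. In GNU sed, `\'` is documented as an anchor for the end of the pattern space, which would explain why the escaped form removes nothing.

```shell
# Remove every single quote from stdin.
# Inside a double-quoted shell string, ' is an ordinary character,
# so it is written without a backslash.
strip_single_quotes() {
    sed -e "s/'//g"
}

printf '%s\n' "it's a 'test'" | strip_single_quotes
```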

Applied to a file containing

"'$%&@^`|~¡¨´¢£§¬°·¹²³¿ªàáâãäåæ

it outputs:

  S E a   i  c   o.123 aaaaaaaae

(Without the spaces. I added them to make it easier to compare the original pattern with its substitution.)

Hex codes

To replace a character using its hexadecimal code, use the following syntax:

echo ¢ | sed 's/\xC2\xA2/cent/g'

Why does this work? A hexadecimal byte value XX is passed to sed with the \xXX syntax (see info sed). For your ¢ character, the third column of the table on the webpage you link to gives 0xc2 0xa2.
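The byte values can also be checked locally instead of being looked up in a table. The following is a minimal sketch, assuming POSIX `printf`, `od` and `tr`; `hex_bytes` and `replace_cent` are illustrative names, not part of the original script. Since the `\xXX` escape is an extension that not every sed accepts, the second helper builds the bytes with `printf` octal escapes, which may help on a sed that rejects `\x`.

```shell
# Print the bytes read from stdin as lowercase hex, without separators.
hex_bytes() {
    od -An -tx1 | tr -d ' \n'
}

# Replace the UTF-8 encoding of the cent sign (bytes 0xC2 0xA2) with "cent".
# The bytes come from printf octal escapes (\302 = 0xC2, \242 = 0xA2),
# so the sed expression itself needs no \xXX support.
# LC_ALL=C makes sed treat its input as plain bytes.
replace_cent() {
    cent=$(printf '\302\242')
    LC_ALL=C sed "s/${cent}/cent/g"
}

printf '\302\242' | hex_bytes; echo      # show the two bytes of the cent sign
printf '5 \302\242\n' | replace_cent     # substitute them in a sample line
```

Working on raw bytes this way only matches input that really is UTF-8 encoded; the same character stored in another encoding has different bytes.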

Encoding

Since you are trying to replace UTF-8 encoded characters, I assume your file uses UTF-8 encoding. If it does not, a quick solution would be to convert it (or a copy of it) to UTF-8 (e.g. with your favorite text editor).
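The conversion can also be done from the command line. This is a sketch only, assuming `iconv` is installed and that the source file is ISO-8859-1, which is a guess and not something stated in the question. Encoding names differ between systems (on many, `iconv -l` lists them), and `to_utf8` is an illustrative helper name.

```shell
# Convert stdin from the encoding named in $1 to UTF-8.
to_utf8() {
    iconv -f "$1" -t UTF-8
}

# In ISO-8859-1 the cent sign is the single byte 0xA2 (octal 242).
# After conversion, od shows the bytes of the UTF-8 form.
printf '\242' | to_utf8 ISO-8859-1 | od -An -tx1
```

Running the conversion on a copy first, as suggested above, keeps the original file intact if the guessed source encoding turns out to be wrong.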


Author by

user2727262

Updated on June 04, 2022

Comments

  • user2727262
    user2727262 almost 2 years

    I wonder if I can use character set found in http://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=0x&unicodeinhtml=hex to replace accented or special characters using sed or tr.

    I have a script that uses the sed command. Sometimes it does not work :(

    it goes like this:

    sed -e "s/\"//g" \
                    -e "s/\'//g" \
                    -e "s/[$]/S/g" \
                    -e "s/%//g" \
                    -e "s/&/E/g" \
                    -e "s/@/a/g" \
                    -e "s/\^//g" \
                    -e "s/\`//g" \
                    -e "s/|//g" \
                    -e "s/~//g" \
                    -e "s/¡/i/g" \
                    -e "s/¨//g" \
                    -e "s/\´//g" \
                    -e "s/¢/c/g" \
                    -e "s/£//g" \
                    -e "s/§//g" \
                    -e "s/¬//g" \
                    -e "s/°/o/g" \
                    -e "s/·/./g" \
                    -e "s/¹/1/g" \
                    -e "s/²/2/g" \
                    -e "s/³/3/g" \
                    -e "s/¿//g" \
                    -e "s/ª/a/g" \
                    -e "s/à/a/g" \
                    -e "s/á/a/g" \
                    -e "s/â/a/g" \
                    -e "s/ã/a/g" \
                    -e "s/ä/a/g" \
                    -e "s/å/a/g" \
                    -e "s/æ/ae/g" \
    

    So I am thinking that if I used hex or octal Unicode codes in sed, it would work. But I do not know how...

    e.g. echo ¢ | sed 's/\x{00A2}/cent/g'
    

    I appreciate your help.

    • Qeole
      Qeole almost 10 years
      "Sometimes it does not work" -> If it only fails in some cases, could you detail which ones?
    • Qeole
      Qeole almost 10 years
      For second question: try echo ¢ | sed 's/\xC2\xA2/cent/g', works for me.
    • user2727262
      user2727262 almost 10 years
      Thanks @Qeole, but it did not work for me. I am using AIX, by the way. My sed command does not work if the file I am processing was not saved as UTF-8 without a BOM. At least that is what I have observed.
    • Qeole
      Qeole almost 10 years
      That's something you should definitely have specified. Can't you just re-encode your file (or a copy of it) into UTF-8 first?
  • mcepl
    mcepl about 4 years
    And if you don’t know how to get those hexadecimal values for Unicode characters, then GNU echo can help: echo -ne '\u00A0'|xxd.