How to remove non-ascii chars using sed

22,861

Solution 1

This does not seem to work reliably with sed. tr can do it instead, by deleting every byte that has the high bit set:

tr -d '\200-\377'

Or with the complement:

tr -cd '\000-\177'
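The "tr: Illegal byte sequence" error reported on macOS in the comments typically appears when tr decodes its input in a multibyte locale. A minimal sketch, assuming a POSIX shell: forcing the C locale makes tr work byte by byte. The function name strip_non_ascii is made up for illustration.

```shell
# Hypothetical helper: delete every byte outside the 7-bit ASCII range.
# LC_ALL=C makes tr treat the input as raw bytes instead of decoding it
# in the current (possibly UTF-8) locale.
strip_non_ascii() {
    LC_ALL=C tr -cd '\000-\177'
}

# Example: the two-byte UTF-8 sequence for "é" (\303\251) is dropped.
printf 'caf\303\251 au lait\n' | strip_non_ascii
```

The helper reads standard input, so a file can be cleaned with strip_non_ascii < infile > outfile.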

Solution 2

Have you tried the following?

cat /bin/mkdir | tr -cd "[:print:]"

I think it solves the problem. Note that [:print:] does not include newlines or tabs, so those are deleted as well.

If you are only interested in the text content, you can also use

cat /bin/mkdir | strings
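If the line structure of the input should survive, newline and tab can be added to the set that tr keeps. This is a small sketch rather than a definitive recipe; the function name keep_printable is hypothetical, and LC_ALL=C is assumed so that [:print:] means the printable ASCII characters only.

```shell
# Hypothetical helper: keep printable ASCII plus newline and tab,
# delete everything else (control characters and bytes >= 0x80).
keep_printable() {
    LC_ALL=C tr -cd '[:print:]\n\t'
}

# Example: the \001 control byte and the UTF-8 bytes \303\251 are
# removed, while the tab and both newlines are kept.
printf 'one\ttwo\001\n\303\251three\n' | keep_printable
```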

Solution 3

Do you know what encoding the file is currently using? If so, you can use iconv to convert it. It's a utility to convert from one character encoding to another. So if the original file is in UTF-8 and you want to convert to ASCII you can use the following:

iconv -f utf8 -t ascii <inputfile>

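Characters that have no ASCII equivalent make this conversion stop with an error; with GNU iconv, adding the -c option drops them instead. Running the file command on the input file might tell you its current encoding.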

Interestingly, there's a command called enca which will do its best to determine the character encoding being used if you know the language of the contents of the file.

This other question might be the answer.

Solution 4

The solutions offered here did not work for me. Maybe my problem was different, but I needed to strip the ANSI color codes and other escape sequences from otherwise pure ASCII text.

The following worked for me, however:

Stripping Escape Codes from ASCII Text

sed -E 's/\x1b\[[0-9]*;?[0-9]+m//g'

In context (BASH):

$ printf "\e[32;1mhello\e[0m\n"
hello

$ printf "\e[32;1mhello\e[0m\n" | cat -vet
^[[32;1mhello^[[0m$

$ printf "\e[32;1mhello\e[0m\n" | sed -E 's/\x1b\[[0-9]*;?[0-9]+m//g' | cat -vet
hello$
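The pattern above matches color sequences with one or two numeric parameters. As a hedged variation, the sketch below accepts any number of semicolon-separated parameters (for example the 256-color form 38;5;196) and the bare reset ESC[m. It builds the ESC byte with printf so that it does not depend on sed understanding \x1b. The function name strip_sgr is an assumption for illustration.

```shell
# Hypothetical helper: remove ANSI SGR (color/style) sequences of the
# form ESC [ <digits and semicolons> m from standard input.
strip_sgr() {
    # printf '\033' emits the literal ESC byte; \[ is a literal bracket.
    sed "s/$(printf '\033')\[[0-9;]*m//g"
}

# Example: a 256-color sequence, a reset with a parameter, and a bare reset.
printf '\033[38;5;196mhello\033[0m world\033[m\n' | strip_sgr
```

Other escape sequences, such as cursor movement, end in a letter other than m and are left untouched by this pattern.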
Author by

user87005

Updated on May 22, 2020

Comments

  • user87005
    user87005 almost 4 years

    I want to remove non-ASCII characters from a file. I have already tried several regexes:

    sed -e 's/[\d00-\d128]//g'  # not working
    
    cat /bin/mkdir | sed -e 's/[\x00-\x7F]//g' >/tmp/aa
    

    but the resulting file still contains non-ASCII characters:

    [root@asssdsada ~]$ hexdump /tmp/aa |more
              00 01 02 03 04 05 06 07 - 08 09 0A 0B 0C 0D 0E 0F  0123456789ABCDEF
    
    00000000  45 4C 46 B0 F0 73 38 C0 - C0 BC BC FF FF 61 61 61  ELF..s8......aaa
    00000010  A0 A0 50 E5 74 64 50 57 - 50 57 50 57 D4 D4 51 E5  ..P.tdPWPWPW..Q.
    00000020  74 64 6C 69 62 36 34 6C - 64 6C 69 6E 75 78 78 38  tdlib64ldlinuxx8
    00000030  36 36 34 73 6F 32 47 4E - 55 42 C8 C0 80 70 69 42  664so2GNUB...piB
    00000040  44 47 BA E3 92 43 45 D5 - EC 46 E4 DE D8 71 58 B9  DG...CE..F...qX.
    00000050  8D F1 EA D3 EF 4B 86 FC - A9 DA 79 ED 63 B5 51 92  .....K....y.c.Q.
    00000060  BA 6C FC D1 69 78 30 ED - 74 F1 73 95 CC 85 D2 46  .l..ix0.t.s....F
    00000070  A5 B4 6C 67 DA 4A E9 9A - 4B 58 77 A4 37 80 C0 4F  ..lg.J..KXw.7..O
    00000080  F3 E9 B2 77 65 97 74 F9 - A2 C0 F2 CC 4A 9C 58 A1  ...we.t.....J.X.
    
  • user87005
    user87005 about 11 years
    I am working on a Linux system with a very limited set of commands, and 'iconv' is not available.
  • chooban
    chooban about 11 years
    Looking at the output from hexdump, is this a binary file? (Guessing from ELF at the start) If so, what's the purpose of removing non-ascii characters? The binary will be corrupted.
  • user87005
    user87005 about 11 years
    it is just an example, friend.
  • chooban
    chooban about 11 years
    Ah, cool. I've added a link to a related question which might solve your problems.
  • user87005
    user87005 about 11 years
    The same works fine with this Perl command, but I need sed: cat /bin/mkdir | perl -ne 's/[^[:ascii:]]//g;print $_;'
  • user87005
    user87005 about 11 years
    enca is also not available on my machine. % enca enca: Command not found.
  • chooban
    chooban about 11 years
    Rather than cat'ing the file, have you tried just passing it to sed as an argument?
  • Thor
    Thor about 7 years
    @EladTabak: It should work. Can you produce an example where it does not work?
  • codeforester
    codeforester over 5 years
    On macOS High Sierra, I get this error: tr: Illegal byte sequence.
  • Thor
    Thor over 5 years
    @codeforester: I tested this with tr from GNU coreutils
  • ericcurtin
    ericcurtin over 4 years
    tr -cd '\001-\177' removes the NUL character too, which is worth doing: many tools such as grep treat input as binary if it contains NULs and report "Binary file (standard input) matches".
  • Uncle Iroh
    Uncle Iroh almost 3 years
    For me at least -- running this on my file removed all lowercase ascii characters
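For the sed-only constraint raised in the comments, the following is a sketch that assumes GNU sed, since \xHH escapes inside a bracket expression are a GNU extension. The attempt in the question, [\x00-\x7F], matches the ASCII range itself, so it deletes exactly the bytes that should be kept; the range below covers the high bytes instead. The function name strip_non_ascii_sed is hypothetical.

```shell
# Hypothetical helper; assumes GNU sed.
# LC_ALL=C makes sed compare raw bytes, so the range 0x80-0xFF covers
# every non-ASCII byte regardless of the file's encoding.
strip_non_ascii_sed() {
    LC_ALL=C sed 's/[\x80-\xFF]//g' "$@"
}

# Example: the UTF-8 bytes for "ï" (\303\257) are removed.
printf 'na\303\257ve\n' | strip_non_ascii_sed
```

With no arguments the helper filters standard input; file names given as arguments are passed on to sed.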