Remove lines that contain non-English (non-ASCII) characters from a file


Solution 1

Perl supports the POSIX-style [:ascii:] character class, so you can print only the lines that consist entirely of ASCII characters:

perl -nle 'print if m{^[[:ascii:]]+$}' inputfile

Solution 2

With GNU grep, which supports Perl-compatible regular expressions through the -P option, you can use:

grep -P '^[[:ascii:]]+$' file
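One detail worth knowing is that the + quantifier requires at least one character, so empty lines are removed too. The sketch below compares it with a * variant that keeps them. The sample file is hypothetical, and a GNU grep built with -P support is assumed.

```shell
# Hypothetical sample containing an empty line and one non-ASCII line.
sample=$(mktemp)
printf '%s\n' 'first line' '' 'naïve' 'last line' > "$sample"

# With +, the empty line is dropped along with the non-ASCII one.
strict=$(grep -P '^[[:ascii:]]+$' "$sample")

# With *, an empty line also matches and is kept.
lenient=$(grep -P '^[[:ascii:]]*$' "$sample")

printf 'strict:\n%s\nlenient:\n%s\n' "$strict" "$lenient"
rm -f "$sample"
```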

Solution 3

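You can use egrep -v to return only the lines that do not match a pattern, and use something like [^ a-zA-Z0-9.,;:'"?!-] as the pattern (include more punctuation as needed). The hyphen has to come last inside the brackets so that it is taken literally; written as :-' it would be read as an invalid range.

Hm, thinking about it, a double negation (-v and the inverted character class) is probably not that good. Another way might be to drop -v and match whole lines with ^[ a-zA-Z0-9.,;:'"?!-]*$.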
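Quoting a pattern that contains both ' and " is awkward in the shell, so here is a minimal sketch of the whole-line whitelist approach. The hyphen is placed last in the bracket expression so that it is literal. The sample file is hypothetical, and LC_ALL=C is an added assumption so that a-z means only the 26 ASCII letters in any locale.

```shell
# Allowed characters: space, letters, digits, and some punctuation.
# '\'' closes the single-quoted string, adds a literal ', and reopens it.
pattern='^[ a-zA-Z0-9.,;:'\''"?!-]*$'

sample=$(mktemp)
printf '%s\n' 'Well-known: "yes", he said.' 'Ça va?' "It's fine!" > "$sample"

# Keep only the lines made up entirely of allowed characters.
kept=$(LC_ALL=C grep -E "$pattern" "$sample")

printf '%s\n' "$kept"
rm -f "$sample"
```

Any character missing from the list, such as parentheses or a tab, causes its line to be dropped, which is why the list usually needs extending.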

You can also just filter for printable ASCII, which is the range from space to tilde:

egrep -v "[^ -~]" foo.txt
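The space-to-tilde range does not include the tab character, so lines containing tabs are removed as well. If that is not wanted, a tab can be added to the allowed set. This is a sketch with a hypothetical sample file; LC_ALL=C is an added assumption to make the range byte-based regardless of the locale.

```shell
sample=$(mktemp)
# The first line contains a tab between the two fields.
printf 'name\tvalue\n' > "$sample"
printf '%s\n' 'résumé' 'ok ~ line' >> "$sample"

# A literal tab character for use inside the bracket expression.
tab=$(printf '\t')

# Original pattern: the tab-separated line is removed along with the accented one.
printable_only=$(LC_ALL=C grep -v '[^ -~]' "$sample")

# Tab added to the allowed set: only the non-ASCII line is removed.
with_tabs=$(LC_ALL=C grep -v "[^ -~$tab]" "$sample")

printf 'printable only:\n%s\nwith tabs:\n%s\n' "$printable_only" "$with_tabs"
rm -f "$sample"
```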
Author by

Sudar

I am a developer from Chennai, India, mostly interested in WordPress, Android and Arduino programming. I write about my projects on my blog. You can also check out the code that I have released as open source on my GitHub account. You can follow me on Twitter.

Updated on June 26, 2022

Comments

  • Sudar
    Sudar almost 2 years

    I have a text file with characters from different languages (Chinese, Latin, etc.).

    I want to remove all lines that contain these non-English characters. I want to keep all English letters (a-z), numbers (0-9) and all punctuation.

    How can I do it using Unix tools like awk or sed?

  • Sudar
    Sudar almost 12 years
    I need all the punctuation. So is it possible to just filter out all non-ASCII characters, instead of specifying the list of allowed characters?
  • Lily
    Lily almost 9 years
    It's perhaps tangential to your main point, but the range [a-z] includes non-ASCII characters in some locales. For example, on my system, echo é | grep '[a-z]' does find the é.