How can I remove non-ASCII characters but leave periods and spaces?

Solution 1

You can filter all characters from the string that are not printable using string.printable, like this:

>>> s = "some\x00string. with\x15 funny characters"
>>> import string
>>> printable = set(string.printable)
>>> filter(lambda x: x in printable, s)
'somestring. with funny characters'

string.printable on my machine contains:

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c

EDIT: On Python 3, filter returns an iterator rather than a string. To get a string back, join the result:

''.join(filter(lambda x: x in printable, s))
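
For reference, here is a minimal Python 3 sketch of the same idea wrapped in a reusable function. The name keep_printable is hypothetical (it is not part of the original answer), and a generator expression stands in for filter and lambda:

```python
import string

# Build the lookup set once; membership tests on a set are fast.
PRINTABLE = set(string.printable)

def keep_printable(text):
    """Return text with every character outside string.printable removed."""
    return "".join(ch for ch in text if ch in PRINTABLE)

print(keep_printable("some\x00string. with\x15 funny characters"))
# somestring. with funny characters
```

Note that string.printable includes tabs and newlines, so those survive the filtering along with spaces and periods.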

Solution 2

An easy way to convert to a different codec is to use encode() or decode(). In your case, you want to convert to ASCII and ignore all characters that are not supported. For example, the Swedish letter å is not an ASCII character:

    >>> s = u'Good bye in Swedish is Hej d\xe5'
    >>> s = s.encode('ascii', errors='ignore')
    >>> print s
    Good bye in Swedish is Hej d

Edit:

Python3: str -> bytes -> str

>>> "Hej då".encode("ascii", errors="ignore").decode()
'Hej d'

Python2: unicode -> str -> unicode

>>> u"hej då".encode("ascii", errors="ignore").decode()
u'hej d'

Python2: str -> unicode -> str (decode and encode in reverse order)

>>> "hej d\xe5".decode("ascii", errors="ignore").encode()
'hej d'
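
In Python 3 the same round trip can be wrapped in a small helper. The sketch below assumes the input is either a str or UTF-8 encoded bytes (for example, data read from a file opened in binary mode). The function name to_ascii and the utf-8 default are illustrative assumptions, not part of the original answer:

```python
def to_ascii(data, encoding="utf-8"):
    """Drop every non-ASCII character from a str or from encoded bytes."""
    if isinstance(data, bytes):
        # Decode first so multi-byte sequences are removed as whole characters.
        data = data.decode(encoding, errors="ignore")
    # Encoding to ASCII with errors="ignore" silently drops unsupported characters.
    return data.encode("ascii", errors="ignore").decode("ascii")

print(to_ascii("Hej då"))                  # Hej d
print(to_ascii("Hej då".encode("utf-8")))  # Hej d
```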

Solution 3

According to @artfulrobot, this should be faster than filter and lambda:

import re
re.sub(r'[^\x00-\x7f]', r'', your_non_ascii_string)

For more examples, see the question "Replace non-ASCII characters with a single space".
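
If the substitution runs many times, the pattern can be compiled once and reused. This is a minimal sketch; the names NON_ASCII and strip_non_ascii, and the optional replacement parameter, are hypothetical:

```python
import re

# Matches any single character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7f]")

def strip_non_ascii(text, replacement=""):
    """Replace each non-ASCII character in text with replacement."""
    return NON_ASCII.sub(replacement, text)

print(repr(strip_non_ascii("Hej då")))       # 'Hej d'
print(repr(strip_non_ascii("Hej då", " ")))  # 'Hej d '
```

This pattern keeps every ASCII character, including control characters such as \x00, so it is not equivalent to filtering on string.printable.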

Solution 4

You may use the following code to remove non-ASCII characters:

import re
# Avoid naming the variable "str", which would shadow the built-in type.
text = "123456790 ABC#%? .(朱惠英)"
result = re.sub(r'[^\x00-\x7f]', r'', text)
print(result)

This will return

123456790 ABC#%? .()

Solution 5

Your question is ambiguous; the first two sentences taken together imply that you believe that space and "period" are non-ASCII characters. This is incorrect. All chars such that ord(char) <= 127 are ASCII characters. For example, your function excludes these characters !"#$%&\'()*+,-./ but includes several others e.g. []{}.

Please step back, think a bit, and edit your question to tell us what you are trying to do, without mentioning the word ASCII, and why you think that chars such that ord(char) >= 128 are ignorable. Also: which version of Python? What is the encoding of your input data?

Please note that your code reads the whole input file as a single string, and your comment ("great solution") to another answer implies that you don't care about newlines in your data. If your file contains two lines like this:

this is line 1
this is line 2

the result would be 'this is line 1this is line 2' ... is that what you really want?

A greater solution would include:

  1. a better name for the filter function than onlyascii
  2. recognition that a filter function merely needs to return a truthy value if the argument is to be retained:

    def filter_func(char):
        return char == '\n' or 32 <= ord(char) <= 126
    # and later:
    filtered_data = filter(filter_func, data).lower()
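
The snippet above relies on Python 2, where filter applied to a str returns a str. A minimal Python 3 sketch of the same approach follows; the names is_wanted and clean are hypothetical:

```python
def is_wanted(char):
    """True for newlines and for printable ASCII (space through tilde)."""
    return char == "\n" or 32 <= ord(char) <= 126

def clean(data):
    # filter() returns an iterator in Python 3, so join it back into a str.
    return "".join(filter(is_wanted, data)).lower()

print(repr(clean("This is line 1.\nThis is liné 2.\x07")))
# 'this is line 1.\nthis is lin 2.'
```

Spaces and periods fall inside the 32 to 126 range, so they are kept, and the explicit newline test keeps the lines from running together.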
    
Author by

Admin

Updated on April 18, 2021

Comments

  • Admin
    Admin about 3 years

    I'm working with a .txt file. I want a string of the text from the file with no non-ASCII characters. However, I want to leave spaces and periods. At present, I'm stripping those too. Here's the code:

    def onlyascii(char):
        if ord(char) < 48 or ord(char) > 127: return ''
        else: return char
    
    def get_my_string(file_path):
        f=open(file_path,'r')
        data=f.read()
        f.close()
        filtered_data=filter(onlyascii, data)
        filtered_data = filtered_data.lower()
        return filtered_data
    

    How should I modify onlyascii() to leave spaces and periods? I imagine it's not too complicated but I can't figure it out.

  • joaquin
    joaquin over 12 years
    chr(127) in string.printable ?
  • joaquin
    joaquin over 12 years
    what's up with those printable chars that are below ordinal 48 ?
  • jterrace
    jterrace over 12 years
    chr(127) in string.printable == False
  • jterrace
    jterrace over 12 years
    Do you mean 0b and 0c? They are part of string.whitespace.
  • joaquin
    joaquin over 12 years
    yes, and from the OP: if ord(char) < 48 or ord(char) > 127. About my second comment, I am referring to '*', '(', and other printable characters that are eliminated by the OP...
  • jterrace
    jterrace over 12 years
    Yeah, I was extrapolating that the OP probably meant all printable characters, rather than what was actually said, but might not be the case.
  • jterrace
    jterrace over 12 years
    Slightly simpler: lambda x: 32 <= ord(x) <= 126
  • jterrace
    jterrace over 12 years
    that's not the same as string.printable because it leaves out string.whitespace, although that might be what the OP wants, depends on things like \n and \t.
  • joaquin
    joaquin over 12 years
    @jterrace right, includes space (ord 32) but no returns and tabs
  • jterrace
    jterrace over 12 years
    yeah, just commenting on "this is equivalent to string.printable", but not true
  • joaquin
    joaquin over 12 years
    I edited the answer, thanks! the OP question is misleading if you do not read it carefully.
  • Admin
    Admin over 12 years
    Thanks! I understand now. Sorry for the confusion - jterrace correctly interpreted my question.
  • rickcnagy
    rickcnagy over 10 years
    this is also great for just filtering to digits - filter(lambda x: x in string.digits, s)
  • Xodarap777
    Xodarap777 over 10 years
    I get UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 27
  • Xodarap777
    Xodarap777 over 10 years
    This is incredibly slow in a large file. Any suggestions?
  • Xodarap777
    Xodarap777 over 10 years
    This answer is very helpful to those of us coming in to ask something similar to the OP, and your proposed answer is helpfully pythonic. I do, however, find it strange that there isn't a more efficient solution to the problem as you interpreted it (which I often run into) - character by character, this takes a very long time in a very large file.
  • jterrace
    jterrace over 10 years
    @Xodarap777 create a set(string.printable) and re-use it for the filtering. Also don't filter the whole file at once - do it in chunks of 8K-512K
  • cjbarth
    cjbarth over 9 years
    The only problem with using filter is that it returns an iterable. If you need a string back (as I did because I needed this when doing list compression) then do this: ''.join(filter(lambda x: x in string.printable, s)).
  • undershock
    undershock over 9 years
    @cjbarth - comment is python 3 specific, but very useful. Thanks!
  • Ben Liyanage
    Ben Liyanage about 9 years
    I got that error when I put the actual unicode character in the string via copy paste. When you specify a string as u'thestring' encode works correctly.
  • Noam Manos
    Noam Manos over 8 years
    Why not use regular expression: re.sub(r'[^\x00-\x7f]',r'', your-non-ascii-string) . See this thread stackoverflow.com/a/20079244/658497
  • gaborous
    gaborous over 8 years
    This is the most compatible way of doing the OP's task, I tested in from Python 2.6 to Python 3.5.
  • gaborous
    gaborous over 8 years
    Works only on Py3, but it's elegant.
  • artfulrobot
    artfulrobot about 8 years
    @NoamManos this was 4-5 times faster for me than the join...filter...lambda solution, thanks.
  • ShadowRanger
    ShadowRanger about 8 years
    I suspect changing lambda x: x in printable to printable.__contains__ would make it run faster; the lambda means more Python level code execution, while directly passing the built-in membership test method removes per character byte code execution.
  • Jonny
    Jonny almost 8 years
    Pylint complains about the use of filter in the above code. Given that list comprehensions seem to be preferred, would using ''.join(x for x in s if x in printable) be a) equivalent, and b) any better?
  • Jonny
    Jonny almost 8 years
    Edit: I realise the above is a generator expression, but does the same apply?
  • jterrace
    jterrace almost 8 years
    @Jonny - it's most likely equivalent, but I'd have to profile it to know for sure
  • Spc_555
    Spc_555 about 7 years
    For those who are getting the same error as @Xodarap777 : you should first .decode() the string, and only after that encode. For example s.decode('utf-8').encode('ascii', errors='ignore')
  • Ctrl-C
    Ctrl-C almost 6 years
    @Jonny, The result is the same, time differs (you need to compare if it happens to be a bottleneck). This is easier for an eye - the less the diversity of tools, the faster is reading comprehension. You may want to add an [Enter] before if and indent the second line so if starts just after ( from the first line.
  • Danilo Souza Morães
    Danilo Souza Morães almost 6 years
    This solution answers OP's stated question, but beware that it won't remove non printable characters that are included in ASCII which I think is what OP intended to ask.
  • Amon
    Amon over 4 years
    Am I the only one who this doesn't work for? Why wouldn't those characters be included in the printable list? Like 0 or x, for example?
  • jterrace
    jterrace over 4 years
    @CharlesSmith - those are escape sequences
  • SherylHohman
    SherylHohman about 4 years
    This would not allow for standard ASCII symbols, such as bullet points, degrees symbol, copyright symbol, Yen symbol, etc. Also, your first example includes non-printable symbols, such as BELL, which is undesirable.
  • Brajesh
    Brajesh almost 4 years
    When assigning the value to a variable it works fine, whereas when reading from a file the filtering has no effect. I don't know why. Any ideas?