How can I remove non-ASCII characters but leave periods and spaces?
Solution 1
You can filter all characters from the string that are not printable using string.printable, like this:
>>> s = "some\x00string. with\x15 funny characters"
>>> import string
>>> printable = set(string.printable)
>>> filter(lambda x: x in printable, s)
'somestring. with funny characters'
string.printable on my machine contains:
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c
EDIT: On Python 3, filter returns an iterator rather than a string. To get a string back, join the result:
''.join(filter(lambda x: x in printable, s))
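A Python 3 sketch of the same idea, wrapped in a function. The name `keep_printable` is made up for illustration, and passing the set's `__contains__` method instead of a lambda follows a suggestion from the comments; whether it is actually faster on your data would need profiling.

```python
import string

# Build the set once so each membership test is a constant-time lookup.
PRINTABLE = frozenset(string.printable)

def keep_printable(text):
    """Return text with every character outside string.printable removed."""
    return ''.join(filter(PRINTABLE.__contains__, text))

print(keep_printable("some\x00string. with\x15 funny characters"))
# somestring. with funny characters
```

Note that tabs and newlines survive this filter because they are part of string.printable.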
Solution 2
An easy way to change to a different codec is to use encode() or decode(). In your case, you want to convert to ASCII and ignore all symbols that are not supported. For example, the Swedish letter å is not an ASCII character:
>>> s = u'Good bye in Swedish is Hej d\xe5'
>>> s = s.encode('ascii',errors='ignore')
>>> print s
Good bye in Swedish is Hej d
Edit:
Python3: str -> bytes -> str
>>> "Hej då".encode("ascii", errors="ignore").decode()
'Hej d'
Python2: unicode -> str -> unicode
>>> u"hej då".encode("ascii", errors="ignore").decode()
u'hej d'
Python2: str -> unicode -> str (decode and encode in reverse order)
>>> "hej d\xe5".decode("ascii", errors="ignore").encode()
'hej d'
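For Python 3 input that may arrive as bytes (for example, from a file opened in binary mode), the round trip could be sketched as follows. The helper name `to_ascii` and the assumption that the bytes are UTF-8 are illustrative, not part of the answer above.

```python
def to_ascii(data, encoding="utf-8"):
    """Drop every non-ASCII character from str or bytes input and return a str."""
    if isinstance(data, bytes):
        # Decode first, so multi-byte characters are removed as whole characters.
        data = data.decode(encoding, errors="ignore")
    return data.encode("ascii", errors="ignore").decode("ascii")

print(to_ascii("Hej då"))                  # Hej d
print(to_ascii("Hej då".encode("utf-8")))  # Hej d
```

Periods, spaces, and all other ASCII characters (including control characters) pass through unchanged.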
Solution 3
According to @artfulrobot, this should be faster than filter and lambda:
import re
re.sub(r'[^\x00-\x7f]', r'', your_non_ascii_string)
For more examples, see the related question "Replace non-ASCII characters with a single space".
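A runnable sketch of this approach, assuming Python 3. The names `NON_ASCII` and `strip_non_ascii` are illustrative; precompiling the pattern is optional and mostly a convenience when the function is called many times.

```python
import re

# Matches any single character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r'[^\x00-\x7f]')

def strip_non_ascii(text, replacement=''):
    """Replace each non-ASCII character in text with replacement."""
    return NON_ASCII.sub(replacement, text)

print(strip_non_ascii("123456790 ABC#%? .(朱惠英)"))  # 123456790 ABC#%? .()
print(strip_non_ascii("Hej då!", replacement=' '))    # Hej d !
```

ASCII control characters such as \x00 are inside the kept range, so this does not remove them.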
Solution 4
You may use the following code to remove non-ASCII characters, such as non-English letters:
import re
text = "123456790 ABC#%? .(朱惠英)"  # avoid naming the variable str, which shadows the built-in
result = re.sub(r'[^\x00-\x7f]', r'', text)
print(result)
This will return
123456790 ABC#%? .()
Solution 5
Your question is ambiguous; the first two sentences taken together imply that you believe that space and "period" are non-ASCII characters. This is incorrect. All chars such that ord(char) <= 127 are ASCII characters. For example, your function excludes these characters !"#$%&\'()*+,-./ but includes several others e.g. []{}.
Please step back, think a bit, and edit your question to tell us what you are trying to do, without mentioning the word ASCII, and why you think that chars such that ord(char) >= 128 are ignorable. Also: which version of Python? What is the encoding of your input data?
Please note that your code reads the whole input file as a single string, and your comment ("great solution") to another answer implies that you don't care about newlines in your data. If your file contains two lines like this:
this is line 1
this is line 2
the result would be 'this is line 1this is line 2'
... is that what you really want?
A greater solution would include:
- a better name for the filter function than onlyascii
- recognition that a filter function merely needs to return a truthy value if the argument is to be retained:
def filter_func(char):
    return char == '\n' or 32 <= ord(char) <= 126
# and later:
filtered_data = filter(filter_func, data).lower()
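Under Python 3, where filter returns an iterator rather than a string, the suggestion above might look like the sketch below. The names `is_wanted` and `clean` are hypothetical; the character range is the one given in the answer.

```python
def is_wanted(char):
    """Keep newlines and the printable ASCII range (space through tilde)."""
    return char == '\n' or 32 <= ord(char) <= 126

def clean(data):
    """Filter data character by character, then lowercase the result."""
    return ''.join(filter(is_wanted, data)).lower()

print(repr(clean("This is line 1\nThis is line 2\t\x07.")))
# 'this is line 1\nthis is line 2.'
```

Spaces and periods fall inside the 32 to 126 range, so they are kept, and the line break between the two lines is preserved.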
Admin
Updated on April 18, 2021
Comments
-
Admin about 3 years
I'm working with a .txt file. I want a string of the text from the file with no non-ASCII characters. However, I want to leave spaces and periods. At present, I'm stripping those too. Here's the code:
def onlyascii(char):
    if ord(char) < 48 or ord(char) > 127:
        return ''
    else:
        return char

def get_my_string(file_path):
    f=open(file_path,'r')
    data=f.read()
    f.close()
    filtered_data=filter(onlyascii, data)
    filtered_data = filtered_data.lower()
    return filtered_data
How should I modify onlyascii() to leave spaces and periods? I imagine it's not too complicated but I can't figure it out.
-
joaquin over 12 years
chr(127) in string.printable?
-
joaquin over 12 years
What's up with those printable chars that are below ordinal 48?
-
jterrace over 12 years
chr(127) in string.printable == False
-
jterrace over 12 years
Do you mean 0b and 0c? They are part of string.whitespace.
-
joaquin over 12 years
Yes, and from the OP: if ord(char) < 48 or ord(char) > 127. About my second comment, I am referring to '*', '(', and other printable characters which are eliminated by the OP...
-
jterrace over 12 years
Yeah, I was extrapolating that the OP probably meant all printable characters, rather than what was actually said, but that might not be the case.
-
jterrace over 12 years
Slightly simpler: lambda x: 32 <= ord(x) <= 126
-
jterrace over 12 years
That's not the same as string.printable because it leaves out string.whitespace, although that might be what the OP wants; it depends on things like \n and \t.
-
joaquin over 12 years
@jterrace right, it includes space (ord 32) but no returns and tabs
-
jterrace over 12 years
Yeah, just commenting on "this is equivalent to string.printable", which is not true
-
joaquin over 12 years
I edited the answer, thanks! The OP's question is misleading if you do not read it carefully.
-
Admin over 12 years
Thanks! I understand now. Sorry for the confusion - jterrace correctly interpreted my question.
-
rickcnagy over 10 years
This is also great for just filtering to digits - filter(lambda x: x in string.digits, s)
-
Xodarap777 over 10 years
I get UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 27
-
Xodarap777 over 10 years
This is incredibly slow in a large file. Any suggestions?
-
Xodarap777 over 10 years
This answer is very helpful to those of us coming in to ask something similar to the OP, and your proposed answer is helpfully pythonic. I do, however, find it strange that there isn't a more efficient solution to the problem as you interpreted it (which I often run into) - character by character, this takes a very long time in a very large file.
-
jterrace over 10 years
@Xodarap777 create a set(string.printable) and re-use it for the filtering. Also don't filter the whole file at once - do it in chunks of 8K-512K
-
cjbarth over 9 years
The only problem with using filter is that it returns an iterable. If you need a string back (as I did because I needed this when doing list compression) then do this: ''.join(filter(lambda x: x in string.printable, s)).
-
undershock over 9 years
@cjbarth - comment is python 3 specific, but very useful. Thanks!
-
Ben Liyanage about 9 years
I got that error when I put the actual unicode character in the string via copy paste. When you specify a string as u'thestring' encode works correctly.
-
Noam Manos over 8 years
Why not use a regular expression: re.sub(r'[^\x00-\x7f]',r'', your-non-ascii-string). See this thread: stackoverflow.com/a/20079244/658497
-
gaborous over 8 years
This is the most compatible way of doing the OP's task; I tested it from Python 2.6 to Python 3.5.
-
gaborous over 8 years
Works only on Py3, but it's elegant.
-
artfulrobot about 8 years
@NoamManos this was 4-5 times faster for me than the join...filter...lambda solution, thanks.
-
ShadowRanger about 8 years
I suspect changing lambda x: x in printable to printable.__contains__ would make it run faster; the lambda means more Python level code execution, while directly passing the built-in membership test method removes per character byte code execution.
-
Jonny almost 8 years
PyLint complains on the use of filter when using the above code. Given that list comprehensions seem to be preferred, would using ''.join(x for x in s if x in printable) be a) equivalent, and b) any better?
-
Jonny almost 8 years
Edit: I realise the above is a generator expression, but does the same apply?
-
jterrace almost 8 years
@Jonny - it's most likely equivalent, but I'd have to profile it to know for sure
-
Spc_555 about 7 years
For those who are getting the same error as @Xodarap777: you should first .decode() the string, and only after that encode. For example: s.decode('utf-8').encode('ascii', errors='ignore')
-
Ctrl-C almost 6 years
@Jonny, the result is the same, time differs (you need to compare if it happens to be a bottleneck). This is easier on the eye - the less the diversity of tools, the faster the reading comprehension. You may want to add an [Enter] before if and indent the second line so if starts just after ( from the first line.
-
Danilo Souza Morães almost 6 years
This solution answers OP's stated question, but beware that it won't remove non-printable characters that are included in ASCII, which I think is what OP intended to ask.
-
Amon over 4 years
Am I the only one who this doesn't work for? Why wouldn't those characters be included in the printable list? like 0 or x for example?
-
jterrace over 4 years
@CharlesSmith - those are escape sequences
-
SherylHohman about 4 years
This would not allow for standard ASCII symbols, such as bullet points, degrees symbol, copyright symbol, Yen symbol, etc. Also, your first example includes non-printable symbols, such as BELL, which is undesirable.
-
Brajesh almost 4 years
When assigning a value to a variable it works fine, whereas reading from a file has no effect on filtering. Don't know why? Any ideas?