How can I remove non-ASCII characters but leave periods and spaces?

python text unicode filter ascii

215,100

Solution 1

You can filter all characters from the string that are not printable using string.printable, like this:

>>> s = "some\x00string. with\x15 funny characters"
>>> import string
>>> printable = set(string.printable)
>>> filter(lambda x: x in printable, s)
'somestring. with funny characters'

string.printable on my machine contains:

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c

EDIT: On Python 3, filter returns an iterator rather than a string. To get a string back, join the result:

''.join(filter(lambda x: x in printable, s))
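As a self-contained Python 3 sketch of the same idea (the names `PRINTABLE` and `keep_printable` are illustrative and not part of the original answer), the set of printable characters can be built once and reused across calls:

```python
import string

# Build the lookup set once; set membership is a constant-time check per character.
PRINTABLE = set(string.printable)


def keep_printable(text):
    """Return text with every character outside string.printable removed."""
    return "".join(ch for ch in text if ch in PRINTABLE)


print(keep_printable("some\x00string. with\x15 funny characters"))
# somestring. with funny characters
```

Periods, spaces, tabs, and newlines are all part of string.printable, so they survive the filter.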

Solution 2

An easy way to convert to a different codec is to use encode() or decode(). In your case, you want to convert to ASCII and ignore all characters that are not supported. For example, the Swedish letter å is not an ASCII character:

    >>> s = u'Good bye in Swedish is Hej d\xe5'
    >>> s = s.encode('ascii', errors='ignore')
    >>> print s
    Good bye in Swedish is Hej d

Edit:

Python3: str -> bytes -> str

>>> "Hej då".encode("ascii", errors="ignore").decode()
'Hej d'

Python2: unicode -> str -> unicode

>>> u"hej då".encode("ascii", errors="ignore").decode()
u'hej d'

Python2: str -> unicode -> str (decode and encode in reverse order)

>>> "hej d\xe5".decode("ascii", errors="ignore").encode()
'hej d'
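Wrapped as a Python 3 helper (the name `strip_non_ascii` is a hypothetical one chosen for this sketch), the encode/decode round trip looks like the following. Note that it keeps every ASCII character, including control characters such as `\x00`, so it is not equivalent to filtering on string.printable:

```python
def strip_non_ascii(text):
    """Drop every character that cannot be encoded as ASCII."""
    # errors="ignore" silently discards characters the ASCII codec cannot encode.
    return text.encode("ascii", errors="ignore").decode("ascii")


print(strip_non_ascii("Hej då. Så länge!"))
# Hej d. S lnge!
```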

Solution 3

According to @artfulrobot, this should be faster than filter and lambda:

import re
re.sub(r'[^\x00-\x7f]', r'', your_non_ascii_string)

For more examples, see the related question "Replace non-ASCII characters with a single space".
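A minimal runnable sketch of the regex approach, with the pattern precompiled for reuse (the names `NON_ASCII` and `remove_non_ascii`, and the optional `replacement` parameter, are assumptions made for illustration):

```python
import re

# Matches any single character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7f]")


def remove_non_ascii(text, replacement=""):
    """Replace each non-ASCII character; periods and spaces are left intact."""
    return NON_ASCII.sub(replacement, text)


print(remove_non_ascii("caf\u00e9 au lait. tr\u00e8s bien"))
# caf au lait. trs bien
```

Passing `replacement=" "` substitutes a space for each removed character instead of deleting it.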

Solution 4

You can use the following code to remove non-ASCII characters (here, the Chinese characters):

import re
text = "123456790 ABC#%? .(朱惠英)"  # avoid naming a variable "str"; it shadows the built-in
result = re.sub(r'[^\x00-\x7f]', r'', text)
print(result)

This prints:

123456790 ABC#%? .()

Solution 5

Your question is ambiguous; the first two sentences taken together imply that you believe that space and "period" are non-ASCII characters. This is incorrect. All chars such that ord(char) <= 127 are ASCII characters. For example, your function excludes these characters !"#$%&\'()*+,-./ but includes several others e.g. []{}.

Please step back, think a bit, and edit your question to tell us what you are trying to do, without mentioning the word ASCII, and why you think that chars such that ord(char) >= 128 are ignorable. Also: which version of Python? What is the encoding of your input data?

Please note that your code reads the whole input file as a single string, and your comment ("great solution") to another answer implies that you don't care about newlines in your data. If your file contains two lines like this:

this is line 1
this is line 2

the result would be 'this is line 1this is line 2' ... is that what you really want?

A greater solution would include:

a better name for the filter function than onlyascii

recognition that a filter function merely needs to return a truthy value if the argument is to be retained:

def filter_func(char):
    return char == '\n' or 32 <= ord(char) <= 126
# and later:
filtered_data = filter(filter_func, data).lower()
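On Python 3, `filter` returns an iterator rather than a string, so the call above would fail at `.lower()`. A sketch of the same predicate adapted for Python 3 (the name `keep_char` and the sample data are illustrative only):

```python
def keep_char(char):
    """Keep newlines and the printable ASCII range from space (32) to tilde (126)."""
    return char == "\n" or 32 <= ord(char) <= 126


data = "This is line 1.\r\nThis is l\u00efne 2.\x07\n"
# Join the filtered characters back into a string before lowercasing.
filtered_data = "".join(filter(keep_char, data)).lower()
print(repr(filtered_data))
# 'this is line 1.\nthis is lne 2.\n'
```

Because the predicate keeps `\n`, the two lines stay separate instead of running together.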


Author by

Admin

Updated on April 18, 2021

Comments

Admin about 3 years
I'm working with a .txt file. I want a string of the text from the file with no non-ASCII characters. However, I want to leave spaces and periods. At present, I'm stripping those too. Here's the code:
```
def onlyascii(char):
    if ord(char) < 48 or ord(char) > 127: return ''
    else: return char

def get_my_string(file_path):
    f=open(file_path,'r')
    data=f.read()
    f.close()
    filtered_data=filter(onlyascii, data)
    filtered_data = filtered_data.lower()
    return filtered_data
```
How should I modify onlyascii() to leave spaces and periods? I imagine it's not too complicated but I can't figure it out.
joaquin over 12 years

chr(127) in string.printable ?
joaquin over 12 years

what's up with those printable chars that are below ordinal 48 ?
jterrace over 12 years

chr(127) in string.printable == False
jterrace over 12 years

Do you mean 0b and 0c? They are part of string.whitespace.
joaquin over 12 years

yes, and from the OP: if ord(char) < 48 or ord(char) > 127. About my second comment, I am referring to '*', '(', and other printable characters which are eliminated by the OP...
jterrace over 12 years

Yeah, I was extrapolating that the OP probably meant all printable characters, rather than what was actually said, but might not be the case.
jterrace over 12 years

Slightly simpler: lambda x: 32 <= ord(x) <= 126
jterrace over 12 years

that's not the same as string.printable because it leaves out string.whitespace, although that might be what the OP wants, depends on things like \n and \t.
joaquin over 12 years

@jterrace right, includes space (ord 32) but no returns and tabs
jterrace over 12 years

yeah, just commenting on "this is equivalent to string.printable", but not true
joaquin over 12 years

I edited the answer, thanks! the OP question is misleading if you do not read it carefully.
Admin over 12 years

Thanks! I understand now. Sorry for the confusion - jterrace correctly interpreted my question.
rickcnagy over 10 years

this is also great for just filtering to digits - filter(lambda x: x in string.digits, s)
Xodarap777 over 10 years

I get UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 27
Xodarap777 over 10 years

This is incredibly slow in a large file. Any suggestions?
Xodarap777 over 10 years

This answer is very helpful to those of us coming in to ask something similar to the OP, and your proposed answer is helpfully pythonic. I do, however, find it strange that there isn't a more efficient solution to the problem as you interpreted it (which I often run into) - character by character, this takes a very long time in a very large file.
jterrace over 10 years

@Xodarap777 create a set(string.printable) and re-use it for the filtering. Also don't filter the whole file at once - do it in chunks of 8K-512K
cjbarth over 9 years

The only problem with using filter is that it returns an iterable. If you need a string back (as I did because I needed this when doing list compression) then do this: ''.join(filter(lambda x: x in string.printable, s)).
undershock over 9 years

@cjbarth - comment is python 3 specific, but very useful. Thanks!
Ben Liyanage about 9 years

I got that error when I put the actual unicode character in the string via copy paste. When you specify a string as u'thestring' encode works correctly.
Noam Manos over 8 years

Why not use regular expression: re.sub(r'[^\x00-\x7f]',r'', your-non-ascii-string) . See this thread stackoverflow.com/a/20079244/658497
gaborous over 8 years

This is the most compatible way of doing the OP's task; I tested it from Python 2.6 to Python 3.5.
gaborous over 8 years

Works only on Py3, but it's elegant.
artfulrobot about 8 years

@NoamManos this was 4-5 times faster for me than the join...filter...lambda solution, thanks.
ShadowRanger about 8 years

I suspect changing lambda x: x in printable to printable.__contains__ would make it run faster; the lambda means more Python level code execution, while directly passing the built-in membership test method removes per character byte code execution.
Jonny almost 8 years

Pylint complains about the use of filter in the above code. Given that list comprehensions seem to be preferred, would using ''.join(x for x in s if x in printable) be a) equivalent, and b) any better?
Jonny almost 8 years

Edit: I realise the above is a generator expression, but does the same apply?
jterrace almost 8 years

@Jonny - it's most likely equivalent, but I'd have to profile it to know for sure
Spc_555 about 7 years

For those who are getting the same error as @Xodarap777 : you should first .decode() the string, and only after that encode. For example s.decode('utf-8').encode('ascii', errors='ignore')
Ctrl-C almost 6 years

@Jonny, The result is the same, time differs (you need to compare if it happens to be a bottleneck). This is easier for an eye - the less the diversity of tools, the faster is reading comprehension. You may want to add an [Enter] before if and indent the second line so if starts just after ( from the first line.
Danilo Souza Morães almost 6 years

This solution answers OP's stated question, but beware that it won't remove non printable characters that are included in ASCII which I think is what OP intended to ask.
Amon over 4 years

Am I the only one who this doesn't work for? Why wouldn't those characters be included in the printable list? Like 0 or x, for example?
jterrace over 4 years

@CharlesSmith - those are escape sequences
SherylHohman about 4 years

This would not allow for standard ASCII symbols, such as bullet points, degrees symbol, copyright symbol, Yen symbol, etc. Also, your first example includes non-printable symbols, such as BELL, which is undesirable.
Brajesh almost 4 years

When assigning the value to a variable it works fine, whereas reading from a file has no effect on the filtering. I don't know why. Any ideas?