Stripping non-printable characters from a string in Python
Solution 1
Iterating over strings is unfortunately rather slow in Python. Regular expressions are over an order of magnitude faster for this kind of thing. You just have to build the character class yourself. The unicodedata module is quite helpful for this, especially the unicodedata.category() function. See Unicode Character Database for descriptions of the categories.
import unicodedata, re, itertools, sys
# sys.maxunicode is itself a valid code point, hence the + 1
all_chars = (chr(i) for i in range(sys.maxunicode + 1))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(chr, itertools.chain(range(0x00, 0x20), range(0x7f, 0xa0))))
control_char_re = re.compile('[%s]' % re.escape(control_chars))
def remove_control_chars(s):
    return control_char_re.sub('', s)
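A short usage sketch for the pattern above, assuming Python 3. The sample string and the helper name count_control_chars are made up for illustration. Counting the matches instead of substituting them is one way to find out how many control characters a string contains.

```python
import itertools
import re

# Cc (control) code points: U+0000-U+001F and U+007F-U+009F
control_chars = ''.join(map(chr, itertools.chain(range(0x00, 0x20), range(0x7f, 0xa0))))
control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)

def count_control_chars(s):
    # hypothetical helper: findall returns one match per control character
    return len(control_char_re.findall(s))

sample = 'caf\u00e9\x00 au\x1f lait\x7f'  # made-up input
cleaned = remove_control_chars(sample)
removed = count_control_chars(sample)
print(repr(cleaned), removed)
```

Non-ASCII letters such as é are left alone because they are not in the Cc category.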
For Python 2:
import unicodedata, re, sys
# sys.maxunicode is itself a valid code point, hence the + 1
all_chars = (unichr(i) for i in xrange(sys.maxunicode + 1))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(unichr, range(0x00, 0x20) + range(0x7f, 0xa0)))
control_char_re = re.compile('[%s]' % re.escape(control_chars))
def remove_control_chars(s):
    return control_char_re.sub('', s)
For some use cases, additional categories (e.g. all from the control group) might be preferable, although this might slow down processing and increase memory usage significantly. Number of characters per category:
- Cc (control): 65
- Cf (format): 161
- Cs (surrogate): 2048
- Co (private-use): 137468
- Cn (unassigned): 836601
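The counts above can be reproduced with a sketch like the following, assuming Python 3. The Cf and Cn numbers depend on the Unicode version bundled with the interpreter, so they may differ from the figures listed; Cc, Cs and Co are fixed by the standard.

```python
import sys
import unicodedata
from collections import Counter

# Count code points per general category for the Unicode version
# that ships with the running interpreter.
counts = Counter(unicodedata.category(chr(i)) for i in range(sys.maxunicode + 1))

for cat in ('Cc', 'Cf', 'Cs', 'Co', 'Cn'):
    print(cat, counts[cat])
print('Unicode version:', unicodedata.unidata_version)
```

The larger the set of removed categories, the larger the regex character class or translation table built from it becomes.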
Edit: Added suggestions from the comments.
Solution 2
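As far as I know, the most Pythonic/efficient method is the following. Note that string.printable only contains ASCII characters, so printable non-ASCII characters such as ü or ó are removed as well:
import string
# ''.join is needed because filter returns an iterator in Python 3
filtered_string = ''.join(filter(lambda x: x in string.printable, myStr))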
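A minimal sketch of the same idea with a generator expression and a set lookup, as suggested in the comments. The function name keep_ascii_printable and the sample input are hypothetical. It also shows the limitation of this approach: non-ASCII letters are dropped, while newline survives because it is part of string.printable.

```python
import string

# A set gives constant-time membership tests instead of scanning
# the 100-character string.printable for every character.
PRINTABLE = set(string.printable)

def keep_ascii_printable(s):
    return ''.join(c for c in s if c in PRINTABLE)

result = keep_ascii_printable('Caf\u00e9\x00 ol\u00e9\n')
print(repr(result))
```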
Solution 3
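You could try setting up a filter using the unicodedata.category() function:
import unicodedata
# categories to keep: uppercase and lowercase letters only;
# add e.g. 'Zs' and 'Nd' to keep spaces and digits as well
printable = {'Lu', 'Ll'}
def filter_non_printable(s):
    return ''.join(c for c in s if unicodedata.category(c) in printable)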
See Table 4-9 on page 175 in the Unicode database character properties for the available categories.
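An allow-list of two letter categories is quite strict. A hedged alternative, following a suggestion from the comments, is to drop only characters whose category starts with 'C' (Cc, Cf, Cs, Co, Cn). The function name strip_other_categories is made up for this sketch, and note that tab and newline are Cc and are therefore removed too.

```python
import unicodedata

def strip_other_categories(text):
    # Keep every character whose general category does not start with 'C',
    # so letters, marks, numbers, punctuation, symbols and separators survive.
    return ''.join(c for c in text if unicodedata.category(c)[0] != 'C')

# \x00 is a control character (Cc), \u200b is ZERO WIDTH SPACE (Cf)
result = strip_other_categories('Hello\x00, W\u00f6rld 42\u200b!')
print(result)
```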
Solution 4
In Python 3:
import itertools
def filter_nonprintable(text):
    # Use characters of control category
    nonprintable = itertools.chain(range(0x00, 0x20), range(0x7f, 0xa0))
    # Use translate to remove all non-printable characters
    return text.translate({character: None for character in nonprintable})
See this StackOverflow post on removing punctuation for how .translate() compares to regex & .replace().
The ranges can be generated via
nonprintable = (ord(c) for c in (chr(i) for i in range(sys.maxunicode + 1)) if unicodedata.category(c) == 'Cc')
using the Unicode character database categories as shown by @Ants Aasma.
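As a sanity check, the sketch below (assuming Python 3) compares the hard-coded ranges with the Cc code points reported by unicodedata and builds the translation table once so it can be reused across calls. The variable names are chosen for illustration only.

```python
import itertools
import sys
import unicodedata

hard_coded = set(itertools.chain(range(0x00, 0x20), range(0x7f, 0xa0)))
from_database = {
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)) == 'Cc'
}
same = hard_coded == from_database
print(same)

# Build the table once; dict.fromkeys maps every key to None,
# which tells str.translate to delete that code point.
table = dict.fromkeys(hard_coded)
stripped = 'a\x00b\x9fc\tz'.translate(table)
print(stripped)
```

Building the table outside the function avoids recreating the dictionary on every call.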
Solution 5
The following will work with Unicode input and is rather fast...
import sys
# build a table mapping all non-printable characters to None
NOPRINT_TRANS_TABLE = {
    i: None for i in range(0, sys.maxunicode + 1) if not chr(i).isprintable()
}
def make_printable(s):
    """Remove non-printable characters from a string."""
    # the translate method on str removes characters
    # that map to None from the string
    return s.translate(NOPRINT_TRANS_TABLE)
assert make_printable('Café') == 'Café'
assert make_printable('\x00\x11Hello') == 'Hello'
assert make_printable('') == ''
My own testing suggests this approach is faster than functions that iterate over the string and return a result using str.join.
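One way to check the speed claim on your own machine is a small timeit comparison like the sketch below. The function names, the sample data and the repetition count are assumptions made for illustration, and the timings will vary by interpreter and input.

```python
import sys
import timeit

NOPRINT_TRANS_TABLE = {
    i: None for i in range(sys.maxunicode + 1) if not chr(i).isprintable()
}

def with_translate(s):
    return s.translate(NOPRINT_TRANS_TABLE)

def with_join(s):
    return ''.join(c for c in s if c.isprintable())

sample = 'Caf\u00e9 \x00\x11Hello\u200b ' * 1000  # made-up test data

# Both variants must agree before their timings are worth comparing
assert with_translate(sample) == with_join(sample)
for fn in (with_translate, with_join):
    seconds = timeit.timeit(lambda: fn(sample), number=100)
    print(fn.__name__, round(seconds, 4))
```

The table costs some start-up time and memory because it holds an entry for every non-printable code point, so the trade-off mainly pays off when many strings are processed.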
Vinko Vrsalovic
Updated on February 14, 2022
Comments
-
Vinko Vrsalovic, about 2 years ago:
I used to run
$s =~ s/[^[:print:]]//g;
on Perl to get rid of non-printable characters.
In Python there are no POSIX regex classes, and I can't write [:print:] and have it mean what I want. I know of no way in Python to detect whether a character is printable or not.
What would you do?
EDIT: It has to support Unicode characters as well. The string.printable way will happily strip them out of the output. curses.ascii.isprint will return false for any Unicode character.
-
Nathan Shively-Sanders, over 15 years ago: You probably want filtered_string = ''.join(filter(lambda x: x in string.printable, myStr)) so that you get back a string.
-
Vinko Vrsalovic, over 15 years ago: Sadly string.printable does not contain Unicode characters, and thus ü or ó will not be in the output... maybe there is something else?
-
Patrick Johnmeyer, over 15 years ago: Is 'Cc' enough here? I don't know, I'm just asking -- it seems to me that some of the other 'C' categories may be candidates for this filter as well.
-
habnabit, over 15 years ago: You should be using a list comprehension or generator expression, not filter + lambda. One of these will be faster 99.9% of the time. ''.join(s for s in myStr if s in string.printable)
-
devon93, over 15 years ago: The lot of you are correct, of course. I should stop trying to help people while sleep-deprived!
-
habnabit, over 15 years ago: Unless you're on Python 2.3, the inner []s are redundant. "return ''.join(c for c ...)"
-
Ishbir, over 15 years ago: You started a list comprehension which did not end in your final line. I suggest you remove the opening bracket completely.
-
Ber, over 15 years ago: Thank you for pointing this out. I edited the post accordingly.
-
Miles, almost 15 years ago: Not quite redundant: they have different meanings (and performance characteristics), though the end result is the same.
-
Gearoid Murphy, about 13 years ago: Shouldn't the other end of the range be protected too? "ord(c) <= 126"
-
Seth, over 12 years ago: This code doesn't work in 2.6 or 3.2. Which version does it run in?
-
Chris Morgan, over 12 years ago: @AaronGallagher: 99.9% faster? From whence do you pluck that figure? The performance comparison is nowhere near that bad.
-
tripleee, over 11 years ago: But there are Unicode characters which are not printable, too.
-
Gareth Rees, over 11 years ago: It's perhaps worth turning string.printable into a set before doing the filter.
-
dotancohen, over 11 years ago: Hi William. This method seems to remove all non-ASCII characters. There are many printable non-ASCII characters in Unicode!
-
dotancohen, over 11 years ago: This function, as published, removes half of the Hebrew characters. I get the same effect for both of the methods given.
-
dotancohen, almost 11 years ago: This seems the most direct, straightforward method. Thanks.
-
Kashyap, over 10 years ago: From a performance perspective, wouldn't string.translate() work faster in this case? See stackoverflow.com/questions/265960/…
-
Oddthinking, over 9 years ago: @ChrisMorgan: Late response, but the claim is that it will almost always be faster, not that it will be much, much faster.
-
Edward Falk, over 9 years ago: This fails for a "narrow" build of Python (16-bit Unicode). That's the standard build for Mac. stackoverflow.com/questions/7105874
-
chrisinmtown, about 9 years ago: @ants aasma: Please tell me, how can your approach of building a character class be used to count the control chars in the string (not strip them)? I don't see any suitable method in re.
-
Dave, almost 9 years ago: @Edward Falk: For the narrow build, put all_chars = (unichr(i) for i in xrange(0x110000)) in a try clause, then the same with xrange(0x10000) in the except clause. That allows it to work with a "narrow" build (like OSX).
-
Dave, almost 9 years ago: @PatrickJohnmeyer You've got a good point, and this bit me. I fixed it by checking if the unicodedata.category(c) is in a set of any of the 'Other' Unicode categories (see: fileformat.info/info/unicode/category/index.htm ), i.e. set(['Cc','Cf','Cn','Co','Cs']). Note that I'm using English fonts, so ymmv using other fonts.
-
danmichaelo, over 8 years ago: Use all_chars = (unichr(i) for i in xrange(sys.maxunicode)) to avoid the narrow build error.
-
AXO, over 7 years ago: For me control_chars == '\x00-\x1f\x7f-\x9f' (tested on Python 3.5.2)
-
Wcan, over 6 years ago: Can I apply this to a pandas DataFrame? If yes, please explain how.
-
marsl, about 6 years ago: Be aware: In Python 3, filter returns an iterator, so use Nathan's ''.join(...); str(filter(...)) does not give the filtered string back.
-
TaiwanGrapefruitTea, over 5 years ago: Here's my version that gives a clue about what was eliminated: ''.join((s if s in string.printable else 'X') for s in s_string_to_print)
-
Fabrizio Miano, about 5 years ago: It should be printable = set(['Lu', 'Ll']), shouldn't it?
-
Ber, about 5 years ago: @FabrizioMiano You are right. Or set(('Lu', 'Ll')). Thanks.
-
Csaba Toth, almost 5 years ago: @Ber You meant to say printable = {'Lu', 'Ll'}?
-
Ber, almost 5 years ago: @CsabaToth All three are valid and yield the same set. Yours is maybe the nicest way to specify a set literal.
-
Csaba Toth, almost 5 years ago: @Ber All of them result in the same set; certain linters advise you to use the one I suggested.
-
evandrix, almost 5 years ago: "".join([c if 0x21<=ord(c) and ord(c)<=0x7e else "" for c in ss])
-
Chop Labalagun, almost 5 years ago: This worked super great for me and it's one line. Thanks.
-
Chop Labalagun, almost 5 years ago: For some reason this works great on Windows but I can't use it on Linux. I had to change the f for an r, but I am not sure that is the solution.
-
pir, over 4 years ago: This is the only answer that works for me with Unicode characters. Awesome that you provided test cases!
-
pir, over 4 years ago: If you want to allow for line breaks, add LINE_BREAK_CHARACTERS = set(["\n", "\r"]) and the condition "and not chr(i) in LINE_BREAK_CHARACTERS" when building the table.
-
Anudocs, over 4 years ago: But this removes the spaces in the string. How can I keep the spaces?
-
Ber, over 4 years ago: @AnubhavJhalani You can add more Unicode categories to the filter. To preserve spaces and digits in addition to letters, use printable = {'Lu', 'Ll', 'Zs', 'Nd'}
-
tripleee, over 4 years ago: Sounds like your Linux Python was too old to support f-strings then. r-strings are quite different, though you could say r'[^' + re.escape(string.printable) + r']'. (I don't think re.escape() is entirely correct here, but if it works...)
-
tripleee, over 4 years ago: Actually you don't need the square brackets either then.
-
darkdragon, almost 4 years ago: I suggest removing only control characters. See my answer for an example.
-
darkdragon, almost 4 years ago: It would be better to use Unicode ranges (see @Ants Aasma's answer). The result would be text.translate({c:None for c in itertools.chain(range(0x00,0x20),range(0x7f,0xa0))}).
-
tdc, almost 4 years ago: This is a great answer!
-
darkdragon, almost 4 years ago: On Python 3 use chr() instead of unichr() and range() instead of xrange(). Furthermore, to combine the two iterators returned by range() one should use itertools.chain(): itertools.chain(range(), range()). For readability, I suggest using hex numbers (thanks @AXO) in the static ranges: range(0x00,0x20) and range(0x7f,0xa0).
-
Big McLargeHuge, over 3 years ago: You may be on to something with startswith('C') but this was far less performant in my testing than any other solution.
-
darkdragon, over 3 years ago: big-mclargehuge: The goal of my solution was the combination of completeness and simplicity/readability. You could try to use "if unicodedata.category(c)[0] != 'C'" instead. Does it perform better? If you prefer execution speed over memory requirements, you can pre-compute the table as shown in stackoverflow.com/a/93029/3779655
-
LoMaPh, over 3 years ago: Note that tab, newline and a few more are part of the printable characters. So if you don't want to include those, you should use string.printable[:-5]
-
Bill, over 3 years ago: I found that after adding 'Zs' to include spaces this method did not strip the '\xa0' character, which Python does not seem to print. It is a 'non-breaking space' apparently. According to this post you need to remove this manually, which is a pain.
-
the_economist, about 3 years ago: Sadly string.printable does not contain Unicode characters, and thus ü or ó will not be in the output...