Stripping non-printable characters from a string in Python

python string non-printable

150,157

Solution 1

Iterating over strings is unfortunately rather slow in Python. Regular expressions are over an order of magnitude faster for this kind of thing. You just have to build the character class yourself. The unicodedata module is quite helpful for this, especially the unicodedata.category() function. See Unicode Character Database for descriptions of the categories.

import unicodedata, re, itertools, sys

all_chars = (chr(i) for i in range(sys.maxunicode + 1))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(chr, itertools.chain(range(0x00,0x20), range(0x7f,0xa0))))

control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)

For Python2

import unicodedata, re, sys

all_chars = (unichr(i) for i in xrange(sys.maxunicode + 1))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(unichr, range(0x00,0x20) + range(0x7f,0xa0)))

control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)

For some use cases, additional categories (e.g. all of the 'C' categories, not just 'Cc') might be preferable, although this can slow down processing and increase memory usage significantly. Number of characters per category:

Cc (control): 65
Cf (format): 161
Cs (surrogate): 2048
Co (private-use): 137468
Cn (unassigned): 836601
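
Some of these counts depend on the Unicode version bundled with the interpreter, so it may be worth recomputing them locally. The following is a minimal sketch, assuming Python 3; the helper names count_other_categories and build_remover are illustrative and not part of the original answer.

```python
import re
import sys
import unicodedata
from collections import Counter


def count_other_categories():
    """Count code points in each 'C*' (Other) category for this interpreter."""
    counts = Counter()
    for i in range(sys.maxunicode + 1):
        cat = unicodedata.category(chr(i))
        if cat.startswith('C'):
            counts[cat] += 1
    return counts


def build_remover(categories):
    """Compile a regex character class covering every code point in `categories`."""
    chars = ''.join(
        chr(i) for i in range(sys.maxunicode + 1)
        if unicodedata.category(chr(i)) in categories
    )
    return re.compile('[%s]' % re.escape(chars))


counts = count_other_categories()
strip_cc_cf = build_remover({'Cc', 'Cf'})
# NUL is in 'Cc' and ZERO WIDTH SPACE (U+200B) is in 'Cf', so both are removed.
cleaned = strip_cc_cf.sub('', 'a\x00b\u200bc')
print(dict(counts))
print(repr(cleaned))
```

Passing larger categories such as 'Co' or 'Cn' to build_remover works the same way, but the resulting character class becomes very large, which is the cost the answer warns about.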

Edit Adding suggestions from the comments.

Solution 2

As far as I know, the most pythonic/efficient method would be:

import string

filtered_string = ''.join(filter(lambda x: x in string.printable, myStr))
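
As the comments point out, string.printable covers ASCII only, so this approach also drops printable non-ASCII characters. A small sketch of that behaviour, assuming Python 3 (the name ascii_printable_only is made up for illustration):

```python
import string

# A set makes each membership test O(1) instead of scanning a 100-character string.
PRINTABLE = set(string.printable)


def ascii_printable_only(s):
    """Keep only characters found in string.printable (ASCII only)."""
    return ''.join(c for c in s if c in PRINTABLE)


print(repr(ascii_printable_only('abc\x00def')))  # the NUL character is removed
print(repr(ascii_printable_only('Café')))        # 'é' is removed as well
```

This is probably only suitable when the input is known to be ASCII; for Unicode text, the category-based solutions are a better fit.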

Solution 3

You could try setting up a filter using the unicodedata.category() function:

import unicodedata
printable = {'Lu', 'Ll'}

def filter_non_printable(s):
    # keep only characters whose Unicode category is in the allowed set
    return ''.join(c for c in s if unicodedata.category(c) in printable)

See Table 4-9 on page 175 in the Unicode database character properties for the available categories.
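
Keeping only 'Lu' and 'Ll' also removes digits, spaces, and punctuation. One alternative raised in the comments is to invert the test and drop only the 'C*' (Other) categories. A sketch of that idea, assuming Python 3; the function name drop_other_categories is hypothetical:

```python
import unicodedata


def drop_other_categories(s):
    """Remove characters whose Unicode general category starts with 'C'."""
    return ''.join(c for c in s if unicodedata.category(c)[0] != 'C')


# NUL is 'Cc' and ZERO WIDTH SPACE (U+200B) is 'Cf'; letters, digits,
# spaces, and punctuation belong to other categories and are kept.
print(repr(drop_other_categories('Hello, wörld 42\x00\u200b')))
```

Note that tab and newline are also in 'Cc', so they are removed unless they are explicitly allowed.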

Solution 4

In Python 3,

def filter_nonprintable(text):
    import itertools
    # Use characters of control category
    nonprintable = itertools.chain(range(0x00,0x20),range(0x7f,0xa0))
    # Use translate to remove all non-printable characters
    return text.translate({character:None for character in nonprintable})

See this StackOverflow post on removing punctuation for how .translate() compares to regex & .replace()

The ranges can be generated via nonprintable = (ord(c) for c in (chr(i) for i in range(sys.maxunicode)) if unicodedata.category(c)=='Cc') using the Unicode character database categories as shown by @Ants Aasma.
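
To check that the hard-coded ranges match the 'Cc' category, the two sets can be compared directly. This is a minimal sketch assuming Python 3; the variable names are illustrative.

```python
import itertools
import sys
import unicodedata

# Code points hard-coded in the function above.
hardcoded = set(itertools.chain(range(0x00, 0x20), range(0x7f, 0xa0)))

# The same set derived from the Unicode database (category 'Cc').
derived = {
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)) == 'Cc'
}

# Every key maps to None, which str.translate treats as "delete".
table = dict.fromkeys(derived)
result = 'tab\there\x7f'.translate(table)

print(hardcoded == derived)
print(repr(result))
```

Building the table once at module level, rather than on every call, avoids repeating the scan over all code points.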

Solution 5

The following will work with Unicode input and is rather fast...

import sys

# build a table mapping all non-printable characters to None
NOPRINT_TRANS_TABLE = {
    i: None for i in range(0, sys.maxunicode + 1) if not chr(i).isprintable()
}

def make_printable(s):
    """Remove non-printable characters from a string."""

    # the translate method on str removes characters
    # that map to None from the string
    return s.translate(NOPRINT_TRANS_TABLE)


assert make_printable('Café') == 'Café'
assert make_printable('\x00\x11Hello') == 'Hello'
assert make_printable('') == ''

My own testing suggests this approach is faster than functions that iterate over the string and return a result using str.join.
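
A rough way to reproduce that comparison is sketched below, assuming Python 3. The function names and the sample string are made up for illustration, and timings will vary by machine and interpreter version, so no figures are claimed here.

```python
import sys
import timeit

# Same table as in the answer: non-printable code points map to None.
TABLE = {i: None for i in range(sys.maxunicode + 1) if not chr(i).isprintable()}


def with_translate(s):
    """Remove non-printable characters using str.translate."""
    return s.translate(TABLE)


def with_join(s):
    """Remove non-printable characters by iterating and joining."""
    return ''.join(c for c in s if c.isprintable())


sample = 'Caf\u00e9 \x00\x11Hello\n' * 1000

# Both approaches should produce the same output before timing them.
assert with_translate(sample) == with_join(sample)

for fn in (with_translate, with_join):
    seconds = timeit.timeit(lambda: fn(sample), number=100)
    print(f'{fn.__name__}: {seconds:.4f}s for 100 runs')
```

The one-off cost of building TABLE is not included in the timing, which matters if the function is only called a handful of times.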


Author by

Vinko Vrsalovic

A generalist. Or, better put, jack of all trades, master of none. Currently mastering nothing at stackoverflow.

Updated on February 14, 2022

Comments

Vinko Vrsalovic about 2 years
I used to run
```
$s =~ s/[^[:print:]]//g;
```
in Perl to get rid of non-printable characters.

In Python there are no POSIX regex classes, so I can't write [:print:] and have it mean what I want. I know of no way in Python to detect whether a character is printable or not.

What would you do?

EDIT: It has to support Unicode characters as well. The string.printable way will happily strip them out of the output. curses.ascii.isprint will return false for any unicode character.
Nathan Shively-Sanders over 15 years

You probably want filtered_string = ''.join(filter(lambda x:x in string.printable, myStr)) so that you get back a string.
Vinko Vrsalovic over 15 years

Sadly string.printable does not contain unicode characters, and thus ü or ó will not be in the output... maybe there is something else?
Patrick Johnmeyer over 15 years

Is 'Cc' enough here? I don't know, I'm just asking -- it seems to me that some of the other 'C' categories may be candidates for this filter as well.
habnabit over 15 years

You should be using a list comprehension or generator expressions, not filter + lambda. One of these will 99.9% of the time be faster. ''.join(s for s in myStr if s in string.printable)
devon93 over 15 years

The lot of you are correct, of course. I should stop trying to help people while sleep-deprived!
habnabit over 15 years

Unless you're on python 2.3, the inner []s are redundant. "return ''.join(c for c ...)"
Ishbir over 15 years

you started a list comprehension which did not end in your final line. I suggest you remove the opening bracket completely.
Ber over 15 years

Thank you for pointing this out. I edited the post accordingly
Miles almost 15 years

Not quite redundant—they have different meanings (and performance characteristics), though the end result is the same.
Gearoid Murphy about 13 years

Should the other end of the range not be protected too?: "ord(c) <= 126"
Seth over 12 years

This code doesn't work in 2.6 or 3.2, which version does it run in?
Chris Morgan over 12 years

@AaronGallagher: 99.9% faster? From whence do you pluck that figure? The performance comparison is nowhere near that bad.
tripleee over 11 years

But there are Unicode characters which are not printable, too.
Gareth Rees over 11 years

It's perhaps worth turning string.printable into a set before doing the filter.
dotancohen over 11 years

Hi William. This method seems to remove all non-ASCII characters. There are many printable non-ASCII characters in Unicode!
dotancohen over 11 years

This function, as published, removes half of the Hebrew characters. I get the same effect for both of the methods given.
dotancohen almost 11 years

This seems the most direct, straightforward method. Thanks.
Kashyap over 10 years

From performance perspective, wouldn't string.translate() work faster in this case? See stackoverflow.com/questions/265960/…
Oddthinking over 9 years

@ChrisMorgan: Late response, but the claim is it will almost always be faster, not that it will be much, much faster.
Edward Falk over 9 years

This fails for a "narrow" build of python (16-bit unicode). That's the standard build for Mac. stackoverflow.com/questions/7105874
chrisinmtown about 9 years

@ants aasma: pls tell me, how can your approach of building a character class be used to count the control chars in the string (not strip them)? I don't see any suitable method in re.
Dave almost 9 years

@Edward Falk: For the narrow build, put all_chars = (unichr(i) for i in xrange(0x110000)) in a try clause, then the same with xrange(0x10000) in the except clause. This allows it to work with a "narrow" build (like OSX).
Dave almost 9 years

@PatrickJohnmeyer You've got a good point, and this bit me. I fixed it by checking if the unicodedata.category(c) is in a set of any of the 'Other' unicode categories (see: fileformat.info/info/unicode/category/index.htm ), ie set(['Cc','Cf','Cn','Co','Cs']). Note that I'm using English fonts, so ymmv using other fonts.
danmichaelo over 8 years

Use all_chars = (unichr(i) for i in xrange(sys.maxunicode)) to avoid the narrow build error.
AXO over 7 years

For me control_chars == '\x00-\x1f\x7f-\x9f' (tested on Python 3.5.2)
Wcan over 6 years

can i apply this on pandas dataframe, if yes please explain how
marsl about 6 years

Be aware: In Python 3, filter returns an iterator rather than a string, so use Nathan's ''.join(...). Note that str(filter(...)) only gives the representation of the filter object, not the filtered text.
TaiwanGrapefruitTea over 5 years

Here's my version that gives a clue about what was eliminated: ''.join( (s if s in string.printable else 'X') for s in s_string_to_print )
Fabrizio Miano about 5 years

it should be printable = set(['Lu', 'Ll']) shouldn't it ?
Ber about 5 years

@FabrizioMiano You are right. Or set(('Lu', 'Ll')) Thanx
Csaba Toth almost 5 years

@Ber You meant to say printable = {'Lu', 'Ll'} ?
Ber almost 5 years

@CsabaToth All three are valid and yield the same set. Yours is maybe the nicest way to specify a set literal.
Csaba Toth almost 5 years

@Ber All of them result with the same set, certain linters advise you to use the one I advised.
evandrix almost 5 years

"".join([c if 0x21<=ord(c) and ord(c)<=0x7e else "" for c in ss])
Chop Labalagun almost 5 years

This worked super great for me and its 1 line. thanks
Chop Labalagun almost 5 years

for some reason this works great on windows but cant use it on linux, i had to change the f for an r but i am not sure that is the solution.
pir over 4 years

This is the only answer that works for me with unicode characters. Awesome that you provided test cases!
pir over 4 years

If you want to allow for line breaks, add LINE_BREAK_CHARACTERS = set(["\n", "\r"]) and append the condition "and not chr(i) in LINE_BREAK_CHARACTERS" when building the table.
Anudocs over 4 years

but this removes the space in the string. How to maintain the space in the string?
Ber over 4 years

@AnubhavJhalani You can add more Unicode categories to the filter. To preserve spaces and digits in addition to letters, use printable = {'Lu', 'Ll', 'Zs', 'Nd'}
tripleee over 4 years

Sounds like your Linux Python was too old to support f-strings then. r-strings are quite different, though you could say r'[^' + re.escape(string.printable) + r']'. (I don't think re.escape() is entirely correct here, but if it works...)
tripleee over 4 years

Actually you don't need the square brackets either then.
darkdragon almost 4 years

I suggest removing only control characters. See my answer for an example.
darkdragon almost 4 years

It would be better to use Unicode ranges (see @Ants Aasma's answer). The result would be text.translate({c:None for c in itertools.chain(range(0x00,0x20),range(0x7f,0xa0))}).
tdc almost 4 years

This is a great answer!
darkdragon almost 4 years

On Python3 use chr() instead of unichr() and range() instead of xrange(). Furthermore, for combination of the two iterators returned by range() one should use itertools.chain(): itertools.chain(range(), range()). For readability, I suggest to use hex numbers (thanks @AXO) in the static ranges: range(0x00,0x20) and range(0x7f,0xa0).
Big McLargeHuge over 3 years

You may be on to something with startswith('C') but this was far less performant in my testing than any other solution.
darkdragon over 3 years

big-mclargehuge: The goal of my solution was the combination of completeness and simplicity/readability. You could try to use if unicodedata.category(c)[0] != 'C' instead. Does it perform better? If you prefer execution speed over memory requirements, one can pre-compute the table as shown in stackoverflow.com/a/93029/3779655
LoMaPh over 3 years

Note that tab, newline and a few more are part of the printable characters. So if you don't want to include those, you should use string.printable[:-5]
Bill over 3 years

I found that after adding 'Zs' to include spaces this method did not strip the '\xa0' character which Python does not seem to print. It is a 'non-breaking space' apparently. According to this post you need to remove this manually which is a pain.
the_economist about 3 years

Sadly string.printable does not contain unicode characters, and thus ü or ó will not be in the output...