Stripping non-printable characters from a string in Python

150,157

Solution 1

Iterating over strings is unfortunately rather slow in Python. Regular expressions are over an order of magnitude faster for this kind of thing. You just have to build the character class yourself. The unicodedata module is quite helpful for this, especially the unicodedata.category() function. See Unicode Character Database for descriptions of the categories.

import unicodedata, re, itertools, sys

# sys.maxunicode is itself a valid code point, so the range must include it
all_chars = (chr(i) for i in range(sys.maxunicode + 1))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(chr, itertools.chain(range(0x00,0x20), range(0x7f,0xa0))))

control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)

For Python 2:

import unicodedata, re, sys

all_chars = (unichr(i) for i in xrange(sys.maxunicode + 1))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(unichr, range(0x00,0x20) + range(0x7f,0xa0)))

control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)

For some use cases, additional categories (e.g. all of those in the C, or "Other", group) might be preferable, although this might slow down processing and increase memory usage significantly. Number of characters per category:

  • Cc (control): 65
  • Cf (format): 161
  • Cs (surrogate): 2048
  • Co (private-use): 137468
  • Cn (unassigned): 836601
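
The counts above depend on the Unicode version bundled with the interpreter, so it may be worth recomputing them locally. A minimal sketch, assuming Python 3; the names count_other_categories, build_category_re and control_and_format_re are illustrative and not part of the answer above. Scanning every code point takes a noticeable fraction of a second, so it is best done once at import time.

```python
import re
import sys
import unicodedata
from collections import Counter


def count_other_categories():
    """Count the code points in each 'Other' (C*) general category."""
    counts = Counter()
    for i in range(sys.maxunicode + 1):
        category = unicodedata.category(chr(i))
        if category.startswith('C'):
            counts[category] += 1
    return counts


def build_category_re(categories):
    """Compile a character class matching every code point in `categories`."""
    chars = ''.join(
        chr(i) for i in range(sys.maxunicode + 1)
        if unicodedata.category(chr(i)) in categories
    )
    return re.compile('[%s]' % re.escape(chars))


counts = count_other_categories()
for category in sorted(counts):
    print(category, counts[category])

# Example: strip control and format characters, but leave the much larger
# surrogate, private-use and unassigned categories alone.
control_and_format_re = build_category_re({'Cc', 'Cf'})
print(control_and_format_re.sub('', 'a\u200bb\x00c'))
```

Limiting the class to Cc and Cf keeps the pattern small; adding Co or Cn puts hundreds of thousands of characters into the class, which is where the memory cost mentioned above comes from.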

Edit: Added suggestions from the comments.

Solution 2

As far as I know, the most pythonic/efficient method would be:

import string

# filter() returns an iterator in Python 3, so join the result back into a string
filtered_string = ''.join(filter(lambda x: x in string.printable, myStr))
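
A sketch of the same filter written as a generator expression over a precomputed set, as suggested in the comments. It also shows the limitation raised there: string.printable is ASCII-only, so printable non-ASCII characters are dropped as well. The names ASCII_PRINTABLE and keep_ascii_printable are illustrative.

```python
import string

# string.printable is a str; a frozenset makes each membership test O(1).
ASCII_PRINTABLE = frozenset(string.printable)


def keep_ascii_printable(s):
    """Return s with every character outside string.printable removed."""
    return ''.join(c for c in s if c in ASCII_PRINTABLE)


print(keep_ascii_printable('Hello\x00 World'))  # the NUL character is removed
print(keep_ascii_printable('Caf\u00e9'))         # the accented letter is removed too
```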

Solution 3

You could try setting up a filter using the unicodedata.category() function:

import unicodedata
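printable = {'Lu', 'Ll'}  # uppercase and lowercase letters only
def filter_non_printable(s):
  # the parameter is named s so that it does not shadow the built-in str
  return ''.join(c for c in s if unicodedata.category(c) in printable)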

See Table 4-9 on page 175 in the Unicode database character properties for the available categories.
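
Keeping only Lu and Ll also removes digits, spaces and punctuation. A sketch of the inverse approach mentioned in the comments, which drops only characters whose category starts with C (the "Other" group) and keeps everything else; strip_other_chars is an illustrative name.

```python
import unicodedata


def strip_other_chars(s):
    """Drop characters in any 'Other' category (Cc, Cf, Cs, Co, Cn)."""
    return ''.join(c for c in s if unicodedata.category(c)[0] != 'C')


print(strip_other_chars('na\u00efve caf\u00e9 42\x00\x07'))
```

Note that tab, newline and carriage return are in category Cc, so this version removes them too. The non-breaking space U+00A0 is in category Zs and is kept.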

Solution 4

In Python 3,

def filter_nonprintable(text):
    import itertools
    # Use characters of control category
    nonprintable = itertools.chain(range(0x00,0x20),range(0x7f,0xa0))
    # Use translate to remove all non-printable characters
    return text.translate({character:None for character in nonprintable})

See this StackOverflow post on removing punctuation for how .translate() compares to regex and .replace().

The ranges can be generated from the Unicode character database categories, as shown by @Ants Aasma, via nonprintable = (i for i in range(sys.maxunicode + 1) if unicodedata.category(chr(i)) == 'Cc'), which additionally requires importing sys and unicodedata.
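
The function above rebuilds its translation table on every call. A sketch that builds the table once at import time instead, assuming Python 3; CONTROL_CHAR_TABLE and strip_control_chars are illustrative names.

```python
import itertools

# dict.fromkeys maps every code point in the Cc (control) ranges to None,
# which tells str.translate to delete that character.
CONTROL_CHAR_TABLE = dict.fromkeys(
    itertools.chain(range(0x00, 0x20), range(0x7f, 0xa0))
)


def strip_control_chars(text):
    """Remove the C0 and C1 control characters, plus DEL, from text."""
    return text.translate(CONTROL_CHAR_TABLE)


print(strip_control_chars('a\x00b\x7fc\x9fd'))
```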

Solution 5

The following will work with Unicode input and is rather fast...

import sys

# build a table mapping all non-printable characters to None
NOPRINT_TRANS_TABLE = {
    i: None for i in range(0, sys.maxunicode + 1) if not chr(i).isprintable()
}

def make_printable(s):
    """Remove non-printable characters from a string."""

    # the translate method on str removes characters
    # that map to None from the string
    return s.translate(NOPRINT_TRANS_TABLE)


assert make_printable('Café') == 'Café'
assert make_printable('\x00\x11Hello') == 'Hello'
assert make_printable('') == ''

My own testing suggests this approach is faster than functions that iterate over the string and return a result using str.join.
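
The speed claim can be checked locally with timeit. A rough sketch, not a rigorous benchmark; the sample string, the repeat count and the names with_translate, with_join and time_both are all assumptions made for illustration.

```python
import sys
import timeit

# Same idea as the table above: map every non-printable code point to None.
NOPRINT_TABLE = {
    i: None for i in range(sys.maxunicode + 1) if not chr(i).isprintable()
}


def with_translate(s):
    """Remove non-printable characters using str.translate."""
    return s.translate(NOPRINT_TABLE)


def with_join(s):
    """Remove non-printable characters by iterating and joining."""
    return ''.join(c for c in s if c.isprintable())


def time_both(sample, number=100):
    """Return the seconds each approach takes for `number` runs."""
    return {
        'translate': timeit.timeit(lambda: with_translate(sample), number=number),
        'join': timeit.timeit(lambda: with_join(sample), number=number),
    }


sample = 'Caf\u00e9 \x00\x11Hello\n' * 1000
assert with_translate(sample) == with_join(sample)
print(time_both(sample))
```

Results vary with the Python version and with how many characters in the input actually need removing, so it is worth timing on representative data.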

Author: Vinko Vrsalovic


Updated on February 14, 2022

Comments

  • Vinko Vrsalovic
    Vinko Vrsalovic about 2 years

    I used to run

    $s =~ s/[^[:print:]]//g;
    

    on Perl to get rid of non printable characters.

    In Python there are no POSIX regex classes, so I can't write [:print:] and have it mean what I want. I know of no way in Python to detect whether a character is printable or not.

    What would you do?

    EDIT: It has to support Unicode characters as well. The string.printable way will happily strip them out of the output. curses.ascii.isprint will return false for any unicode character.

  • Nathan Shively-Sanders
    Nathan Shively-Sanders over 15 years
    You probably want filtered_string = ''.join(filter(lambda x:x in string.printable, myStr)) so that you get back a string.
  • Vinko Vrsalovic
    Vinko Vrsalovic over 15 years
    Sadly string.printable does not contain unicode characters, and thus ü or ó will not be in the output... maybe there is something else?
  • Patrick Johnmeyer
    Patrick Johnmeyer over 15 years
    Is 'Cc' enough here? I don't know, I'm just asking -- it seems to me that some of the other 'C' categories may be candidates for this filter as well.
  • habnabit
    habnabit over 15 years
    You should be using a list comprehension or generator expressions, not filter + lambda. One of these will 99.9% of the time be faster. ''.join(s for s in myStr if s in string.printable)
  • devon93
    devon93 over 15 years
    The lot of you are correct, of course. I should stop trying to help people while sleep-deprived!
  • habnabit
    habnabit over 15 years
    Unless you're on python 2.3, the inner []s are redundant. "return ''.join(c for c ...)"
  • Ishbir
    Ishbir over 15 years
    you started a list comprehension which did not end in your final line. I suggest you remove the opening bracket completely.
  • Ber
    Ber over 15 years
    Thank you for pointing this out. I edited the post accordingly
  • Miles
    Miles almost 15 years
    Not quite redundant—they have different meanings (and performance characteristics), though the end result is the same.
  • Gearoid Murphy
    Gearoid Murphy about 13 years
    Should the other end of the range not be protected too?: "ord(c) <= 126"
  • Seth
    Seth over 12 years
    This code doesn't work in 2.6 or 3.2, which version does it run in?
  • Chris Morgan
    Chris Morgan over 12 years
    @AaronGallagher: 99.9% faster? From whence do you pluck that figure? The performance comparison is nowhere near that bad.
  • tripleee
    tripleee over 11 years
    But there are Unicode characters which are not printable, too.
  • Gareth Rees
    Gareth Rees over 11 years
    It's perhaps worth turning string.printable into a set before doing the filter.
  • dotancohen
    dotancohen over 11 years
    Hi William. This method seems to remove all non-ASCII characters. There are many printable non-ASCII characters in Unicode!
  • dotancohen
    dotancohen over 11 years
    This function, as published, removes half of the Hebrew characters. I get the same effect for both of the methods given.
  • dotancohen
    dotancohen almost 11 years
    This seems the most direct, straightforward method. Thanks.
  • Kashyap
    Kashyap over 10 years
    From performance perspective, wouldn't string.translate() work faster in this case? See stackoverflow.com/questions/265960/…
  • Oddthinking
    Oddthinking over 9 years
    @ChrisMorgan: Late response, but the claim is it will almost always be faster, not that it will be much, much faster.
  • Edward Falk
    Edward Falk over 9 years
    This fails for a "narrow" build of python (16-bit unicode). That's the standard build for Mac. stackoverflow.com/questions/7105874
  • chrisinmtown
    chrisinmtown about 9 years
    @ants aasma: pls tell me, how can your approach of building a character class be used to count the control chars in the string (not strip them)? I don't see any suitable method in re.
  • Dave
    Dave almost 9 years
    @Edward Falk: For the narrow build, put all_chars = (unichr(i) for i in xrange(0x110000)) in a try clause, then the same with xrange(0x10000) in the except clause -- this allows it to work with a "narrow" build (like OSX)
  • Dave
    Dave almost 9 years
    @PatrickJohnmeyer You've got a good point, and this bit me. I fixed it by checking if the unicodedata.category(c) is in a set of any of the 'Other' unicode categories (see: fileformat.info/info/unicode/category/index.htm ), ie set(['Cc','Cf','Cn','Co','Cs']). Note that I'm using English fonts, so ymmv using other fonts.
  • danmichaelo
    danmichaelo over 8 years
    Use all_chars = (unichr(i) for i in xrange(sys.maxunicode)) to avoid the narrow build error.
  • AXO
    AXO over 7 years
    For me control_chars == '\x00-\x1f\x7f-\x9f' (tested on Python 3.5.2)
  • Wcan
    Wcan over 6 years
    can i apply this on pandas dataframe, if yes please explain how
  • marsl
    marsl about 6 years
    Be aware: In Python 3, filter returns an iterator rather than a string, so use Nathan's ''.join(...) to get a string back; str(filter(...)) only gives the repr of the filter object.
  • TaiwanGrapefruitTea
    TaiwanGrapefruitTea over 5 years
    Here's my version that gives a clue about what was eliminated: ''.join( (s if s in string.printable else 'X') for s in s_string_to_print )
  • Fabrizio Miano
    Fabrizio Miano about 5 years
    it should be printable = set(['Lu', 'Ll']) shouldn't it ?
  • Ber
    Ber about 5 years
    @FabrizioMiano You are right. Or set(('Lu', 'Ll')) Thanx
  • Csaba Toth
    Csaba Toth almost 5 years
    @Ber You meant to say printable = {'Lu', 'Ll'} ?
  • Ber
    Ber almost 5 years
    @CsabaToth All three are valid and yield the same set. Yours is maybe the nicest way to specify a set literal.
  • Csaba Toth
    Csaba Toth almost 5 years
    @Ber All of them result with the same set, certain linters advise you to use the one I advised.
  • evandrix
    evandrix almost 5 years
    "".join([c if 0x21<=ord(c) and ord(c)<=0x7e else "" for c in ss])
  • Chop Labalagun
    Chop Labalagun almost 5 years
    This worked super great for me and its 1 line. thanks
  • Chop Labalagun
    Chop Labalagun almost 5 years
    for some reason this works great on windows but cant use it on linux, i had to change the f for an r but i am not sure that is the solution.
  • pir
    pir over 4 years
    This is the only answer that works for me with unicode characters. Awesome that you provided test cases!
  • pir
    pir over 4 years
    If you want to allow for line breaks, add LINE_BREAK_CHARACTERS = set(["\n", "\r"]) and the condition and not chr(i) in LINE_BREAK_CHARACTERS when building the table.
  • Anudocs
    Anudocs over 4 years
    but this removes the space in the string. How to maintain the space in the string?
  • Ber
    Ber over 4 years
    @AnubhavJhalani You can add more Unicode categories to the filter. To preserve spaces and digits in addition to letters use printable = {'Lu', 'Ll', 'Zs', 'Nd'}
  • tripleee
    tripleee over 4 years
    Sounds like your Linux Python was too old to support f-strings then. r-strings are quite different, though you could say r'[^' + re.escape(string.printable) + r']'. (I don't think re.escape() is entirely correct here, but if it works...)
  • tripleee
    tripleee over 4 years
    Actually you don't need the square brackets either then.
  • darkdragon
    darkdragon almost 4 years
    I suggest removing only control characters. See my answer for an example.
  • darkdragon
    darkdragon almost 4 years
    It would be better to use Unicode ranges (see @Ants Aasma's answer). The result would be text.translate({c:None for c in itertools.chain(range(0x00,0x20),range(0x7f,0xa0))}).
  • tdc
    tdc almost 4 years
    This is a great answer!
  • darkdragon
    darkdragon almost 4 years
    On Python3 use chr() instead of unichr() and range() instead of xrange(). Furthermore, for combination of the two iterators returned by range() one should use itertools.chain(): itertools.chain(range(), range()). For readability, I suggest to use hex numbers (thanks @AXO) in the static ranges: range(0x00,0x20) and range(0x7f,0xa0).
  • Big McLargeHuge
    Big McLargeHuge over 3 years
    You may be on to something with startswith('C') but this was far less performant in my testing than any other solution.
  • darkdragon
    darkdragon over 3 years
    big-mclargehuge: The goal of my solution was the combination of completeness and simplicity/readability. You could try to use if unicodedata.category(c)[0] != 'C' instead. Does it perform better? If you prefer execution speed over memory requirements, one can pre-compute the table as shown in stackoverflow.com/a/93029/3779655
  • LoMaPh
    LoMaPh over 3 years
    Note that tab, newline and a few more are part of the printable characters. So if you don't want to include those, you should use string.printable[:-5]
  • Bill
    Bill over 3 years
    I found that after adding 'Zs' to include spaces this method did not strip the '\xa0' character which Python does not seem to print. It is a 'non-breaking space' apparently. According to this post you need to remove this manually which is a pain.
  • the_economist
    the_economist about 3 years
    Sadly string.printable does not contain unicode characters, and thus ü or ó will not be in the output...
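
One of the comments above asks how to count control characters rather than strip them. A sketch using re.findall for the count and re.subn to strip and count in one pass; the names CONTROL_CHAR_RE, count_control_chars and strip_and_count are illustrative, and the pattern covers only the Cc ranges.

```python
import re

# Character class for the Cc (control) ranges U+0000-U+001F and U+007F-U+009F.
CONTROL_CHAR_RE = re.compile(r'[\x00-\x1f\x7f-\x9f]')


def count_control_chars(s):
    """Return how many control characters s contains."""
    return len(CONTROL_CHAR_RE.findall(s))


def strip_and_count(s):
    """Return a (cleaned string, number of characters removed) tuple."""
    return CONTROL_CHAR_RE.subn('', s)


print(count_control_chars('a\x00b\x1fc'))
print(strip_and_count('a\x00b\x1fc'))
```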