How to replace invalid unicode characters in a string in Python?

python string unicode character-encoding

Solution 1

Thanks for your comments. With their help I was able to implement a better solution:

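    import codecs

    def check_and_repair(s):  # function name is illustrative
        try:
            codecs.encode(s, "utf-8")
            return (True, s, None)
        except UnicodeError as e:
            # Characters that cannot be encoded are replaced with "?".
            ret = codecs.decode(codecs.encode(s, "utf-8", "replace"), "utf-8")
            return (False, ret, e)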

Please share any improvements on that solution. Thank you!
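
For illustration, here is a minimal sketch of how this approach behaves in Python 3 on a string containing a lone surrogate, which is how Python 3 represents undecodable bytes in OS-supplied file names. The function name `encode_or_replace` and the sample string are assumptions, not part of the original answer. Note that when encoding, the "replace" handler substitutes an ASCII question mark rather than U+FFFD.

```python
import codecs


def encode_or_replace(s):
    """Hypothetical wrapper: return (is_valid, text, error)."""
    try:
        codecs.encode(s, "utf-8")
        return (True, s, None)
    except UnicodeError as e:
        # When encoding, "replace" turns each unencodable character into "?".
        cleaned = codecs.decode(codecs.encode(s, "utf-8", "replace"), "utf-8")
        return (False, cleaned, e)


# Assumed sample: "\udce9" is a lone surrogate, which UTF-8 cannot encode.
ok, text, err = encode_or_replace("caf\udce9.txt")
print(ok, text)                    # False caf?.txt
print(encode_or_replace("plain"))  # (True, 'plain', None)
```

The returned error object can be logged to see which position in the string was affected.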

Solution 2

If you have a bytestring (undecoded data), use the 'replace' error handler. For example, if your data is (mostly) UTF-8 encoded, then you could use:

    decoded_unicode = bytestring.decode('utf-8', 'replace')

and a U+FFFD � REPLACEMENT CHARACTER will be inserted for any bytes that can't be decoded.

If you wanted to use a different replacement character, it is easy enough to replace these afterwards:

    decoded_unicode = decoded_unicode.replace('\ufffd', '#')

Demo:

    >>> bytestring = b'F\xc3\xb8\xc3\xb6\xbbB\xc3\xa5r'
    >>> bytestring.decode('utf8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xbb in position 5: invalid start byte
    >>> bytestring.decode('utf8', 'replace')
    'Føö�Bår'
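
If the data has already been decoded by Python 3 (as with file names returned by the os module, which are decoded with the surrogateescape error handler), the original bytes can usually be recovered first and then decoded again with 'replace'. The sketch below assumes the underlying encoding is UTF-8; the helper name `displayable` is hypothetical.

```python
def displayable(name, encoding="utf-8"):
    """Hypothetical helper: make a surrogate-escaped str safe to display."""
    # surrogateescape turns U+DC80..U+DCFF back into the raw bytes 0x80..0xFF.
    raw = name.encode(encoding, "surrogateescape")
    # Decoding with "replace" inserts U+FFFD where bytes cannot be decoded.
    return raw.decode(encoding, "replace")


# Simulate what Python 3 hands back for a badly encoded file name.
bytestring = b'F\xc3\xb8\xc3\xb6\xbbB\xc3\xa5r'
name = bytestring.decode("utf-8", "surrogateescape")

print(ascii(name))               # 'F\xf8\xf6\udcbbB\xe5r'
print(ascii(displayable(name)))  # 'F\xf8\xf6\ufffdB\xe5r'
```

ascii() is used for printing here only so that the output does not depend on the terminal's encoding.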

Solution 3

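The right way to do it (at least in Python 2) is to use unicodedata.normalize:

    unicodedata.normalize('NFKD', text).encode('utf-8', 'ignore')

Calling decode('utf-8', 'ignore') on the text instead will just raise an exception for non-ASCII input, because Python 2 first tries to encode the unicode object implicitly as ASCII.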

Solution 4

You have not given an example, so I made one up to answer your question.

    x='This is a cat which looks good ðŸ˜Š'
    print x
    x.replace('ðŸ˜Š','')

The output is:

    This is a cat which looks good ðŸ˜Š
    'This is a cat which looks good '

Author by

Regis May

Updated on July 07, 2022

Comments

Regis May almost 2 years

As far as I know, Python's concept is that a string contains only valid characters, but in my case the OS delivers path names with invalid encodings that I have to deal with. So I end up with strings that contain non-Unicode characters.

In order to correct these problems I need to display these strings somehow. Unfortunately I cannot print them because they contain non-Unicode characters. Is there an elegant way to replace these characters so that I at least get some idea of the content of the string?

My idea would be to process these strings character by character and check whether each character is actually valid Unicode. In case of an invalid character I would like to substitute a certain Unicode symbol. But how can I do this? Using codecs does not seem suitable for that purpose: I already have a string, returned by the operating system, and not a byte array. Converting a string to a byte array seems to involve encoding, which will of course fail in my case. So it seems that I'm stuck.

Do you have any tips on how to create such a replacement string?
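
The character-by-character idea described above can be sketched in Python 3 as follows. The sketch assumes that the "invalid" characters are lone surrogates (code points U+D800 to U+DFFF), which is how Python 3 represents undecodable bytes in OS-supplied names; the function name and the default placeholder are illustrative choices.

```python
def replace_surrogates(s, placeholder="\ufffd"):
    """Hypothetical helper: swap lone surrogates for a placeholder."""
    return "".join(
        placeholder if 0xD800 <= ord(ch) <= 0xDFFF else ch
        for ch in s
    )


# Assumed sample: a file name whose byte 0xE9 could not be decoded.
broken = "caf\udce9.txt"
print(replace_surrogates(broken, "#"))  # caf#.txt
```

The result contains no surrogates, so it can be printed or encoded as UTF-8 without raising an error.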