Converting to safe unicode in python


Solution 1

What is the original encoding? I'm assuming cp1252, based on pixelbeat's answer. In that case, you can do:

>>> orig # Byte string, encoded in cp1252
'Fabulous home on one of Decatur\x92s most' 

>>> uni = orig.decode('cp1252')
>>> uni # Unicode string
u'Fabulous home on one of Decatur\u2019s most'

>>> s = uni.encode('utf8')  
>>> s # Correct byte string encoded in utf-8
'Fabulous home on one of Decatur\xe2\x80\x99s most'
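If the incoming data is a mix of UTF-8 and cp1252, one possible approach is to try UTF-8 first and fall back to cp1252 only when that fails. The sketch below assumes this ordering is acceptable for your data; the helper name `to_unicode` and its `fallback` parameter are hypothetical, not part of any library. Byte literals are written with a `b` prefix so the snippet runs on both Python 2.7 and Python 3.

```python
def to_unicode(raw, fallback="cp1252"):
    """Decode a byte string, preferring UTF-8 (hypothetical helper)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is cp1252.
        # "replace" guards against the few bytes cp1252 leaves undefined.
        return raw.decode(fallback, "replace")


orig = b"Fabulous home on one of Decatur\x92s most"  # cp1252 bytes
uni = to_unicode(orig)            # Unicode string with U+2019
utf8_bytes = uni.encode("utf-8")  # byte string safe to store as UTF-8
```

Trying UTF-8 first is a heuristic rather than a guarantee: it relies on cp1252 text with high bytes rarely happening to form a valid UTF-8 sequence.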

Solution 2

0x92 is the right single curly quote (U+2019) in the Windows cp1252 encoding.

\xEF\xBF\xBD is the UTF-8 encoding of the Unicode replacement character, U+FFFD (which errors="replace" inserted in place of the cp1252 byte that is not valid UTF-8).
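A small sketch of how those three bytes come about, assuming the content really is the cp1252 string from the question. Decoding it as UTF-8 with the "replace" error handler swaps the stray byte for U+FFFD, and re-encoding that character produces the sequence shown in the MySQL error.

```python
orig = b"Fabulous home on one of Decatur\x92s most"  # cp1252 bytes

# Decoding as UTF-8 with "replace" turns the invalid byte into U+FFFD.
replaced = orig.decode("utf-8", "replace")

# Re-encoding U+FFFD as UTF-8 gives the bytes EF BF BD.
reencoded = replaced.encode("utf-8")
```

The original apostrophe is lost at the decode step, which is why decoding with the correct codec is preferable to using "replace".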

So it looks like your database is not accepting the valid UTF8 data?

Two options:

1. Perhaps you should be using unicode(content, "cp1252").
2. If you want to insert UTF-8 into the DB, then you'll need to configure it appropriately. I'll leave that answer to others more knowledgeable.

Solution 3

The "Fabulous..." string doesn't look like UTF-8: 0x92 falls in the continuation-byte range (0x80 to 0xBF), so in valid UTF-8 it could only appear after a lead byte, as part of a multi-byte character. However, in that string it appears on its own (apparently representing an apostrophe).
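The reasoning above can be checked with a short sketch. The helper `is_continuation_byte` is a hypothetical name introduced for illustration; it tests whether the top two bits of a byte value are `10`, which is how UTF-8 marks continuation bytes. The snippet also asks the decoder where it gives up, using the `start` attribute of `UnicodeDecodeError`.

```python
def is_continuation_byte(value):
    """True if an integer byte value has the bit pattern 10xxxxxx."""
    return (value & 0xC0) == 0x80


orig = b"Fabulous home on one of Decatur\x92s most"

try:
    orig.decode("utf-8")
    bad_position = None  # the whole string was valid UTF-8
except UnicodeDecodeError as exc:
    bad_position = exc.start  # index of the first offending byte
```

Here `is_continuation_byte(0x92)` is true while the preceding byte, the ASCII letter `r`, is not a lead byte, so the decoder rejects the string at that position.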

Author by

Jesse Shieh

Updated on June 04, 2022

Comments

  • Jesse Shieh
    I'm dealing with unknown data and trying to insert into a MySQL database using Python/Django. I'm getting some errors that I don't quite understand and am looking for some help. Here is the error.

    Incorrect string value: '\xEF\xBF\xBDs m...'
    

    My guess is that the string is not being properly converted to unicode? Here is my code for unicode conversion.

    s = unicode(content, "utf-8", errors="replace")
    

    Without the above unicode conversion, the error I get is

    'utf8' codec can't decode byte 0x92 in position 31: unexpected code byte. You passed in 'Fabulous home on one of Decatur\x92s most
    

    Any help is appreciated!