Unicode latin1 string encode / decode

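My guess is that the string was incorrectly lowercased at some point, which changed \xc3 to \xe3. The lowercase conversion assumed Latin-1 encoding when the data was actually UTF-8: in Latin-1, \xc3 is Ã, and its lowercase form is ã (\xe3). With \xc3 restored, the bytes decode correctly as UTF-8: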

>>> print 'gr\xc3\xa9gory'.decode('utf8')
grégory
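
The suspected corruption can be reproduced, and in simple cases reversed, with a short sketch. It uses Python 3 syntax (the snippet above is Python 2), and `simulate_corruption` and `repair_lowercased_utf8` are hypothetical helper names, not part of any library. The repair assumes that every \xe3 followed by a UTF-8 continuation byte was originally \xc3. That is wrong for text that legitimately contains three-byte sequences starting with \xe3 (Japanese kana, for example), it ignores other lead bytes that lowercasing may have shifted, and it cannot restore letter case that was lost.

```python
import re


def simulate_corruption(text):
    # Reproduce the suspected damage: UTF-8 bytes are misread as Latin-1,
    # then lowercased, which turns 0xC3 (Ã) into 0xE3 (ã).
    return text.encode("utf-8").decode("latin-1").lower()


def repair_lowercased_utf8(text):
    # Hypothetical helper: go back to the raw bytes, restore 0xC3 wherever
    # 0xE3 is followed by a continuation byte (0x80-0xBF), decode as UTF-8.
    # Raises UnicodeEncodeError if text has characters outside Latin-1.
    raw = text.encode("latin-1")
    fixed = re.sub(rb"\xe3(?=[\x80-\xbf])", b"\xc3", raw)
    return fixed.decode("utf-8")


damaged = simulate_corruption("gr\u00e9gory")
print(ascii(damaged))                    # 'gr\xe3\xa9gory', as in the question
print(repair_lowercased_utf8(damaged))   # grégory
```

If the data may contain genuine three-byte UTF-8 sequences, it is probably safer to apply the repair only to rows that are known to be damaged rather than to the whole table.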
Author by

user3203201

Updated on June 04, 2022

Comments

  • user3203201 almost 2 years

    While migrating data from an old, inconsistent MySQL database of unknown encoding to a UTF-8 PostgreSQL database using the Python (Django) ORM, I sometimes end up with wrongly encoded data.

    Target: grégory

    > a
    u'gr\xe3\xa9gory'
    
    > print a
    grã©gory
    

    I tried several decode/encode tricks without success:

    > print a.encode('utf-8').decode('latin1')
    grã©gory

    > print a.decode('latin-1')
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128)

    Even with some unicode_escape