How to decode an ASCII string with backslash-x (\x) codes

You have binary data that is not ASCII encoded. The \xhh escape sequences indicate that your data is encoded with a different codec; what you are seeing is Python's repr() of the data, a representation that can be reused as a Python literal to re-create exactly the same value. This representation is very useful when debugging a program.

In other words, each \xhh escape sequence represents a single byte, and hh is the hex value of that byte. You have 4 bytes with hex values C3, A7, C3 and B5, which do not map to printable ASCII characters, so Python uses the \xhh notation instead.
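As a small illustration of the point above, the following sketch lists the bytes that fall outside the ASCII range, together with their positions. The variable names `data` and `non_ascii` are only for this example; the byte string is the one from the question, written as a bytes literal so the snippet should behave the same on Python 2 and Python 3.

```python
# The byte string from the question, written as a bytes literal.
data = b'Demais Subfun\xc3\xa7\xc3\xb5es 12'

# bytearray yields integers on both Python 2 and Python 3,
# so each byte can be compared against the ASCII limit (0x7f).
non_ascii = [(i, hex(b)) for i, b in enumerate(bytearray(data)) if b > 0x7f]

print(non_ascii)
# [(13, '0xc3'), (14, '0xa7'), (15, '0xc3'), (16, '0xb5')]
```

The first offending byte sits at index 13, which matches the position reported in the UnicodeDecodeError quoted in the comments.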

What you have instead is UTF-8 data, so decode it as such:

>>> 'Demais Subfun\xc3\xa7\xc3\xb5es 12'.decode('utf8')
u'Demais Subfun\xe7\xf5es 12'
>>> print 'Demais Subfun\xc3\xa7\xc3\xb5es 12'.decode('utf8')
Demais Subfunções 12

The C3 A7 bytes together encode U+00E7 LATIN SMALL LETTER C WITH CEDILLA, while the C3 B5 bytes encode U+00F5 LATIN SMALL LETTER O WITH TILDE.
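To check that mapping yourself, you could decode each two-byte pair on its own and look up the resulting character in the standard `unicodedata` module. This is just a sketch; the names `pairs` and `described` are made up for the example.

```python
import unicodedata

# Each pair of bytes is one UTF-8 encoded character.
pairs = [b'\xc3\xa7', b'\xc3\xb5']

described = []
for raw in pairs:
    char = raw.decode('utf8')  # two bytes become one Unicode character
    described.append('U+%04X %s' % (ord(char), unicodedata.name(char)))

for line in described:
    print(line)
# U+00E7 LATIN SMALL LETTER C WITH CEDILLA
# U+00F5 LATIN SMALL LETTER O WITH TILDE
```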

ASCII is a subset of UTF-8, which is why all the other characters appear as plain letters in the Python repr() output.
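The subset relationship can be seen in a short sketch: the pure-ASCII prefix decodes identically with either codec, while the full value only decodes as UTF-8. The names `ascii_part`, `full` and `failed_at` are illustrative only.

```python
ascii_part = b'Demais Subfun'
full = b'Demais Subfun\xc3\xa7\xc3\xb5es 12'

# Bytes in the ASCII range decode to the same text under both codecs.
same_result = ascii_part.decode('ascii') == ascii_part.decode('utf8')

# The full value contains bytes above 0x7f, so the ASCII codec rejects it.
failed_at = None
try:
    full.decode('ascii')
except UnicodeDecodeError as exc:
    failed_at = exc.start  # index of the first byte that could not be decoded

print(same_result)
print(failed_at)
```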

Author: Davoud Taghawi-Nejad

Economist working on reinforcement learning, CGE, agent-based modeling, networks and business cycles.

Updated on September 25, 2020

Comments

  • Davoud Taghawi-Nejad over 3 years

    I am trying to decode a Brazilian Portuguese text:

    'Demais Subfun\xc3\xa7\xc3\xb5es 12'

    It should be

    'Demais Subfunções 12'

    >> a.decode('unicode_escape')
    >> a.encode('unicode_escape')
    >> a.decode('ascii')
    >> a.encode('ascii')
    

    all give:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 13:
    ordinal not in range(128)
    

    On the other hand, this gives:

    >> print a.encode('utf-8')
    Demais Subfun├â┬º├â┬Áes 12
    
    >> print a
    Demais Subfunções 12
    
  • Slake over 5 years
    How can I avoid using 'print', i.e. write the decoded string to a file instead?
  • Martijn Pieters over 5 years
    @Slake: just write the decoded string to a file. Use io.open(..., encoding='...') in Python 2 to write Unicode data to a file.
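A minimal sketch of the suggestion in the last comment might look like the following. The file name `decoded.txt` and the temporary directory are assumptions made for the example; `io.open` with an `encoding` argument is available on both Python 2 and Python 3 (where it is the same function as the built-in `open`).

```python
import io
import os
import tempfile

# Decode the UTF-8 bytes to a Unicode string first.
text = b'Demais Subfun\xc3\xa7\xc3\xb5es 12'.decode('utf8')

# Hypothetical output location, chosen only for this example.
path = os.path.join(tempfile.mkdtemp(), 'decoded.txt')

# io.open encodes the Unicode string to UTF-8 when writing.
with io.open(path, 'w', encoding='utf8') as f:
    f.write(text)

# Reading the file back in binary mode shows the bytes on disk.
with io.open(path, 'rb') as f:
    written = f.read()
```

Here `written` should hold the same UTF-8 bytes you started with, because the file object handles the encoding for you.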