How to encode HTML non-ASCII data to UTF-8 in Python


You need to know how the input data is encoded before you can decode it. In some of your attempts, you're trying to decode it as UTF-8, but Python throws an exception because the input isn't valid UTF-8. It looks like it might be latin-1. This works for me:

>>> x = 'Ingl\xeas'
>>> print x.decode('latin1')
Inglês
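
Since the question is about ending up with UTF-8, here is a minimal sketch of the full round trip: decode from the source encoding, then encode to UTF-8. It assumes the input really is latin-1, which is only a guess from the byte values, and the variable names are just for illustration. It uses a bytes literal so it also runs on Python 3.

```python
# Raw bytes as they arrived; assumed here to be Latin-1 (ISO-8859-1).
raw = b'Ingl\xeas'

# Step 1: decode the bytes into a Unicode string using the source encoding.
text = raw.decode('latin-1')

# Step 2: encode the Unicode string as UTF-8 for output or storage.
utf8_bytes = text.encode('utf-8')

# Show the resulting byte sequence without depending on the terminal encoding.
print(repr(utf8_bytes))
```

The point is that you never convert "latin-1 bytes to UTF-8 bytes" directly; you always go through a Unicode string in the middle.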

You mention "non-ASCII HTML". If you're writing a web server script and you're getting data from an HTTP request, you should check the Content-Type header. In an ideal world, it will tell you which encoding the client is using for the data. Keep in mind that a misbehaving client may declare the wrong encoding, or none at all.
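
As a rough sketch of that check, the standard library's email.message.Message can parse the charset parameter out of a Content-Type value. The helper names (charset_from_content_type, decode_body) and the latin-1 fallback are my own assumptions, not something your framework provides, so adapt them to however you receive the header and body.

```python
from email.message import Message


def charset_from_content_type(header_value, default='latin-1'):
    """Return the charset named in a Content-Type value, or a default."""
    msg = Message()
    msg['Content-Type'] = header_value
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset() or default


def decode_body(body, content_type):
    """Decode request bytes using the declared charset, with a fallback."""
    charset = charset_from_content_type(content_type)
    try:
        return body.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # The declared charset was wrong or unknown. Latin-1 maps every
        # byte to a character, so this fallback cannot raise.
        return body.decode('latin-1')


# A client that claims UTF-8 but actually sends Latin-1 bytes.
result = decode_body(b'Ingl\xeas', 'text/html; charset=utf-8')
```

Falling back to latin-1 never fails, but it can silently produce the wrong characters if the data was in some other encoding, so treat it as a last resort rather than a detection method.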

Hope that helps!

Author: Ivan Rocha

Updated on June 04, 2022

Comments

  • Ivan Rocha
    Ivan Rocha almost 2 years

    I tried that, and I got these errors:

    >>> import re  
    >>> x = 'Ingl\xeas'  
    >>> x  
    'Ingl\xeas'  
    >>> print x  
    Ingl�s  
    >>> x.decode('utf8')  
    Traceback (most recent call last):  
        File "<stdin>", line 1, in <module>  
        File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode  
            return codecs.utf_8_decode(input, errors, True)  
    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4-5: unexpected end of data  
    >>> x.decode('utf8', 'ignore')  
    u'Ingl'  
    >>> x.decode('utf8', 'replace')  
    u'Ingl\ufffd'  
    >>> print x.decode('utf8', 'replace')  
    Ingl�  
    >>> print x.decode('utf8', 'xmlcharrefreplace')  
    Traceback (most recent call last):  
        File "<stdin>", line 1, in <module>  
        File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode  
            return codecs.utf_8_decode(input, errors, True)  
    TypeError: don't know how to handle UnicodeDecodeError in error callback  
    

    When I use the print statement, I want this output:

    >>> print x  
    u'Inglês'  
    

    Any help is welcome.

  • Mike Graham
    Mike Graham about 14 years
    Python 3 has two string types, just like Python 2. 3's str is 2's unicode with trivial modifications. 3's bytes is 2's str with moderate modifications.
  • Tim Pietzcker
    Tim Pietzcker about 14 years
    @Daniel: Not in the interactive shell.
  • Daniel Stutzbach
    Daniel Stutzbach about 14 years
    It does for me. I guess it depends on how the installation is set up? I get: UnicodeEncodeError: 'ascii' codec can't encode character '\xea' in position 4: ordinal not in range(128)
  • Tim Pietzcker
    Tim Pietzcker about 14 years
    Oh, it might have to do with the local environment. I'm on Windows, therefore the interactive shell's encoding is Windows-1252. Under Linux, it might be UTF-8. Will edit my post.
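
The difference described in the last few comments comes down to the encoding the console reports. A minimal sketch of a workaround, assuming you would rather see a '?' than get a UnicodeEncodeError; safe_for_console is a hypothetical helper name, not a standard function:

```python
import sys


def safe_for_console(text, encoding=None):
    """Return text with characters the console cannot show replaced by '?'."""
    # Fall back to ASCII when the stream does not report an encoding.
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'ascii'
    # Round-trip through the console encoding; unencodable characters
    # become '?' instead of raising UnicodeEncodeError.
    return text.encode(encoding, 'replace').decode(encoding)


print(safe_for_console(u'Ingl\xeas'))
```

On a console that reports Windows-1252 or UTF-8 the text comes through unchanged; on one that reports ASCII the accented character is replaced.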