How to encode HTML non-ASCII data to UTF-8 in Python

python unicode utf-8

10,288

You need to know how the input data is encoded before you can decode it. In some of your attempts, you're trying to decode it as UTF-8, but Python throws an exception because the input isn't valid UTF-8. It looks like it might be latin-1. This works for me:

>>> x = 'Ingl\xeas'
>>> print x.decode('latin1')
Inglês
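If the goal is to end up with UTF-8 bytes, the decoded text still has to be encoded again. Here is a minimal sketch of that round trip, written for Python 3 (where the byte string needs a b'' prefix) and assuming the input really is latin-1. The helper name to_utf8 is made up for illustration:

```python
# Python 3 sketch. Assumption: the raw bytes are Latin-1 (ISO-8859-1).
raw = b'Ingl\xeas'

def to_utf8(data, source_encoding='latin-1'):
    """Decode bytes from a known source encoding, then re-encode as UTF-8."""
    text = data.decode(source_encoding)   # bytes -> str (Unicode text)
    return text.encode('utf-8')           # str -> UTF-8 bytes

utf8_bytes = to_utf8(raw)
print(utf8_bytes)  # the single byte \xea becomes the two bytes \xc3\xaa
```

In Python 2 the same two steps would be x.decode('latin1').encode('utf8').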

You mention "non-ASCII HTML". If you're writing a web server script and you're getting data from an HTTP request, you should check the Content-Type header. In an ideal world, it will tell you which encoding the client is using for the data. Keep in mind that the client may be working incorrectly.
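As a rough illustration of checking the header, the sketch below pulls the charset parameter out of a Content-Type value using the standard library's email.message.Message (Python 3). The function names and the latin-1 fallback are my own assumptions, not something your framework necessarily provides:

```python
from email.message import Message

def charset_from_content_type(header_value, default=None):
    """Return the lowercased charset parameter of a Content-Type value, or default."""
    msg = Message()
    msg['Content-Type'] = header_value
    return msg.get_content_charset(default)

def decode_body(body, header_value, fallback='latin-1'):
    """Decode body with the declared charset; use fallback if that fails."""
    charset = charset_from_content_type(header_value, fallback)
    try:
        return body.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name, or the client declared the wrong encoding.
        return body.decode(fallback)

text = decode_body(b'Ingl\xeas', 'text/html; charset=ISO-8859-1')
```

The fallback is there because the declared charset can be missing or wrong. latin-1 never raises on decode, but it may silently produce the wrong characters, so treat it as a last resort.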

Hope that helps!


Author by

Ivan Rocha


Updated on June 04, 2022

Comments

Ivan Rocha almost 2 years

I tried to do that, and I got these errors:

>>> import re  
>>> x = 'Ingl\xeas'  
>>> x  
'Ingl\xeas'  
>>> print x  
Ingl�s  
>>> x.decode('utf8')  
Traceback (most recent call last):  
    File "<stdin>", line 1, in <module>  
    File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode  
        return codecs.utf_8_decode(input, errors, True)  
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4-5: unexpected end of data  
>>> x.decode('utf8', 'ignore')  
u'Ingl'  
>>> x.decode('utf8', 'replace')  
u'Ingl\ufffd'  
>>> print x.decode('utf8', 'replace')  
Ingl�  
>>> print x.decode('utf8', 'xmlcharrefreplace')  
Traceback (most recent call last):  
    File "<stdin>", line 1, in <module>  
    File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode  
        return codecs.utf_8_decode(input, errors, True)  
TypeError: don't know how to handle UnicodeDecodeError in error callback

When I use the print statement, I want this output:

>>> print x  
u'Inglês'

Any help is welcome.

Mike Graham about 14 years

Python 3 has two string types, just like Python 2. 3's str is 2's unicode with trivial modifications. 3's bytes is 2's str with moderate modifications.

Tim Pietzcker about 14 years

@Daniel: Not in the interactive shell.

Daniel Stutzbach about 14 years

It does for me. I guess it depends on how the installation is set up? I get: UnicodeEncodeError: 'ascii' codec can't encode character '\xea' in position 4: ordinal not in range(128)

Tim Pietzcker about 14 years

Oh, it might have to do with the local environment. I'm on Windows, therefore the interactive shell's encoding is Windows-1252. Under Linux, it might be UTF-8. Will edit my post.