How to encode HTML non-ASCII data to UTF-8 in Python
You need to know how the input data is encoded before you decode it. In some of your attempts, you're trying to decode it as UTF-8, but Python throws an exception because the input isn't valid UTF-8. It looks like it might be latin-1. This works for me:
>>> x = 'Ingl\xeas'
>>> print x.decode('latin1')
Inglês
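To get from there to UTF-8, which is what the question title asks for, the decoded text still has to be encoded again. A minimal sketch, assuming the input really is latin-1 (that is a guess from the sample byte, not something the data itself tells you); the variable names are only illustrative:

```python
# Bytes as they arrived; latin-1 is an assumption about the source encoding.
raw = b'Ingl\xeas'

# Step 1: bytes -> Unicode text, using the encoding the data is actually in.
text = raw.decode('latin-1')

# Step 2: Unicode text -> UTF-8 bytes, ready to write to a file or a response.
utf8_bytes = text.encode('utf-8')

print(repr(utf8_bytes))
```

The single byte `\xea` becomes the two-byte sequence `\xc3\xaa` in the UTF-8 result. The snippet is written so it should run unchanged on Python 2.6+ and Python 3.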
You mention "non-ASCII HTML". If you're writing a web server script and you're getting data from an HTTP request, you should check the Content-Type header. In an ideal world, it will tell you which encoding the client is using for the data. Keep in mind that the client may declare the wrong encoding, or none at all.
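As a rough sketch of reading the charset out of a Content-Type value: the helper names `charset_from_content_type` and `decode_body` are made up for this example, and the latin-1 fallback is an assumption you should adjust for your application. It leans on the standard library's `email.message.Message` to parse the header parameters.

```python
from email.message import Message


def charset_from_content_type(header_value, default='latin-1'):
    """Return the lowercased charset parameter of a Content-Type value.

    Falls back to `default` (an assumption, not a standard) when the
    header carries no charset parameter.
    """
    msg = Message()
    msg['Content-Type'] = header_value
    return msg.get_content_charset(default)


def decode_body(raw_bytes, content_type):
    """Decode a request body using the charset the client declared."""
    charset = charset_from_content_type(content_type)
    # 'replace' keeps going on bad bytes, since the client may be wrong.
    return raw_bytes.decode(charset, 'replace')


text = decode_body(b'Ingl\xeas', 'text/html; charset=ISO-8859-1')
print(repr(text))
```

If the client names a charset Python doesn't know, `decode` raises `LookupError`, so real code would probably want to catch that and fall back as well.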
Hope that helps!
Comments
-
Ivan Rocha almost 2 years
I tried to do that, and I got these errors:
>>> import re
>>> x = 'Ingl\xeas'
>>> x
'Ingl\xeas'
>>> print x
Ingl�s
>>> x.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4-5: unexpected end of data
>>> x.decode('utf8', 'ignore')
u'Ingl'
>>> x.decode('utf8', 'replace')
u'Ingl\ufffd'
>>> print x.decode('utf8', 'replace')
Ingl�
>>> print x.decode('utf8', 'xmlcharrefreplace')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
TypeError: don't know how to handle UnicodeDecodeError in error callback
When I use the print statement, I want this:
>>> print x
u'Inglês'
Any help is welcome.
-
Mike Graham about 14 years
Python 3 has two string types, just like Python 2. 3's str is 2's unicode with trivial modifications. 3's bytes is 2's str with moderate modifications.
-
Tim Pietzcker about 14 years
@Daniel: Not in the interactive shell.
-
Daniel Stutzbach about 14 years
It does for me. I guess it depends on how the installation is set up? I get:
UnicodeEncodeError: 'ascii' codec can't encode character '\xea' in position 4: ordinal not in range(128)
-
Tim Pietzcker about 14 years
Oh, it might have to do with the local environment. I'm on Windows, therefore the interactive shell's encoding is Windows-1252. Under Linux, it might be UTF-8. Will edit my post.