Convert unicode with utf-8 string as content to str
If you have a unicode value with UTF-8 bytes, encode to Latin-1 to preserve the 'bytes':
content = content.encode('latin1')
because the Unicode code points U+0000 to U+00FF map one-to-one onto the bytes of the Latin-1 encoding; encoding to Latin-1 therefore turns each character back into the byte with the same value.
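As a quick check of that one-to-one claim, here is a minimal sketch. It is written for Python 3 (an assumption; the session in this answer is Python 2), where `str` is the Unicode type and `bytes` is the byte string. It encodes every code point from U+0000 to U+00FF and compares the result with the single byte of the same value:

```python
# Python 3 sketch: each code point U+0000..U+00FF should encode to the
# single Latin-1 byte with the same numeric value, and decode back unchanged.
code_points = range(256)

one_to_one = all(chr(i).encode('latin1') == bytes([i]) for i in code_points)
round_trips = all(bytes([i]).decode('latin1') == chr(i) for i in code_points)

print(one_to_one, round_trips)
```

Code points above U+00FF have no Latin-1 byte, so this trick only applies to strings that were produced by decoding raw bytes as Latin-1 in the first place.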
For your example this gives me:
>>> content = u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
>>> content.encode('latin1')
'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
>>> content.encode('latin1').decode('utf8')
u'\u5c42\u53e0\u6837\u5f0f\u8868'
>>> print content.encode('latin1').decode('utf8')
层叠样式表
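For readers on Python 3, the same round trip can be sketched as follows. This is an illustration under the assumption that the mis-decoded text arrives as a `str`; `latin1_bytes` is a hypothetical helper name, not part of PyQuery or requests:

```python
def latin1_bytes(text):
    """Return the raw bytes hidden inside a mis-decoded string (hypothetical helper)."""
    return text.encode('latin1')

# The value from the question, written as a Python 3 str.
mis_decoded = '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'

raw = latin1_bytes(mis_decoded)   # the original UTF-8 bytes
fixed = raw.decode('utf8')        # the text those bytes were meant to represent

print(raw)
print(fixed)

# Re-encoding the fixed text as UTF-8 gives back the same bytes, so the
# decode step is only needed when readable text (not bytes) is the goal.
print(fixed.encode('utf8') == raw)
```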
PyQuery uses either requests or urllib to retrieve the HTML, and in the case of requests, uses the .text attribute of the response. That attribute decodes the response body using only the encoding declared in the Content-Type header; if the header declares none, it falls back to latin-1 for text responses (and HTML is a text response). A UTF-8 page served without a declared charset therefore comes back decoded as latin-1. You can override this by passing in an encoding argument:
dom = PyQuery('http://zh.wikipedia.org/w/index.php',
              {'title': 'CSS', 'printable': 'yes', 'variant': 'zh-cn'},
              encoding='utf8')
at which point you'd not have to re-encode at all.
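To see why the override helps, here is a small stdlib-only simulation of the decoding step described above, written for Python 3. The functions `charset_from_header` and `decode_body` are hypothetical stand-ins that mimic the described behaviour; they are not the actual requests or PyQuery code:

```python
def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    for part in content_type.split(';')[1:]:
        name, _, value = part.strip().partition('=')
        if name.lower() == 'charset' and value:
            return value.strip('"\'')
    return None

def decode_body(raw, content_type, encoding=None):
    """Decode raw bytes: explicit override first, then header charset, then latin-1."""
    if encoding is None:
        encoding = charset_from_header(content_type)
    if encoding is None:
        encoding = 'latin1'  # assumed fallback for text responses
    return raw.decode(encoding)

raw = '层叠样式表'.encode('utf8')  # bytes as the server would send them

garbled = decode_body(raw, 'text/html')                      # no charset declared
overridden = decode_body(raw, 'text/html', encoding='utf8')  # explicit override
declared = decode_body(raw, 'text/html; charset=UTF-8')      # charset in header

print(repr(garbled))
print(overridden)
print(declared)
```

Without a declared charset the simulation yields the same string of Latin-1 code points as in the question; with either the override or a declared charset the text decodes correctly.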
Comments
-
wong2 about 3 years
I'm using pyquery to parse a page:
dom = PyQuery('http://zh.wikipedia.org/w/index.php', {'title': 'CSS', 'printable': 'yes', 'variant': 'zh-cn'})
content = dom('#mw-content-text > p').eq(0).text()
but what I get in content is a unicode string with UTF-8 encoded content: u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8...'
How can I convert it to str without losing the content? To make it clear, I want
content == '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
not
content == u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
-
spatel over 10 years
I had the same problem, but your solution only works from the REPL, not from a script. I had to change it to this: content.encode('latin1').decode('utf8').encode('utf8')
-
Martijn Pieters over 10 years
Encoding to UTF-8 is fine if that is what you need in the end. But you can skip the decode then too!
-
spatel over 10 years
Well I'll be, I should have tried that so I could save myself some self-inflicted trauma to the head. I have to admit though, it still confuses me.
-
Jacky almost 8 years
Thanks! Been tortured by the same issue for one day!
-
Rajasankar about 5 years
Thanks a lot for this workaround. I was able to convert Tamil unicode to a readable format.