Convert GBK to utf8 string in python
Solution 1
In Python 2, try this to repair your unicode string:
>>> s.encode('latin-1').decode('gbk')
u"<script language=javascript>alert('\u8bf7\u8f93\u5165\u6b63\u786e\u9a8c\u8bc1\u7801,\u8c22\u8c22!');location='index.asp';</script></script>"
Then you can encode to UTF-8 as you wish:
>>> s.encode('latin-1').decode('gbk').encode('utf-8')
"<script language=javascript>alert('\xe8\xaf\xb7\xe8\xbe\x93\xe5\x85\xa5\xe6\xad\xa3\xe7\xa1\xae\xe9\xaa\x8c\xe8\xaf\x81\xe7\xa0\x81,\xe8\xb0\xa2\xe8\xb0\xa2!');location='index.asp';</script></script>"
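For readers on Python 3, the same round trip can be sketched as follows. This is only an illustration under the assumption that the text arrived as a str whose characters are really GBK bytes in disguise; the variable names raw, text and utf8 are made up for the example.

```python
# Mojibake: GBK bytes that were wrongly stored as code points U+0080..U+00FF.
s = "\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!"

# Latin-1 maps code points 0..255 to the identical byte values,
# so this recovers the original GBK bytes unchanged.
raw = s.encode("latin-1")

# Now decode the bytes with the codec they were really written in.
text = raw.decode("gbk")

# Finally, produce UTF-8 bytes.
utf8 = text.encode("utf-8")

print(text)
print(utf8)
```

In Python 3 the result of encode() is a bytes object, so utf8 prints with a b'...' prefix rather than as a plain string.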
Solution 2
You are mixing apples and oranges. The GBK-encoded string is not a Unicode string and should hence not end up in a u'...' string.
This is the correct way to do it in Python 2.
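# The GBK data is a plain byte string (no u prefix); decode it to get unicode.
g = ('\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,'
     '\xd0\xbb\xd0\xbb!').decode('gbk')
# Parentheses allow the expression to continue across lines.
s = (u"<script language=javascript>alert('" + g +
     u"');location='index.asp';</script></script>")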
Notice how the initializer for g, which is passed to .decode('gbk'), is not represented as a Unicode string, but as a plain byte string.
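In Python 3 the distinction is enforced by the type system, which may make the point easier to see. A minimal sketch, assuming Python 3 and using the hypothetical names gbk_bytes and utf8_bytes: the GBK data is a bytes literal (b'...'), and only bytes have a .decode() method.

```python
# GBK-encoded data lives in a bytes object, never in a str.
gbk_bytes = (b"\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,"
             b"\xd0\xbb\xd0\xbb!")

# Decoding turns bytes into text (str).
g = gbk_bytes.decode("gbk")

# Text can be concatenated with other text freely.
s = ("<script language=javascript>alert('" + g +
     "');location='index.asp';</script></script>")

# Encoding turns text back into bytes, this time as UTF-8.
utf8_bytes = s.encode("utf-8")

print(s)
```

Trying to call .decode() on a str in Python 3 raises AttributeError, so the mistake from the question cannot happen silently there.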
See also http://nedbatchelder.com/text/unipain.html
Author: amazingjxq
Updated on July 27, 2022
Comments
-
amazingjxq over 1 year
I have a string.
s = u"<script language=javascript>alert('\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!');location='index.asp';</script></script>"
How can I translate s into a UTF-8 string? I have tried s.decode('gbk').encode('utf-8'), but Python reports an error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 35-50: ordinal not in range(128)
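A note on where that error comes from: in Python 2, calling .decode() on a unicode object first encodes it implicitly with the default ASCII codec, and that hidden step is what fails. The sketch below makes the hidden step explicit. It assumes Python 3, where str has no .decode(), so the explicit encode("ascii") call stands in for Python 2's implicit one; the name error is made up for the example.

```python
# The string from the question: GBK byte values stored as code points.
s = ("<script language=javascript>alert('"
     "\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!"
     "');location='index.asp';</script></script>")

# Python 2 performs this ASCII encode behind the scenes before decoding.
try:
    s.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc  # keep a reference; 'exc' is cleared after the except block

print(error)
```

The failure is an encode error even though the call in the question was decode, which is the telltale sign of the implicit conversion.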
-
Ivaylo about 10 years: @amazingjxq: In the second method, note that the string passed to .decode('gbk') is a plain byte string, not a u'' string.
-
tripleee over 9 years: The detour over latin-1 is shocking. Yes, it's a workaround, but that is really not how you do it.
-
s16h over 9 years: Could the down-voter please explain why he/she down-voted so I can learn too? Thanks.
-
Mark Ransom over 7 years: @tripleee Not shocking at all once you know the mechanics behind it. Unicode uses Latin-1 as its base for the first 256 code points, so if you need those code points as bytes it's a 1:1 mapping. Obviously it's better to get the decoding done properly in the first place, but sometimes with mojibake that's impossible.
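The 1:1 mapping described in this comment can be checked directly. A small sketch, assuming Python 3; the names all_bytes and as_text are only illustrative.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Decoding as Latin-1 yields one character per byte...
as_text = all_bytes.decode("latin-1")

# ...whose code point equals the byte value,
same_values = [ord(ch) for ch in as_text] == list(range(256))

# and encoding back gives exactly the original bytes.
round_trips = as_text.encode("latin-1") == all_bytes

print(same_values, round_trips)
```

Because the mapping never loses information, encode('latin-1') is a safe way to get the underlying bytes back out of a string that was decoded with the wrong codec, as long as every character in it is below U+0100.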