json.dumps \u escaped unicode to utf8
Solution 1
You have UTF-8 JSON data:
>>> import json
>>> data = {'content': u'\u4f60\u597d'}
>>> json.dumps(data, indent=1, ensure_ascii=False)
u'{\n "content": "\u4f60\u597d"\n}'
>>> json.dumps(data, indent=1, ensure_ascii=False).encode('utf8')
'{\n "content": "\xe4\xbd\xa0\xe5\xa5\xbd"\n}'
>>> print json.dumps(data, indent=1, ensure_ascii=False).encode('utf8')
{
"content": "你好"
}
My terminal just happens to be configured to handle UTF-8, so printing the UTF-8 bytes to my terminal produced the desired output.
However, if your terminal is not set up for such output, it is your terminal that then shows 'wrong' characters:
>>> print json.dumps(data, indent=1, ensure_ascii=False).encode('utf8').decode('latin1')
{
"content": "ä½ å¥½"
}
Note how I decoded the data to Latin-1 to deliberately mis-read the UTF-8 bytes.
This isn't a Python problem; this is a problem with how you are handling the UTF-8 bytes in whatever tool you used to read these bytes.
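The mis-decoding described above can be reproduced, and undone, in a few lines. This is a minimal Python 3 sketch rather than part of the original session, and the variable names are illustrative only:

```python
import json

data = {'content': '\u4f60\u597d'}

# Serialize without \u escapes, then encode the text to UTF-8 bytes.
utf8_bytes = json.dumps(data, ensure_ascii=False).encode('utf8')

# Decoding those bytes with the wrong codec produces mojibake.
misread = utf8_bytes.decode('latin1')

# Latin-1 maps every byte to exactly one character, so the mistake is
# lossless and can be reversed by undoing the two steps in order.
repaired = misread.encode('latin1').decode('utf8')

print(json.loads(repaired) == data)
```

The bytes themselves are never damaged; only the codec chosen by the reader differs.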
Solution 2
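The above works in Python 2. In Python 3, however, print shows the repr of the encoded bytes object instead:
>>> print(json.dumps(data, indent=1, ensure_ascii=False).encode('utf8'))
b'{\n "content": "\xe4\xbd\xa0\xe5\xa5\xbd"\n}'
In Python 3, do not use encode('utf8'):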
>>> print(json.dumps(data, indent=1, ensure_ascii=False))
{
"content": "你好"
}
Or use sys.stdout.buffer.write instead of print:
>>> import sys
>>> import json
>>> data = {'content': u'\u4f60\u597d'}
>>> sys.stdout.buffer.write(json.dumps(data, indent=1,
...     ensure_ascii=False).encode('utf8') + b'\n')
{
"content": "你好"
}
See also: Write UTF-8 to stdout, regardless of the console's encoding
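If the JSON text was already written with \u escapes (the default behaviour of json.dumps), one way to get UTF-8 is to parse it and serialize it again. The sketch below assumes the escaped text is valid JSON; reencode_json is a hypothetical helper name, not something from the answers above:

```python
import json


def reencode_json(escaped_text):
    """Parse ASCII-escaped JSON text and return it as UTF-8 encoded bytes."""
    obj = json.loads(escaped_text)
    return json.dumps(obj, ensure_ascii=False).encode('utf8')


# json.dumps escapes non-ASCII characters by default.
escaped = json.dumps({'content': '\u4f60\u597d'})
utf8_bytes = reencode_json(escaped)
```

Note that re-serializing normalizes whitespace, so the result may not match the input byte for byte apart from the escapes.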
Bonk
Updated on April 30, 2021
Comments
-
Bonk over 2 years
I came here from this old discussion, but the solution there didn't help much because my original data was encoded differently:
My original data is already a unicode string, and I need to output it as UTF-8:
data={"content":u"\u4f60\u597d"}
When I try to convert it to UTF-8:
json.dumps(data, indent=1, ensure_ascii=False).encode("utf8")
the output I get is
"content": "ä½ å¥½"
and the expected output should be
"content": "你好"
I also tried without ensure_ascii=False, and the output becomes the \u-escaped form:
"content": "\u4f60\u597d"
How can I convert the previously \u-escaped JSON to UTF-8?
-
Martijn Pieters over 7 years
You are reading your UTF-8 data in the wrong codec. You have UTF-8, but are decoding it as Latin-1 or CP1252. In other words, this is not a Python problem.
-
David Grayson over 7 years
Yeah, I was unable to reproduce this problem in the Python 3 REPL.
-
Bonk over 7 years
Thank you, it was my browser that was acting up. I thought the ä½ å¥½ was an encoding error on the Python end. Turns out it's the output :)
-
Martijn Pieters over 7 years
@Bonk: perhaps you need to set a proper response header? Content-Type: application/json should be enough (as the JSON standard specifies that UTF is the default, with a BOM at the start making it possible to distinguish UTF-8 from UTF-16 and UTF-32), or include the charset explicitly with Content-Type: application/json; charset=utf-8. Without a Content-Type header, or with one set to a text/.. mimetype, the browser may well default to Latin-1.
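To illustrate the header suggestion, here is a minimal sketch of a WSGI application that sends UTF-8 JSON with an explicit charset. The asker's actual server setup is not shown in the thread, so the app function and its payload are assumptions made for the example:

```python
import json


def app(environ, start_response):
    """Minimal WSGI app returning UTF-8 JSON with an explicit charset."""
    payload = {'content': '\u4f60\u597d'}
    # Encode explicitly so the body is UTF-8 regardless of the locale.
    body = json.dumps(payload, ensure_ascii=False).encode('utf-8')
    headers = [
        ('Content-Type', 'application/json; charset=utf-8'),
        ('Content-Length', str(len(body))),
    ]
    start_response('200 OK', headers)
    return [body]
```

For a quick local check, it could be served with the standard library's wsgiref.simple_server.make_server('', 8000, app).serve_forever().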