json.dumps \u escaped unicode to utf8

20,536

Solution 1

You have UTF-8 JSON data:

>>> import json
>>> data = {'content': u'\u4f60\u597d'}
>>> json.dumps(data, indent=1, ensure_ascii=False)
u'{\n "content": "\u4f60\u597d"\n}'
>>> json.dumps(data, indent=1, ensure_ascii=False).encode('utf8')
'{\n "content": "\xe4\xbd\xa0\xe5\xa5\xbd"\n}'
>>> print json.dumps(data, indent=1, ensure_ascii=False).encode('utf8')
{
 "content": "你好"
}

My terminal just happens to be configured to handle UTF-8, so printing the UTF-8 bytes to my terminal produced the desired output.

However, if your terminal is not set up for such output, it is your terminal that then shows 'wrong' characters:

>>> print json.dumps(data, indent=1, ensure_ascii=False).encode('utf8').decode('latin1')
{
 "content": "ä½ å¥½"
}

Note how I decoded the data to Latin-1 to deliberately mis-read the UTF-8 bytes.

This isn't a Python problem; this is a problem with how you are handling the UTF-8 bytes in whatever tool you used to read these bytes.
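The same mis-read can be reproduced in Python 3. The following is a minimal sketch, not part of the original answer; the variable names are illustrative. It encodes the JSON text as UTF-8, decodes the bytes with the wrong codec to get the garbled text, and then reverses the mistake.

```python
import json

data = {"content": "\u4f60\u597d"}  # the two characters 你好

# Serialize without \u escapes, then encode the text to UTF-8 bytes.
utf8_bytes = json.dumps(data, ensure_ascii=False).encode("utf-8")

# Reading those bytes with the wrong codec produces mojibake.
misread = utf8_bytes.decode("latin-1")

# Latin-1 maps every byte to exactly one code point, so the mistake is reversible.
repaired = misread.encode("latin-1").decode("utf-8")

print(repr(misread))
print(repaired)
```

The bytes never change in this round trip; only the codec used to interpret them does, which is why the fix belongs in the reading tool rather than in the json.dumps call.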

Solution 2

In Python 2 this works. In Python 3, however, print shows the repr of the bytes object instead:

b'{\n "content": "\xe4\xbd\xa0\xe5\xa5\xbd"\n}'

Do not call encode('utf8'); print the str directly:

>>> print(json.dumps(data, indent=1, ensure_ascii=False))
{
 "content": "你好"
}

or use sys.stdout.buffer.write instead of print:

>>> import sys
>>> import json
>>> data = {'content': u'\u4f60\u597d'}
>>> sys.stdout.buffer.write(json.dumps(data, indent=1, ensure_ascii=False).encode('utf8') + b'\n')
{
 "content": "你好"
}

See also: Write UTF-8 to stdout, regardless of the console's encoding
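As a rough Python 3 sketch of the two points above: ensure_ascii only changes how the characters are spelled in the JSON text, and encoding to bytes is a separate step needed only when writing to a binary stream. The helper name write_json_utf8 is hypothetical, and io.BytesIO stands in for sys.stdout.buffer or a file opened in "wb" mode.

```python
import io
import json

data = {"content": "\u4f60\u597d"}

escaped = json.dumps(data)                        # ASCII only, uses \uXXXX escapes
unescaped = json.dumps(data, ensure_ascii=False)  # keeps the characters as they are

# Both forms are valid JSON and parse back to the same object.
assert json.loads(escaped) == json.loads(unescaped) == data


def write_json_utf8(obj, binary_stream):
    """Write obj as UTF-8 encoded JSON bytes to a binary stream (hypothetical helper)."""
    payload = json.dumps(obj, indent=1, ensure_ascii=False).encode("utf-8")
    binary_stream.write(payload + b"\n")


# Stand-in for sys.stdout.buffer or a file opened with mode "wb".
buffer = io.BytesIO()
write_json_utf8(data, buffer)
```

Passing sys.stdout.buffer as the stream should give the same result as the session above, independent of the encoding that print would otherwise use.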


Author: Bonk

Updated on April 30, 2021

Comments

  • Bonk
    Bonk over 2 years

    I came from this old discussion, but the solution didn't help much as my original data was encoded differently:

    My original data was already a unicode string; I need to output it as UTF-8:

    data={"content":u"\u4f60\u597d"}
    

    When I try to convert it to UTF-8:

    json.dumps(data, indent=1, ensure_ascii=False).encode("utf8")
    

    the output I get is "content": "ä½ å¥½", while the expected output is "content": "你好".

    I also tried without ensure_ascii=False, and the output becomes the plain escaped form, "content": "\u4f60\u597d".

    How can I convert the previously \u escaped json to UTF-8?

    • Martijn Pieters
      Martijn Pieters over 7 years
      You are reading your UTF-8 data in the wrong codec. You have UTF-8, but are decoding it as Latin-1 or CP1252. In other words, this is not a Python problem.
    • David Grayson
      David Grayson over 7 years
      Yeah, I was unable to reproduce this problem in the Python 3 REPL.
  • Bonk
    Bonk over 7 years
    Thank you, it was my browser that was acting up. I thought the ä½ å¥½ was an encoding error on the Python end. Turns out it's the output :)
  • Martijn Pieters
    Martijn Pieters over 7 years
    @Bonk: perhaps you need to set a proper response header? Content-Type: application/json should be enough (as the JSON standard specifies that UTF is the default, with a BOM at the start making it possible to distinguish UTF-8 from UTF-16 and UTF-32), or include the charset explicitly with Content-Type: application/json; charset=utf-8. Without a Content-Type header, or with one set to a text/.. mimetype, the browser may well default to Latin-1.
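To illustrate the header suggestion in the last comment, here is a minimal, framework-independent sketch. The function name build_json_response is hypothetical; a real web framework would have its own way of setting headers, so treat this only as an outline of which values to send.

```python
import json


def build_json_response(obj):
    """Return (headers, body) for a UTF-8 JSON HTTP response (hypothetical helper)."""
    body = json.dumps(obj, ensure_ascii=False).encode("utf-8")
    headers = {
        # Explicit charset so the client does not fall back to Latin-1.
        "Content-Type": "application/json; charset=utf-8",
        # Length is counted in bytes, not characters.
        "Content-Length": str(len(body)),
    }
    return headers, body


headers, body = build_json_response({"content": "\u4f60\u597d"})
```

With the charset declared, a browser has no reason to guess the codec, so the UTF-8 bytes should be displayed as 你好 rather than ä½ å¥½.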