Using json.dumps with ensure_ascii=True

Writing up thanks to @user2357112

The first thing to understand is that JSON has no binary representation: all strings must be valid Unicode code points. If you are trying to json.dumps raw bytes, you might be doing something wrong.
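
For instance, on Python 2 (which the unicode/str distinction here implies), json.dumps treats every str as text encoded with its encoding parameter (UTF-8 by default), so bytes that are not valid UTF-8 fail before any JSON is produced. A minimal sketch, assuming the standard-library json module on Python 2.7:

    # -*- coding: utf-8 -*-
    import json

    # A str holding valid UTF-8 is decoded to text and serialized normally.
    print(repr(json.dumps('\xe6\xb1\x89')))   # '"\\u6c49"'  (the character 汉)

    # A str that is not valid UTF-8 cannot be interpreted as text at all.
    try:
        json.dumps('\x99')
    except UnicodeDecodeError as exc:
        print(exc)   # 'utf8' codec can't decode byte 0x99 ...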

Then check the comments from @user2357112 (reproduced in the Comments section below), which make me assume that:

  • When you are encoding text into JSON and all your strings are unicode objects, it is fine to use ensure_ascii=False, but it might actually make more sense to leave it True and decode the resulting str; see the sketch after this list. (As per the specification, dumps doesn't guarantee unicode back, though it does return unicode if you pass it unicode.)
  • If you are working with str objects, passing ensure_ascii=False will prevent json from escaping your characters into \uXXXX sequences. You might think you want that, but if you then try to read the output in a browser, for example, weird things might happen.
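
A minimal sketch of the "leave it True and decode" route, assuming Python 2.7's standard json module (the dict key is just an example):

    # -*- coding: utf-8 -*-
    import json

    # ensure_ascii=True (the default) always returns an ASCII-only str,
    # which can be promoted to unicode afterwards without surprises.
    as_str = json.dumps({u"lang": u"汉语"})    # '{"lang": "\\u6c49\\u8bed"}'
    as_unicode = as_str.decode('ascii')

    # ensure_ascii=False keeps the characters unescaped; with all-unicode
    # input the result happens to be unicode, but the docs do not promise
    # a unicode return value in general.
    raw = json.dumps({u"lang": u"汉语"}, ensure_ascii=False)
    print(repr(raw))                           # u'{"lang": "\u6c49\u8bed"}'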

As for how ensure_ascii impacts the result, this table might help.

+------------------------+--------------+------------------------------+
|         Input          | ensure_ascii |            Output            |
+------------------------+--------------+------------------------------+
| u"汉语"                 | True         | '"\\u6c49\\u8bed"'           |
| u"汉语"                 | False        | u'"\u6c49\u8bed"'            |
| u"汉语".encode("utf-8") | True         | '"\\u6c49\\u8bed"'           |
| u"汉语".encode("utf-8") | False        | '"\xe6\xb1\x89\xe8\xaf\xad"' |
+------------------------+--------------+------------------------------+

Note that the last value is the UTF-8 encoding of the unicode text, i.e. raw bytes, which might not be parseable by other JSON decoders.
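
The table rows can be reproduced in a Python 2.7 interpreter; a sketch, assuming the standard-library json module:

    # -*- coding: utf-8 -*-
    import json

    print(repr(json.dumps(u"汉语")))                                      # '"\\u6c49\\u8bed"'
    print(repr(json.dumps(u"汉语", ensure_ascii=False)))                  # u'"\u6c49\u8bed"'
    print(repr(json.dumps(u"汉语".encode("utf-8"))))                      # '"\\u6c49\\u8bed"'
    print(repr(json.dumps(u"汉语".encode("utf-8"), ensure_ascii=False)))  # '"\xe6\xb1\x89\xe8\xaf\xad"'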

Moreover, if you mix types (an array containing both unicode and str) and use ensure_ascii=False, you can get a UnicodeDecodeError while encoding into JSON (mind-bending), because the module will try to return a unicode object but won't be able to convert the str into unicode using the default encoding (ASCII).
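
For example, a sketch of the mixed-type pitfall, assuming Python 2.7's bundled json module:

    # -*- coding: utf-8 -*-
    import json

    try:
        # One unicode element plus one non-ASCII str element: the encoder
        # joins a unicode chunk with a byte chunk and implicitly decodes
        # the bytes with the default ASCII codec, which fails.
        json.dumps([u"汉语", u"汉语".encode("utf-8")], ensure_ascii=False)
    except UnicodeDecodeError as exc:
        print(exc)   # 'ascii' codec can't decode byte 0xe6 ...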

Mario Corchero

C++ Software Engineer at Bloomberg LP. My wider grasp resides in C++, OO programming languages and databases, although I am still truly a newbie! As a hobby I have developed some web projects using mainly HTML5 and jQuery with some PHP, .NET or Django on the server side. I really love software design, architecture and C++'s crazy caveats!

Updated on June 03, 2022

Comments

  • Mario Corchero almost 2 years

    When using json.dumps, the default for ensure_ascii is True, but I see myself continuously setting it to False, as:

    • If I work with unicode I need to pass ensure_ascii=False or I'll get str back
    • If I work with str I need to pass ensure_ascii=False so my chars don't get converted to \uXXXX escapes (encoded within a str)

    In which scenarios would you want it to be True? What is the use case for that option?

    From the Docs:

    If ensure_ascii is true (the default), all non-ASCII characters in the output are escaped with \uXXXX sequences, and the results are str instances consisting of ASCII characters only.

    What is the benefit of it?

    • user2357112 over 7 years
      "If I work with unicode I need to pass it or I'll get str back" - you might get a str back anyway. ensure_ascii=False doesn't promise that the type of the result is unicode.
    • user2357112 over 7 years
      "If I work with str I need to pass it so my chars don't get converted to unicode (encoded within a str)" - you're turning on the option that allows a unicode return value, in the hopes that it will prevent you from getting Unicode output? Are you sure you understand what this thing does?
    • Mario Corchero over 7 years
      json.dumps(u"a", ensure_ascii=False) returns u'"a"'. I assumed "the return value may be a unicode instance" meant "if you pass me unicode I give you unicode", but it might not be that. Is there a clear guide of what to pass and what to expect?
    • Mario Corchero over 7 years
      To your second comment: I am setting it to False so that if I have a str (working with bytes objects) I won't get the UnicodeDecodeError. Ex: json.dumps("\x99")
    • user2357112 over 7 years
      The JSON format has no bytestring type; all strings are Unicode, and thus json.dumps assumes that all str instances are meant to represent Unicode text, encoded through some encoding. The default encoding is UTF-8, in which '\x99' is not a valid encoded string. If you want to serialize arbitrary bytestrings to JSON, you should set encoding='latin-1' to map the bytes to Unicode code points 1:1 instead of changing the ensure_ascii setting (a sketch of this appears after the comments).
    • user2357112 over 7 years
      As for whether the return value is unicode or str with ensure_ascii=False, there are no guarantees, so you should basically just call unicode on the return value to guarantee you have a unicode object.
    • Mario Corchero over 7 years
      That is quite different from what I expected. But if it works like that, and I then have to check whether the result is unicode and decode it if necessary every time, wouldn't it just be better to leave it True and always call decode on the result? (If I want to work only with unicode in my program.)
    • user2357112 over 7 years
      The results would be different. With ensure_ascii=True, all non-ASCII characters get encoded with \u escapes to ensure that all characters in the JSON output are in the ASCII range. You should decide what value of ensure_ascii to use based on whether you want that to happen.
  • Spike almost 6 years
    Hi, "all strings should be valid unicode code points" (in JSON) and "ensuring your output is valid ASCII characters" (which are going to be written to a JSON file or elsewhere) confuse me: ASCII chars are not code points, they are binary. Am I mistaking anything? @Mario
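
To make the encoding='latin-1' suggestion from the comments concrete, here is a sketch (Python 2.7, standard json module; the byte values are arbitrary examples). Each byte maps 1:1 to a Unicode code point, so arbitrary bytes survive a JSON round trip:

    import json

    blob = '\x00\x99\xff'                             # arbitrary, non-UTF-8 bytes
    payload = json.dumps(blob, encoding='latin-1')    # '"\\u0000\\u0099\\u00ff"'
    restored = json.loads(payload).encode('latin-1')  # back to the original bytes
    assert restored == blob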