Using json.dumps with ensure_ascii=True
Written up thanks to @user2357112.
The first thing to understand is that JSON has no binary representation: all strings must consist of valid Unicode code points. If you are trying to json.dumps raw bytes, you might be doing something wrong.
Then check:
- the json docs
- some background on why ensure_ascii works the way it does: issue13769

The ensure_ascii option controls two things: whether the output is escaped down to ASCII-only characters (non-ASCII ones become \uXXXX sequences), and, when set to False, whether the function is allowed to return a unicode object instead of a str.
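The escaping half is easiest to see on Python 3, where json.dumps always returns str and ensure_ascii only controls escaping; a minimal sketch (not the Python 2 return-type behavior discussed below):

```python
import json

# ensure_ascii=True (the default) escapes every non-ASCII character
# as a \uXXXX sequence, so the output is pure ASCII.
print(json.dumps("汉语"))                      # prints "\u6c49\u8bed"

# ensure_ascii=False keeps the characters as-is.
print(json.dumps("汉语", ensure_ascii=False))  # prints "汉语"
```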
Which makes me assume that:
- When you are encoding text into JSON and all your strings are unicode, it is fine to use ensure_ascii=False, but it might actually make more sense to leave it True and decode the resulting str. (Per the specification, dumps doesn't guarantee unicode back, though in practice it does return unicode if you pass unicode in.)
- If you are working with str objects, ensure_ascii=False will prevent json from escaping your characters into \uXXXX sequences. You might think you want that, but if you then try to read the output in a browser, for example, weird things might happen.
Regarding how ensure_ascii affects the result, here is a table that might help (Python 2):
+------------------------+--------------+------------------------------+
| Input                  | ensure_ascii | Output                       |
+------------------------+--------------+------------------------------+
| u"汉语"                 | True         | '"\\u6c49\\u8bed"'           |
| u"汉语"                 | False        | u'"\u6c49\u8bed"'            |
| u"汉语".encode("utf-8") | True         | '"\\u6c49\\u8bed"'           |
| u"汉语".encode("utf-8") | False        | '"\xe6\xb1\x89\xe8\xaf\xad"' |
+------------------------+--------------+------------------------------+
Note that the last output contains unicode encoded into raw UTF-8 bytes, which other JSON decoders might not be able to parse.
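On Python 3 the last two rows of the table are not possible at all: json.dumps rejects bytes outright, so the raw-UTF-8-in-the-output pitfall cannot arise there. A quick sketch:

```python
import json

try:
    json.dumps("汉语".encode("utf-8"))  # bytes, not str
except TypeError as exc:
    # Python 3 refuses to guess an encoding for raw bytes.
    print("rejected:", exc)
```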
Moreover, if you mix types (e.g. an array containing both unicode and str) and use ensure_ascii=False, you can get a UnicodeDecodeError while encoding into JSON (mind-bending!), because the module will try to return a unicode object but won't be able to convert the str elements into unicode using the default encoding (ascii).
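On Python 3 this mixed-type trap disappears (every str is already Unicode), and both settings of ensure_ascii round-trip to the same data; a small sketch:

```python
import json

data = ["汉语", "ascii only"]
escaped = json.dumps(data, ensure_ascii=True)   # ASCII-only JSON text
raw = json.dumps(data, ensure_ascii=False)      # keeps 汉语 literally

# The two encodings differ as text but decode to identical values.
assert escaped != raw
assert json.loads(escaped) == json.loads(raw) == data
```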
Mario Corchero
Updated on June 03, 2022

Comments
-
Mario Corchero almost 2 years
When using json.dumps, the default for ensure_ascii is True, but I see myself continuously setting it to False because:
- If I work with unicode, I need to pass it or I'll get str back
- If I work with str, I need to pass it so my chars don't get converted to unicode (encoded within a str)
In which scenarios would you want it to be True? What is the use case for that option? From the docs:
"If ensure_ascii is true (the default), all non-ASCII characters in the output are escaped with \uXXXX sequences, and the results are str instances consisting of ASCII characters only."
What is the benefit of it?
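One concrete benefit of the default: ASCII-only output survives any byte encoding, so it is safe to write to channels that cannot handle non-ASCII text. A sketch (Python 3 shown for brevity):

```python
import json

out = json.dumps({"name": "汉语"})  # ensure_ascii=True by default

# The result contains only ASCII characters, so even the most
# restrictive codec can encode it without errors.
payload = out.encode("ascii")
print(payload)
```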
-
user2357112 over 7 years
"If I work with unicode I need to pass it or I'll get str back" - you might get a str back anyway. ensure_ascii=False doesn't promise that the type of the result is unicode.
-
user2357112 over 7 years
"If I work with str I need to pass it so my chars don't get converted to unicode (encoded within a str)" - you're turning on the option that allows a unicode return value, in the hopes that it will prevent you from getting Unicode output? Are you sure you understand what this thing does?
-
Mario Corchero over 7 years
>>> json.dumps(u"a", ensure_ascii=False)
u'"a"'
I assumed "the return value may be a unicode instance" meant "if you pass me unicode, I give you unicode", but it might not be that. Is there a clear guide to what to pass and what to expect?
-
Mario Corchero over 7 years
To your second comment: I am setting it to False so that if I have str (working with byte strings) I won't get a UnicodeDecodeError. Ex: json.dumps("\x99")
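For contrast, on Python 3 this exact call succeeds, because str is already Unicode there: "\x99" is simply code point U+0099, which the default ensure_ascii=True escapes rather than failing on. A sketch:

```python
import json

# On Python 3, "\x99" is code point U+0099 (no decoding step involved),
# so ensure_ascii=True just escapes it.
print(json.dumps("\x99"))  # prints "\u0099"
```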
-
user2357112 over 7 years
The JSON format has no bytestring type; all strings are Unicode, and thus json.dumps assumes that all str instances are meant to represent Unicode text, encoded through some encoding. The default encoding is UTF-8, in which '\x99' is not a valid encoded string. If you want to serialize arbitrary bytestrings to JSON, you should set encoding='latin-1' to map the bytes to Unicode code points 1-1 instead of changing the ensure_ascii setting.
-
user2357112 over 7 years
As for whether the return value is unicode or str with ensure_ascii=False, there are no guarantees, so you should basically just call unicode on the return value to guarantee you have a unicode object.
-
Mario Corchero over 7 years
That is quite different from what I expected. But if it works like that, and I then have to check whether the result is unicode and decode it if necessary every time, wouldn't it just be better to leave it True and always call decode on the result? (If I want to work only with unicode in my program.)
-
user2357112 over 7 years
The results would be different. With ensure_ascii=True, all non-ASCII characters get encoded with \u escapes to ensure that all characters in the JSON output are in the ASCII range. You should decide what value of ensure_ascii to use based on whether you want that to happen.
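The latin-1 trick suggested earlier (for smuggling arbitrary bytes through JSON) can be sketched on Python 3, where the decode step is explicit:

```python
import json

raw = b"\x99\xff\x00"            # arbitrary bytes, not valid UTF-8
as_text = raw.decode("latin-1")  # maps each byte 1:1 to U+0000..U+00FF
encoded = json.dumps(as_text)    # now serializable as a JSON string

# Reversing the mapping recovers the original bytes exactly.
assert json.loads(encoded).encode("latin-1") == raw
```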
-
Spike almost 6 years
Hi, "all strings should be valid unicode points" (in JSON) and "ensuring your output is valid ascii characters" (which get written to a JSON file or elsewhere) confuse me: ASCII chars are not code points, they are binary. Am I mistaken about anything? @Mario