Saving UTF-8 texts with json.dumps as UTF-8, not as \u escape sequences
Solution 1
Use the ensure_ascii=False switch to json.dumps(), then encode the value to UTF-8 manually:
>>> json_string = json.dumps("ברי צקלה", ensure_ascii=False).encode('utf8')
>>> json_string
b'"\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94"'
>>> print(json_string.decode())
"ברי צקלה"
If you are writing to a file, just use json.dump() and leave it to the file object to encode:
with open('filename', 'w', encoding='utf8') as json_file:
    json.dump("ברי צקלה", json_file, ensure_ascii=False)
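The round trip works symmetrically: pass the same encoding to open() on the read side and json.load() gets decoded text. A minimal sketch (the file path here is arbitrary, chosen for the example):

```python
import json
import os
import tempfile

# Hypothetical path, just for illustration.
path = os.path.join(tempfile.gettempdir(), 'example.json')

# Write UTF-8 JSON without \u escapes...
with open(path, 'w', encoding='utf8') as json_file:
    json.dump("ברי צקלה", json_file, ensure_ascii=False)

# ...and read it back; the file object decodes for us.
with open(path, encoding='utf8') as json_file:
    data = json.load(json_file)

print(data)  # ברי צקלה
```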
Caveats for Python 2
For Python 2, there are some more caveats to take into account. If you are writing this to a file, you can use io.open() instead of open() to produce a file object that encodes Unicode values for you as you write, then use json.dump() to write to that file:
import io

with io.open('filename', 'w', encoding='utf8') as json_file:
    json.dump(u"ברי צקלה", json_file, ensure_ascii=False)
Do note that there is a bug in the json module where the ensure_ascii=False flag can produce a mix of unicode and str objects. The workaround for Python 2 then is:
import io

with io.open('filename', 'w', encoding='utf8') as json_file:
    data = json.dumps(u"ברי צקלה", ensure_ascii=False)
    # unicode(data) auto-decodes data to unicode if it is a str
    json_file.write(unicode(data))
In Python 2, when using byte strings (type str) encoded to UTF-8, make sure to also set the encoding keyword:
>>> d={ 1: "ברי צקלה", 2: u"ברי צקלה" }
>>> d
{1: '\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94', 2: u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'}
>>> s=json.dumps(d, ensure_ascii=False, encoding='utf8')
>>> s
u'{"1": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4", "2": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"}'
>>> json.loads(s)['1']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> json.loads(s)['2']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> print json.loads(s)['1']
ברי צקלה
>>> print json.loads(s)['2']
ברי צקלה
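On Python 3 the encoding keyword was removed from json.dumps(), and byte strings are rejected outright, so the equivalent approach is to decode any UTF-8 bytes to str before serializing. A sketch of that translation (the dict contents mirror the Python 2 example above):

```python
import json

# Byte strings, as produced in the Python 2 example above.
raw = {1: "ברי צקלה".encode('utf8')}

# json.dumps() on Python 3 has no encoding= keyword and raises
# TypeError on bytes values; decode them to str first.
decoded = {key: value.decode('utf8') for key, value in raw.items()}

s = json.dumps(decoded, ensure_ascii=False)
print(s)  # {"1": "ברי צקלה"}
```

Note that the integer key 1 is coerced to the JSON string "1", exactly as in the Python 2 session above.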
Solution 2
To write to a file:
import codecs
import json
with codecs.open('your_file.txt', 'w', encoding='utf-8') as f:
    json.dump({"message": "xin chào việt nam"}, f, ensure_ascii=False)
To print to stdout:
import json
print(json.dumps({"message":"xin chào việt nam"}, ensure_ascii=False))
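If stdout's encoding is not UTF-8 (for example, piped output on some platforms), print() can still raise UnicodeEncodeError. One workaround, sketched below, is to encode yourself and write the bytes to the underlying binary buffer:

```python
import json
import sys

payload = json.dumps({"message": "xin chào việt nam"}, ensure_ascii=False)

# sys.stdout.buffer is the binary layer under the text stream; writing
# pre-encoded UTF-8 bytes bypasses stdout's own (possibly non-UTF-8) codec.
# Note: buffer may be absent if sys.stdout has been replaced by a wrapper.
sys.stdout.buffer.write(payload.encode('utf8') + b"\n")
```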
Solution 3
Pieters' Python 2 workaround fails on an edge case:
d = {u'keyword': u'bad credit \xe7redit cards'}
with io.open('filename', 'w', encoding='utf8') as json_file:
    data = json.dumps(d, ensure_ascii=False).decode('utf8')
    try:
        json_file.write(data)
    except TypeError:
        # Decode data to Unicode first
        json_file.write(data.decode('utf8'))

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 25: ordinal not in range(128)
It was crashing on the .decode('utf8') part of line 3. I fixed the problem by making the program much simpler: I avoided that step as well as the special-casing of ASCII:
with io.open('filename', 'w', encoding='utf8') as json_file:
    data = json.dumps(d, ensure_ascii=False, encoding='utf8')
    json_file.write(unicode(data))
cat filename
{"keyword": "bad credit çredit cards"}
Solution 4
UPDATE: This is the wrong answer, but it's still useful to understand why it's wrong. See comments.
How about unicode-escape?
>>> d = {1: "ברי צקלה", 2: u"ברי צקלה"}
>>> json_str = json.dumps(d).decode('unicode-escape').encode('utf8')
>>> print json_str
{"1": "ברי צקלה", "2": "ברי צקלה"}
Solution 5
As of Python 3.7, the following code works fine:
from json import dumps
result = {"symbol": "ƒ"}
json_string = dumps(result, sort_keys=True, indent=2, ensure_ascii=False)
print(json_string)
Output:
{
  "symbol": "ƒ"
}
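Both the escaped and the raw form are valid JSON and decode back to the same Python object, so ensure_ascii=False changes only readability, not meaning. A quick sketch:

```python
import json

# Default: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps({"symbol": "ƒ"})
# With ensure_ascii=False: the character is emitted as-is.
raw = json.dumps({"symbol": "ƒ"}, ensure_ascii=False)

# Either serialization round-trips to the identical dict.
assert json.loads(escaped) == json.loads(raw) == {"symbol": "ƒ"}
```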
Berry Tsakala
Updated on December 10, 2021

Comments
-
Berry Tsakala almost 2 years
Sample code:
>>> import json
>>> json_string = json.dumps("ברי צקלה")
>>> print(json_string)
"\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"
The problem: it's not human readable. My (smart) users want to verify or even edit text files with JSON dumps (and I'd rather not use XML). Is there a way to serialize objects into UTF-8 JSON strings (instead of \uXXXX)?
-
jfs almost 10 years: No. Don't do it. Modifying the default character encoding has nothing to do with json's ensure_ascii=False. Provide a minimal complete code example if you think otherwise.
-
Martijn Pieters over 9 years: You only get this exception if you either feed in non-ASCII byte strings (e.g. not Unicode values) or try to combine the resulting JSON value (a Unicode string) with a non-ASCII byte string. Setting the default encoding to UTF-8 is essentially masking an underlying problem where you are not managing your string data properly.
-
Martijn Pieters almost 9 years: The 'edge case' was simply a dumb untested error on my part. Your unicode(data) approach is the better option rather than using exception handling. Note that the encoding='utf8' keyword argument has nothing to do with the output that json.dumps() produces; it is used for decoding str input the function receives.
-
jfs over 8 years: @MartijnPieters: or simpler: open('filename', 'wb').write(json.dumps(d, ensure_ascii=False).encode('utf8')). It works whether dumps returns an (ASCII-only) str or a unicode object.
-
Martijn Pieters over 8 years: @J.F.Sebastian: right, because str.encode('utf8') decodes implicitly first. But so does unicode(data), if given a str object. :-) Using io.open() gives you more options though, including using a codec that writes a BOM when you are following the JSON data with something else.
-
jfs over 8 years: @MartijnPieters: the .encode('utf8')-based variant works on both Python 2 and 3 (the same code). There is no unicode on Python 3. Unrelated: JSON files should not use a BOM (though a conforming JSON parser may ignore a BOM, see errata 3983).
-
jfs over 8 years: unicode-escape is not necessary: you could use json.dumps(d, ensure_ascii=False).encode('utf8') instead. And it is not guaranteed that json uses exactly the same rules as the unicode-escape codec in Python in all cases, i.e., the result might or might not be the same in some corner case. The downvote is for an unnecessary and possibly wrong conversion. Unrelated: print json_str works only for UTF-8 locales or if the PYTHONIOENCODING envvar specifies UTF-8 here (print Unicode instead).
-
Martijn Pieters over 8 years: Another issue: any double quotes in string values will lose their escaping, so this'll result in broken JSON output.
-
Max L over 7 years: adding encoding='utf8' to json.dumps solves the problem. P.S. I have a Cyrillic text to dump.
-
Gank over 7 years: error in Python 3: AttributeError: 'str' object has no attribute 'decode'
-
Worker over 7 years: unicode-escape works fine! I would accept this answer as the correct one.
-
Alex over 6 years: SyntaxError: Non-ASCII character '\xc3' in file json-utf8.py on line 5, but no encoding declared; see python.org/dev/peps/pep-0263 for details
-
Karim Sonbol over 5 years: Thank you! I didn't realize it was that simple. You only need to be careful if the data you are converting to JSON is untrusted user input.
-
turingtested almost 5 years: @jfs No, json.dumps(d, ensure_ascii=False).encode('utf8') is not working, for me at least. I'm getting a UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position ... error. The unicode-escape variant works fine however.
-
jfs almost 5 years: @turingtested the error is likely in your other code. It is hard to say without a minimal complete code example that reproduces the issue.
-
Berry Tsakala over 4 years: also in Python 3.6 (just verified).
-
AMC over 3 years: fh.close(): fh is undefined.
-
Chandan Sharma over 3 years: It's corrected now. It would be f.close()
-
igorkf over 3 years: Only worked for me using the codecs library. Thanks!
-
AdamAL almost 3 years: The roundtrip encode/decode doesn't seem to be necessary. Just setting ensure_ascii=False (as per this answer) seems to be enough.
-
Martijn Pieters almost 3 years: @AdamAL please read my answer more thoroughly: there is no round trip in this answer, apart from a decode call that's only there to demonstrate that the bytes value indeed contains UTF-8 encoded data. The second code snippet in my answer writes directly to a file, only setting ensure_ascii=False. Note: I strongly recommend against using the codecs.open() function; the library predates io and the stream implementations have a lot of unresolved issues.
-
Martijn Pieters almost 3 years: @igorkf that would be extremely surprising if only codecs.open() worked where the built-in open() failed. Are you using Python 2 perhaps?
-
igorkf almost 3 years: It was a long time ago, but I was using Python 3.7 or 3.8
-
tripleee over 2 years: @Alex That's stackoverflow.com/questions/10589620/…
-
tripleee over 2 years: That's silly; use the library's built-in feature ensure_ascii=False instead of rolling your own. (But understand that saving JSON as bare UTF-8 can introduce interoperability problems, especially on Windows.)
-
bluu over 2 years: Thanks for your answer; even though it's wrong in OP's case, it definitely pointed me in the right direction for serializing JSON for consumption by Postgres' COPY FROM STDIN command (this was driving me nuts!!)
-
Constantine Kurbatov almost 2 years: ensure_ascii=False works like a charm in my case. My use: json.dumps(unicode_raw_dict, indent=2, ensure_ascii=False)
-
-
Mitzi over 1 year: @tripleee it's not silly; it's the only solution that gives the exact result, i.e. encoding equivalent to a file write with UTF-8.