Parsing unicode input using python json.loads
Solution 1
Typecasting the string into a unicode string using 'latin-1' fixed the error:
UnicodeDecodeError: 'utf16' codec can't decode byte 0x38 in position 6: truncated data
Fixed code:
import json
ustr_to_load = unicode(str_to_load, 'latin-1')
json.loads(ustr_to_load)
And then the error is not thrown.
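For readers on Python 3, where unicode() no longer exists, the same idea can be sketched roughly as below. The byte string is a made-up stand-in for str_to_load (an assumption, not the OP's data), and 'latin-1' is only correct if the data really was encoded that way.

```python
import json

# Hypothetical byte string standing in for str_to_load; 0xe9 is "é" in latin-1.
str_to_load = b'{"foo": "degrad\xe9"}'

# Python 3 spelling of unicode(str_to_load, 'latin-1'): bytes -> str
ustr_to_load = str_to_load.decode("latin-1")

result = json.loads(ustr_to_load)
print(result["foo"])
```

Note that 'latin-1' maps every possible byte to a character, so decoding with it never raises; that is why the error disappears, but it does not prove that the text was decoded correctly.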
Solution 2
The OP clarifies (in a comment!)...:
Source data is huge unicode encoded string
Then you have to know which of the many unicode encodings it uses -- clearly not 'utf-16', since that failed, but there are so many others -- 'utf-8', 'iso-8859-15', and so forth. You either try them all until one works, or print repr(str_to_load[:80])
and paste what it shows as an edit of your question, so we can guess on your behalf!-).
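The "try them until one works" approach could look something like the following minimal sketch (Python 3). The function name loads_with_fallback, the candidate list, and the sample bytes are all assumptions made for illustration; on failure it prints the repr of roughly 80 bytes around the offending position, which is the kind of detail worth pasting into the question.

```python
import json

def loads_with_fallback(raw, encodings=("utf-8", "latin-1")):
    """Try each candidate encoding in order; return (parsed, encoding_used)."""
    for enc in encodings:
        try:
            return json.loads(raw.decode(enc)), enc
        except UnicodeDecodeError as exc:
            # exc.start is the index of the first byte that could not be decoded.
            lo = max(0, exc.start - 40)
            print(enc, "failed at byte", exc.start, repr(raw[lo:lo + 80]))
    raise ValueError("no candidate encoding could decode the input")

# Hypothetical input: the byte 0xbd is invalid UTF-8 here, but is "½" in latin-1.
raw = b'{"desc": "3 \xbd Bathroom"}'
data, used = loads_with_fallback(raw)
print(used, data)
```

Order matters: 'latin-1' accepts any byte sequence, so it only makes sense as the last candidate, and a "successful" decode with it is still a guess.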
Solution 3
The simplest way I have found is
import simplejson as json
that way your code remains the same
json.loads(str_to_load)
reference: https://simplejson.readthedocs.org/en/latest/
Solution 4
With django you can use SimpleJSON and use loads instead of just load.
from django.utils import simplejson
simplejson.loads(str_to_load, "utf-8")
Software Enthusiastic
Updated on July 09, 2022
Comments
-
Software Enthusiastic almost 2 years
What is the best way to load JSON Strings in Python?
I want to use json.loads to process unicode like this:
import json
json.loads(unicode_string_to_load)
I also tried supplying 'encoding' parameter with value 'utf-16', but the error did not go away.
Full SSCCE with error:
# -*- coding: utf-8 -*-
import json
value = '{"foo" : "bar"}'
print(json.loads(value)['foo'])  # This is correct, prints 'bar'
some_unicode = unicode("degradé")  # last character is latin e with acute "\xc3\xa9"
value = '{"foo" : "' + some_unicode + '"}'
print(json.loads(value)['foo'])  # incorrect, throws error
Error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)
-
Software Enthusiastic about 14 years
It is difficult to identify a particular encoding during load because the source data may contain characters from various languages of the world. Is there any way to detect the encoding type?
-
Software Enthusiastic about 14 years
str_to_load keeps on changing, utf-8 worked for some, utf-32 worked for some... but how do I auto detect it?
-
Software Enthusiastic about 14 years
That string is '{"successful":true, "data":[76,{"posting_id":"1753178","site_tender_id":"3188446'
-
Alex Martelli about 14 years
To try and guess the encoding of a byte string -- try chardet.feedparser.org. The string you show is ASCII (which is also valid utf-8 by definition, and also valid iso-8859-1, etc: ASCII is the common subset of most encodings!) so it's impossible to guess what potential non-ASCII encoding it might be in. UnicodeDecodeError messages carry the exact index of the first problematic byte, so show the repr of the 80-long byte string centered on that index when you do get an error.
-
Software Enthusiastic about 14 years
When I read the entire string, I found unicode characters; have a look at the next string... "Lucaya, Grand Bahama; 4 Bedroom, 3 \xbd Bathroom"
-
Amit Patil about 14 years
The encoding of the string depends on where you got it from. That string is probably one of ISO-8859-1 or Windows code page 1252. If your string is coming from a form submission from a web page, it will be in the same encoding as that web page. You really want to be using UTF-8 if you have any say in the matter. You can also avoid all charset problems by getting the JSON encoder to write non-ASCII characters using the JavaScript \u escape; Python's json.dump does this by default but JavaScript's JSON.stringify does not.
-
Alex Martelli about 14 years
\xbd in ISO-8859-x for several values of x (and other encodings such as CP-1252) is the single character representing the fraction 1/2, so your encoding's likely to be among this group.
-
Alex Martelli about 14 years
BTW, latin-1 is the old name for iso-8859-1 and these days you're much more likely to see iso-8859-15 -- the main difference is that the latter includes the Euro sign. If you decode with -1 and the string was encoded with -15 it will mostly be OK but Euro signs will look very peculiar when you print or show them.
-
Griff about 8 years
This no longer works in Django, as it uses the default json that comes with Python.
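The suggestion in the comments to have the JSON encoder write non-ASCII characters as \u escapes can be illustrated with a short sketch (Python 3; the payload is a made-up example, not the OP's data). With the default ensure_ascii=True, json.dumps produces pure ASCII output, so the encoding of the transport no longer matters on the way back in.

```python
import json

# Hypothetical payload containing a non-ASCII character.
payload = {"foo": "degradé"}

# ensure_ascii defaults to True, so "é" is written as the escape \u00e9.
encoded = json.dumps(payload)
print(encoded)

# The escaped text is plain ASCII and round-trips to the original value.
decoded = json.loads(encoded.encode("ascii").decode("ascii"))
print(decoded == payload)
```

This only helps when you control the side that produces the JSON; it does not fix data that already arrives in an unknown byte encoding.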