Parsing unicode input using python json.loads

python django json unicode

62,469

Solution 1

Typecasting the string into a unicode string using 'latin-1' fixed the error:

UnicodeDecodeError: 'utf16' codec can't decode byte 0x38 in position 6: truncated data

Fixed code:

import json

ustr_to_load = unicode(str_to_load, 'latin-1')

json.loads(ustr_to_load)

And then the error is not thrown.
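
The snippet above is Python 2. On Python 3 the unicode built-in no longer exists, so a rough equivalent is to decode the byte string explicitly before parsing. This is only a minimal sketch, assuming the input really is Latin-1 encoded; raw_bytes is a made-up sample, not the OP's data.

```python
import json

# Hypothetical sample: JSON bytes containing a Latin-1 encoded "½" (byte 0xBD).
raw_bytes = b'{"note": "3 \xbd Bathroom"}'

# Decode explicitly, then pass a text string to json.loads.
text = raw_bytes.decode("latin-1")
parsed = json.loads(text)
print(parsed["note"])
```

Keep in mind that Latin-1 maps every possible byte to some character, so this decode never raises. It makes the error go away, but the resulting characters are wrong if the data was actually in another encoding.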

Solution 2

The OP clarifies (in a comment!)...:

Source data is huge unicode encoded string

Then you have to know which of the many unicode encodings it uses -- clearly not 'utf-16', since that failed, but there are so many others -- 'utf-8', 'iso-8859-15', and so forth. You either try them all until one works, or print repr(str_to_load[:80]) and paste what it shows as an edit of your question, so we can guess on your behalf!-).
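
The "try them all until one works" approach could look roughly like the Python 3 sketch below. The function name loads_with_candidates, the candidate list and the sample bytes are all assumptions made for illustration; the right candidates depend on where your data comes from.

```python
import json

def loads_with_candidates(raw_bytes, encodings=("utf-8", "utf-16", "cp1252")):
    """Try each candidate encoding in order; return (encoding, parsed JSON)."""
    for encoding in encodings:
        try:
            return encoding, json.loads(raw_bytes.decode(encoding))
        except ValueError:
            # UnicodeDecodeError and json.JSONDecodeError both subclass ValueError,
            # so this covers "cannot decode" and "decoded to something that is not JSON".
            continue
    raise ValueError("none of the candidate encodings worked: %r" % (encodings,))

# Hypothetical sample with a single non-ASCII byte (0xBD).
sample = b'{"note": "3 \xbd Bathroom"}'
encoding, parsed = loads_with_candidates(sample)
print(encoding, parsed["note"])
```

The order matters: strict encodings such as utf-8 should come first, and single-byte encodings that accept almost any input should come last, since they rarely fail even when they are the wrong guess.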

Solution 3

The simplest way I have found is

import simplejson as json

that way your code remains the same

json.loads(str_to_load)

reference: https://simplejson.readthedocs.org/en/latest/

Solution 4

With django you can use SimpleJSON and use loads instead of just load.

from django.utils import simplejson

simplejson.loads(str_to_load, "utf-8")


Author by

Software Enthusiastic

Updated on July 09, 2022

Comments

Software Enthusiastic almost 2 years

What is the best way to load JSON Strings in Python?

I want to use json.loads to process unicode like this:

import json
json.loads(unicode_string_to_load)

I also tried supplying the 'encoding' parameter with the value 'utf-16', but the error did not go away.

Full SSCCE with error:

# -*- coding: utf-8 -*-
import json
value = '{"foo" : "bar"}'
print(json.loads(value)['foo'])     #This is correct, prints 'bar'

some_unicode = unicode("degradé")  
#last character is latin e with acute "\xc3\xa9"
value = '{"foo" : "' + some_unicode + '"}'
print(json.loads(value)['foo'])            #incorrect, throws error

Error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)

Software Enthusiastic about 14 years

It is difficult to identify particular encoding during load because source data may contain characters from various languages of the world. Is there any way to detect encoding type?
Software Enthusiastic about 14 years

str_to_load keeps on changing, utf-8 worked for some, utf-32 worked for some... but how do I auto detect it?
Software Enthusiastic about 14 years

That string is '{"successful":true, "data":[76,{"posting_id":"1753178","site_tender_id":"3188446'
Alex Martelli about 14 years

To try and guess the encoding of a byte string -- try chardet.feedparser.org . The string you show is ASCII (which is also valid utf-8 by definition, and also valid iso-8859-1, etc: ASCII is the common subset of most encodings!) so it's impossible to guess what potential non-ASCII encoding it might be in. UnicodeDecodeError messages carry the exact index of the first problematic byte, so show the repr of the 80-long byte string centered on that index when you do get an error.
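
A minimal Python 3 sketch of that last suggestion: UnicodeDecodeError exposes the index of the first problematic byte as its start attribute, so you can slice around it. The helper name context_around_error and the sample bytes are assumptions for illustration.

```python
def context_around_error(raw_bytes, encoding, width=80):
    """Return the repr of up to `width` bytes around the first undecodable byte, or None."""
    try:
        raw_bytes.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte the codec could not decode.
        start = max(exc.start - width // 2, 0)
        return repr(raw_bytes[start:start + width])
    return None  # decoded cleanly with this encoding

# Hypothetical sample: byte 0xBD is not valid UTF-8 on its own.
sample = b'{"note": "3 \xbd Bathroom"}'
print(context_around_error(sample, "utf-8"))
```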
Software Enthusiastic about 14 years

When I read the entire string, I found unicode characters, have a look at it in the next string... "Lucaya, Grand Bahama; 4 Bedroom, 3 \xbd Bathroom"
Amit Patil about 14 years

The encoding of the string depends on where you got it from. That string is probably one of ISO-8859-1 or Windows code page 1252. If your string is coming from a form submission from a web page, it will be in the same encoding as that web page. You really want to be using UTF-8 if you have any say in the matter. You can also avoid all charset problems by getting the JSON encoder to write non-ASCII characters using the JavaScript \u escape; Python's json.dump does this by default but JavaScript's JSON.stringify does not.
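
To illustrate the \u escape point on the Python side, here is a short Python 3 sketch using the standard json module, whose ensure_ascii option defaults to True. The sample value is made up.

```python
import json

# "degrad\u00e9" is "degradé"; the escape keeps this source file pure ASCII.
encoded = json.dumps({"foo": "degrad\u00e9"})
print(encoded)

# The output is plain ASCII, so it survives any ASCII-compatible transport,
# and it decodes back to the original text.
encoded.encode("ascii")
decoded = json.loads(encoded)
print(decoded["foo"] == "degrad\u00e9")
```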
Alex Martelli about 14 years

\xbd in ISO-8859-x for several values of x (and other encodings such as CP-1252) is the single character representing the fraction 1/2 so your encoding's likely to be among this group.
Alex Martelli about 14 years

BTW, latin-1 is the old name for iso-8859-1 and these days you're much more likely to see iso-8859-15 -- the most visible difference is that the latter includes the Euro sign (it replaces eight characters in all, among them the fraction signs). If you decode with -1 and the string was encoded with -15 it will mostly be OK but Euro signs, and those few other characters, will look very peculiar when you print or show them.
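
A quick way to see how these single-byte encodings disagree is to decode the same byte with each of them. The Python 3 sketch below uses codec names from the standard library; the variable names are only illustrative.

```python
# Byte 0xA4 is where iso-8859-15 places the Euro sign.
euro_byte = b"\xa4"
print(euro_byte.decode("iso-8859-15"))  # Euro sign
print(euro_byte.decode("latin-1"))      # generic currency sign instead

# Byte 0xBD is the one from the "3 \xbd Bathroom" string.
half_byte = b"\xbd"
print(half_byte.decode("latin-1"))      # fraction one half
print(half_byte.decode("cp1252"))       # fraction one half
print(half_byte.decode("iso-8859-15"))  # a different character (oe ligature)
```

For the string in question, latin-1 or cp1252 gives the reading that makes sense next to "Bathroom".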
Griff about 8 years

This no longer works in Django, as it now uses the default json module that comes with Python.