Parsing unicode input using python json.loads

Solution 1

Typecasting the string into a unicode string using 'latin-1' fixed the error:

UnicodeDecodeError: 'utf16' codec can't decode byte 0x38 in 
position 6: truncated data

Fixed code:

import json

ustr_to_load = unicode(str_to_load, 'latin-1')

json.loads(ustr_to_load)

And then the error is not thrown.
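
For readers on Python 3, where `unicode()` no longer exists, the same idea can be sketched by decoding the byte string explicitly before parsing. The sample bytes and the name `raw_bytes` are assumptions made for illustration. Note that 'latin-1' maps every byte to some character, so this decode never raises, but the characters are only correct if the data really is Latin-1.

```python
import json

# Hypothetical sample: JSON bytes containing the Latin-1 byte 0xbd.
raw_bytes = b'{"rooms": "3 \xbd Bathroom"}'

# Decode the bytes to text explicitly, then parse the text.
text = raw_bytes.decode('latin-1')
data = json.loads(text)

print(data['rooms'])
```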

Solution 2

The OP clarifies (in a comment!)...:

Source data is huge unicode encoded string

Then you have to know which of the many unicode encodings it uses -- clearly not 'utf-16', since that failed, but there are so many others -- 'utf-8', 'iso-8859-15', and so forth. You either try them all until one works, or print repr(str_to_load[:80]) and paste what it shows as an edit of your question, so we can guess on your behalf!-).
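
The "try them all until one works" approach could look roughly like the sketch below, written for Python 3. The function name `loads_with_fallback` and the default candidate list are hypothetical; put 'latin-1' last, because it accepts any byte sequence and would otherwise mask the other candidates.

```python
import json

def loads_with_fallback(raw, encodings=('utf-8', 'latin-1')):
    """Try each candidate encoding in order.

    Returns a (parsed_json, encoding_used) tuple, or raises ValueError
    if no candidate yields valid JSON.
    """
    for enc in encodings:
        try:
            return json.loads(raw.decode(enc)), enc
        except ValueError:
            # UnicodeDecodeError and json.JSONDecodeError both subclass ValueError.
            continue
    raise ValueError('none of the candidate encodings produced valid JSON')

# Hypothetical input; inspect the raw bytes first, as suggested above.
raw = b'{"rooms": "3 \xbd Bathroom"}'
print(repr(raw[:80]))
parsed, used = loads_with_fallback(raw)
print(used, parsed['rooms'])
```

A successful decode only means the bytes were legal in that encoding, not that the encoding is the right one, so it is still worth eyeballing the result.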

Solution 3

The simplest way I have found is

import simplejson as json

that way your code remains the same

json.loads(str_to_load)

reference: https://simplejson.readthedocs.org/en/latest/

Solution 4

With older versions of Django, you can use the bundled simplejson, calling loads (which takes a string) instead of load (which takes a file object).

from django.utils import simplejson

simplejson.loads(str_to_load, "utf-8")
Author by

Software Enthusiastic

Updated on July 09, 2022

Comments

  • Software Enthusiastic
    Software Enthusiastic almost 2 years

    What is the best way to load JSON Strings in Python?

    I want to use json.loads to process unicode like this:

    import json
    json.loads(unicode_string_to_load)
    

    I also tried supplying 'encoding' parameter with value 'utf-16', but the error did not go away.

    Full SSCCE with error:

    # -*- coding: utf-8 -*-
    import json
    value = '{"foo" : "bar"}'
    print(json.loads(value)['foo'])     #This is correct, prints 'bar'
    
    some_unicode = unicode("degradé")  
    #last character is latin e with acute, "\xc3\xa9" in UTF-8
    value = '{"foo" : "' + some_unicode + '"}'
    print(json.loads(value)['foo'])            #incorrect, throws error
    

    Error:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 
    6: ordinal not in range(128)
    
  • Software Enthusiastic
    Software Enthusiastic about 14 years
    It is difficult to identify particular encoding during load because source data may contain characters from various languages of the world. Is there any way to detect encoding type?
  • Software Enthusiastic
    Software Enthusiastic about 14 years
    str_to_load keeps on changing, utf-8 worked for some, utf-32 worked for some... but how do I auto detect it?
  • Software Enthusiastic
    Software Enthusiastic about 14 years
    That string is '{"successful":true, "data":[76,{"posting_id":"1753178","site_tender_id":"3188446'
  • Alex Martelli
    Alex Martelli about 14 years
    To try and guess the encoding of a byte string -- try chardet.feedparser.org . The string you show is ASCII (which is also valid utf-8 by definition, and also valid iso-8859-1, etc: ASCII is the common subset of most encodings!) so it's impossible to guess what potential non-ASCII encoding it might be in. UnicodeDecodeError messages carry the exact index of the first problematic byte, so show the repr of the 80-long byte string centered on that index when you do get an error.
  • Software Enthusiastic
    Software Enthusiastic about 14 years
    When I read the entire string, I found unicode characters, have a look at it in the next string... "Lucaya, Grand Bahama; 4 Bedroom, 3 \xbd Bathroom"
  • Amit Patil
    Amit Patil about 14 years
    The encoding of the string depends on where you got it from. That string is probably one of ISO-8859-1 or Windows code page 1252. If your string is coming from a form submission from a web page, it will be in the same encoding as that web page. You really want to be using UTF-8 if you have any say in the matter. You can also avoid all charset problems by getting the JSON encoder to write non-ASCII characters using the JavaScript \u escape; Python's json.dump does this by default but JavaScript's JSON.stringify does not.
  • Alex Martelli
    Alex Martelli about 14 years
    \xbd in ISO-8859-x for several values of x (and other encodings such as CP-1252) is the single character representing the fraction 1/2 so your encoding's likely to be among this group.
  • Alex Martelli
    Alex Martelli about 14 years
    BTW, latin-1 is the old name for iso-8859-1 and these days you're much more likely to see iso-8859-15 -- the best-known difference is that the latter includes the Euro sign (a handful of other symbols differ too). If you decode with -1 and the string was encoded with -15 it will mostly be OK but Euro signs will look very peculiar when you print or show them.
  • Griff
    Griff about 8 years
    this no longer works in django as it uses the default that comes with python
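
The suggestion in the comments to show the bytes around the index reported by UnicodeDecodeError can be sketched as follows for Python 3. The helper name `failure_window` is hypothetical; `exc.start` is the standard attribute holding the index of the first problematic byte. The loop afterwards simply prints how the byte 0xbd from the sample string decodes under a few of the encodings mentioned in the comments.

```python
def failure_window(raw, encoding='utf-8', width=80):
    """Return the repr of up to `width` bytes centered on the first
    undecodable byte, or None if `raw` decodes cleanly."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        begin = max(exc.start - width // 2, 0)
        return repr(raw[begin:begin + width])
    return None

# Sample bytes taken from the string quoted in the comments.
raw = b'Lucaya, Grand Bahama; 4 Bedroom, 3 \xbd Bathroom'
print(failure_window(raw))

# Compare how the suspicious byte decodes under candidate encodings.
for enc in ('latin-1', 'cp1252', 'iso-8859-15'):
    print(enc, ascii(b'\xbd'.decode(enc)))
```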