UnicodeDecodeError while using json.dumps()


Solution 1

\xe1 is not decodable using the utf-8 or utf-16 encodings: in UTF-8, 0xE1 is the lead byte of a three-byte sequence, so on its own it is incomplete, and a single byte is not a valid UTF-16 code unit either.

>>> '\xe1'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 0: unexpected end of data
>>> '\xe1'.decode('utf-16')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0xe1 in position 0: truncated data

Try latin-1 encoding:

>>> record = (5790, 'Vlv-Gate-Assy-Mdl-\xe1M1-2-\xe19/16-10K-BB Credit Memo            ',
...           60, True, '40141613')
>>> json.dumps(record, encoding='latin1')
'[5790, "Vlv-Gate-Assy-Mdl-\\u00e1M1-2-\\u00e19/16-10K-BB Credit Memo            ", 60, true, "40141613"]'
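Latin-1 works here because it maps every byte value 0x00-0xFF directly to the Unicode code point with the same number, so no byte can fail to decode (a quick check, assuming the same Python 2.7 session as above):

>>> '\xe1'.decode('latin-1')
u'\xe1'
>>> import unicodedata
>>> unicodedata.name(u'\xe1')
'LATIN SMALL LETTER A WITH ACUTE'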

Or, pass ensure_ascii=False to json.dumps so that it does not try to decode the string:

>>> json.dumps(record, ensure_ascii=False)
'[5790, "Vlv-Gate-Assy-Mdl-\xe1M1-2-\xe19/16-10K-BB Credit Memo            ", 60, true, "40141613"]'
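Note that with ensure_ascii=False the result is still a byte string containing the raw \xe1 bytes. If you would rather get the usual ASCII-escaped output without the encoding argument, you can decode the byte-string fields up front (a minimal sketch, assuming the data really is Latin-1):

>>> decoded = [f.decode('latin-1') if isinstance(f, str) else f for f in record]
>>> json.dumps(decoded)
'[5790, "Vlv-Gate-Assy-Mdl-\\u00e1M1-2-\\u00e19/16-10K-BB Credit Memo            ", 60, true, "40141613"]'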

Solution 2

I had a similar problem, and came up with the following approach to guarantee either unicode strings or byte strings from either kind of input. In short, include and use the following lambdas:

# guarantee unicode string
_u = lambda t: t.decode('UTF-8', 'replace') if isinstance(t, str) else t
_uu = lambda *tt: tuple(_u(t) for t in tt) 
# guarantee byte string in UTF8 encoding
_u8 = lambda t: t.encode('UTF-8', 'replace') if isinstance(t, unicode) else t
_uu8 = lambda *tt: tuple(_u8(t) for t in tt)

Applied to your question:

import json
o = (5790, u"Vlv-Gate-Assy-Mdl-\xe1M1-2-\xe19/16-10K-BB Credit Memo            ", 60,
 True, '40141613')
as_json = json.dumps(_uu8(*o))
as_obj = json.loads(as_json)
print "object\n ", o
print "json (type %s)\n %s " % (type(as_json), as_json)
print "object again\n ", as_obj

=>

object
  (5790, u'Vlv-Gate-Assy-Mdl-\xe1M1-2-\xe19/16-10K-BB Credit Memo            ', 60, True, '40141613')
json (type <type 'str'>)
  [5790, "Vlv-Gate-Assy-Mdl-\u00e1M1-2-\u00e19/16-10K-BB Credit Memo            ", 60, true, "40141613"]
object again
  [5790, u'Vlv-Gate-Assy-Mdl-\xe1M1-2-\xe19/16-10K-BB Credit Memo            ', 60, True, u'40141613']
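As a quick sanity check on the round trip (assuming the same session), normalizing the original tuple to unicode with _uu makes it compare equal to the decoded result:

>>> _uu(*o) == tuple(as_obj)
True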

Here's some more reasoning about this.


Comments

  • deostroll, over 3 years ago

    I have strings as follows in my python list (taken from command prompt):

    >>> o['records'][5790]
    (5790, 'Vlv-Gate-Assy-Mdl-\xe1M1-2-\xe19/16-10K-BB Credit Memo            ', 60,
     True, '40141613')
    >>>
    

    I have tried suggestions as mentioned here: Changing default encoding of Python?

    I further changed the default encoding to utf-16 as well, but json.dumps() still threw an exception as follows:

    >>> write(o)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "okapi_create_master.py", line 49, in write
        o = json.dumps(output)
      File "C:\Python27\lib\json\__init__.py", line 231, in dumps
        return _default_encoder.encode(obj)
      File "C:\Python27\lib\json\encoder.py", line 201, in encode
        chunks = self.iterencode(o, _one_shot=True)
      File "C:\Python27\lib\json\encoder.py", line 264, in iterencode
        return _iterencode(o, 0)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 25: invalid
    continuation byte
    

    I can't figure out what kind of transformation is required for such strings so that json.dumps() works.