UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-6: invalid data

161,184

Solution 1

The string you're trying to parse as JSON is not encoded in UTF-8. Most likely it is encoded in ISO-8859-1. Try the following:

json.loads(unicode(opener.open(...).read(), "ISO-8859-1"))

That will handle any umlauts that might get in the JSON message.
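
For readers on Python 3, where bytes and text are separate types, the same idea can be sketched as follows. This is a minimal illustration, not the original poster's code: the helper name parse_latin1_json is hypothetical, the payload is the sample string quoted in the question, and ISO-8859-1 is still only an assumption about what the server sends.

```python
import json

# Payload quoted in the question: byte 0xf6 is o-umlaut in ISO-8859-1,
# but it is not valid UTF-8 in this position.
raw = b'[{"t":"q","s":"abh\xf6ren"}]'


def parse_latin1_json(payload):
    """Decode the bytes as ISO-8859-1 (assumed), then parse the text as JSON."""
    text = payload.decode("iso-8859-1")
    return json.loads(text)


# Hypothetical helper applied to the sample payload.
suggestions = [item["s"] for item in parse_latin1_json(raw)]
print(suggestions)
```

Decoding first and parsing second keeps the guess about the encoding in one visible place, so it is easy to change if the guess turns out to be wrong.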

You should read Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). I hope that it will clarify some issues you're having around Unicode.

Solution 2

My solution is a bit funny. I never thought it would be as easy as saving the file with the UTF-8 codec. I'm using Notepad++ (v5.6.8), and I didn't notice that I had initially saved the file with the ANSI codec. I keep all localized dictionaries in a separate file. I found my solution under the 'Encoding' tab in Notepad++: I selected 'Encoding in UTF-8 without BOM' and saved the file. It works brilliantly.
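
The same re-save can be scripted instead of done by hand in an editor. The sketch below is written for Python 3 and assumes that "ANSI" means Windows-1252 (cp1252), which is common on Western European Windows installs but not guaranteed; the function name reencode_file and the throwaway demo file are hypothetical.

```python
import os
import tempfile


def reencode_file(path, source_encoding="cp1252", target_encoding="utf-8"):
    """Rewrite a text file in place, converting it between encodings."""
    # newline="" keeps the file's original line endings untouched.
    with open(path, "r", encoding=source_encoding, newline="") as handle:
        text = handle.read()
    # The plain "utf-8" codec writes no byte order mark (BOM).
    with open(path, "w", encoding=target_encoding, newline="") as handle:
        handle.write(text)


# Demo on a throwaway file holding o-umlaut as the single byte 0xf6.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as handle:
    handle.write(b"abh\xf6ren\r\n")
reencode_file(demo_path)
with open(demo_path, "rb") as handle:
    converted = handle.read()
os.remove(demo_path)
print(converted)
```

If the source encoding is guessed wrong, the file will still convert without an error but contain the wrong characters, so it is worth checking the result by eye.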

Solution 3

The error you're seeing means the data you receive from the remote end isn't valid JSON. JSON (according to the specification) is normally UTF-8, but can also be UTF-16 or UTF-32 (in either big- or little-endian). The exact error you're seeing means some part of the data was not valid UTF-8 (and also wasn't UTF-16 or UTF-32, as those would produce different errors).
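
To see why the byte in question is rejected, it helps to compare the two encodings of the same word. The following Python 3 sketch uses the word from the question; the helper name is_valid_utf8 is hypothetical.

```python
# "abh\u00f6ren" is the word from the question, with o-umlaut written as an escape.
word = "abh\u00f6ren"

latin1_bytes = word.encode("iso-8859-1")  # o-umlaut becomes the single byte 0xf6
utf8_bytes = word.encode("utf-8")         # o-umlaut becomes the two bytes 0xc3 0xb6


def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(latin1_bytes, is_valid_utf8(latin1_bytes))
print(utf8_bytes, is_valid_utf8(utf8_bytes))
```

The single-byte form fails the check, which is the situation the traceback reports.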

Perhaps you should examine the actual response you receive from the remote end, instead of blindly passing the data to json.loads(). Right now, you're reading all the data from the response into a string and assuming it's JSON. Instead, check the content type of the response. Make sure the webpage is actually claiming to give you JSON and not, for example, an error message that isn't JSON.

(Also, after checking the response use json.load() by passing it the file-like object returned by opener.open(), instead of reading all data into a string and passing that to json.loads().)
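
A minimal Python 3 sketch of that check is shown below. It works on a Content-Type header value and a body that have already been fetched, so it runs without a network connection; the function name parse_json_response, the strict application/json test, and the UTF-8 fallback are all assumptions made for illustration, and a real server may use a different media type.

```python
import json
from email.message import Message


def parse_json_response(content_type_header, body, fallback_charset="utf-8"):
    """Parse a response body as JSON only if the header claims it is JSON."""
    # The standard library's header parser splits media type and charset.
    msg = Message()
    msg["Content-Type"] = content_type_header
    if msg.get_content_type() != "application/json":
        raise ValueError("not JSON: %s" % msg.get_content_type())
    # Use the declared charset if there is one, otherwise the fallback.
    charset = msg.get_content_charset() or fallback_charset
    return json.loads(body.decode(charset))


# Sample body from the question, with an assumed header declaring its charset.
raw = b'[{"t":"q","s":"abh\xf6ren"}]'
parsed = parse_json_response("application/json; charset=ISO-8859-1", raw)
print(parsed)
```

With urllib.request in Python 3, the headers object of a response offers the same get_content_type() and get_content_charset() methods, so the check can be applied to a live response in the same way.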

Solution 4

Changing the encoding to Latin1 / ISO-8859-1 solves an issue I observed with html2text.py when invoked on the output of tex4ht. I use that for an automated word count on LaTeX documents: tex4ht converts them to HTML, and then html2text.py strips them down to pure text for further counting through wc -w. Now, if, for example, a German umlaut comes in through a literature database entry, that process fails, as html2text.py complains, e.g.:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 32243-32245: invalid data

These errors are particularly hard to track down, and you do want to keep the umlaut in your references section. A simple change inside html2text.py from

data = data.decode(encoding)

to

data = data.decode("ISO-8859-1")

solves that issue; if you're calling the script with the HTML file as the first parameter, you can also pass the encoding as the second parameter and spare the modification.
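
Rather than hard-coding one encoding, a script can try UTF-8 first and fall back to ISO-8859-1 only when that fails. This is a sketch in Python 3 and is not part of html2text.py; the function name decode_with_fallback is hypothetical. The fallback never raises, because every byte value maps to a character in ISO-8859-1, so it can silently produce wrong characters if the data is really in some other encoding.

```python
def decode_with_fallback(data, preferred="utf-8", fallback="iso-8859-1"):
    """Decode bytes with the preferred codec, falling back on failure."""
    try:
        return data.decode(preferred)
    except UnicodeDecodeError:
        # ISO-8859-1 accepts any byte sequence, so this branch cannot fail.
        return data.decode(fallback)


from_utf8 = decode_with_fallback(b"abh\xc3\xb6ren")  # valid UTF-8 input
from_latin1 = decode_with_fallback(b"abh\xf6ren")    # single-byte input
print(from_utf8 == from_latin1)
```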

Solution 5

In case someone has the same problem: I'm using vim with YouCompleteMe, and ycmd failed to start with this error message. What I did was run export LC_CTYPE="en_US.UTF-8", and the problem was gone.

Author by

ihucos

Updated on June 05, 2020

Comments

  • ihucos
    ihucos almost 4 years

    How does the Unicode thing work in Python 2? I just don't get it.

    Here I download data from a server and parse it as JSON.

    Traceback (most recent call last):
      File "/usr/local/lib/python2.6/dist-packages/eventlet-0.9.12-py2.6.egg/eventlet/hubs/poll.py", line 92, in wait
        readers.get(fileno, noop).cb(fileno)
      File "/usr/local/lib/python2.6/dist-packages/eventlet-0.9.12-py2.6.egg/eventlet/greenthread.py", line 202, in main
        result = function(*args, **kwargs)
      File "android_suggest.py", line 60, in fetch
        suggestions = suggest(chars)
      File "android_suggest.py", line 28, in suggest
        return [i['s'] for i in json.loads(opener.open('https://market.android.com/suggest/SuggRequest?json=1&query='+s+'&hl=de&gl=DE').read())]
      File "/usr/lib/python2.6/json/__init__.py", line 307, in loads
        return _default_decoder.decode(s)
      File "/usr/lib/python2.6/json/decoder.py", line 319, in decode
        obj, end = self.raw_decode(s, idx=_w(s, 0).end())
      File "/usr/lib/python2.6/json/decoder.py", line 336, in raw_decode
        obj, end = self._scanner.iterscan(s, **kw).next()
      File "/usr/lib/python2.6/json/scanner.py", line 55, in iterscan
        rval, next_pos = action(m, context)
      File "/usr/lib/python2.6/json/decoder.py", line 217, in JSONArray
        value, end = iterscan(s, idx=end, context=context).next()
      File "/usr/lib/python2.6/json/scanner.py", line 55, in iterscan
        rval, next_pos = action(m, context)
      File "/usr/lib/python2.6/json/decoder.py", line 183, in JSONObject
        value, end = iterscan(s, idx=end, context=context).next()
      File "/usr/lib/python2.6/json/scanner.py", line 55, in iterscan
        rval, next_pos = action(m, context)
      File "/usr/lib/python2.6/json/decoder.py", line 155, in JSONString
        return scanstring(match.string, match.end(), encoding, strict)
    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-6: invalid data
    

    thank you!!

    EDIT: the following string causes the error: '[{"t":"q","s":"abh\xf6ren"}]'. \xf6 should be decoded to ö (abhören)

  • ihucos
    ihucos almost 13 years
    it didnt work: 'ascii' codec can't decode byte 0xf6 in position 18: ordinal not in range(128)
  • Thomas Wouters
    Thomas Wouters almost 13 years
    unicode(chars) is never the right way to decode to unicode (you at least need to specify the encoding), and that really isn't the problem here anyway.
  • John Machin
    John Machin almost 13 years
    ANSI is "American National Standards Institute".
  • ihucos
    ihucos almost 13 years
    opener.open does not return a file like object: TypeError: expected string or buffer. i edited my post and added the string that is causing the problem
  • ihucos
    ihucos almost 13 years
    ok, i added the string that is causing the problem to my post.
  • Thomas Wouters
    Thomas Wouters almost 13 years
    That TypeError means you're still using json.loads() instead of json.load(). opener.open() does return a file-like object, because you use it as one in your code. The JSON string you have is invalid JSON -- \xf6 is not ö in UTF-8, only in some single-byte encodings (like iso-8859-1.) JSON is not supposed to be given in those encodings, just UTF-8, UTF-16 or UTF-32. You will either have to fix the supplier of the JSON (to make it use \\u00f6 instead of \xf6) or find out what encoding this is and recode into UTF-8 before parsing it as JSON.
  • Thomas Wouters
    Thomas Wouters almost 13 years
    The obvious problem in the OP's case is that the data is not UTF-8, so decoding it using UTF-8 will not work either. JSON data is bytes, not unicode. It contains unicode, and the bytes are supposed to be encoded in UTF-8, UTF-16 or UTF-32, but they aren't that here.
  • ihucos
    ihucos almost 13 years
    oops, sorry, i didn't notice the "s". but thank you, i will just give it up )-:
  • Gopakumar N G
    Gopakumar N G over 10 years
    NameError: global name 'opener' is not defined ?
  • Tadeusz A. Kadłubowski
    Tadeusz A. Kadłubowski over 10 years
    @GopakumarNG: Look into urllib library to understand normal's example stacktrace.
  • sam boosalis
    sam boosalis over 10 years
    @TadeuszA.Kadłubowski thanks for "ISO-8859-1". i didn't know about this encoding, and it randomly patched this nltk bug for me.
  • Admin
    Admin almost 9 years
    @GopakumarNG: json.load( open( fname, 'r' ), encoding='ISO-8859-1' )
  • Dave Liu
    Dave Liu almost 9 years
    @samboosalis nltk bug link broken
  • AmaChefe
    AmaChefe over 8 years
    I wish i could upvote you 10 times. THIS was the problem i battled for 2 days!
  • AmaChefe
    AmaChefe over 8 years
    In my case I choose 'Encoding in UTF8' and it still worked, but exposed odd characters which i replaced. Everybody using Notepad++ or other editors should take note.
  • user805981
    user805981 almost 8 years
    @TadeuszA.Kadłubowski I'm having a similar problem with filestorage from flask. How should I be loading filestorage object type in this?
  • Sarang Manjrekar
    Sarang Manjrekar over 7 years
    I am reading the data, inside a class, by a method. I had same error with my data, now when I am trying the ISO-8859-1 encoding in an explicit function, it goes through well. But when I use it in method, it still gives the error : 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte
  • ierdna
    ierdna about 7 years
    TL;DR for the article: the author of the text must supply you with encoding. and the reverse: if you're the author/programmer (e.g. html, email) you must include encoding in it.