"for line in..." results in UnicodeDecodeError: 'utf-8' codec can't decode byte

743,196

Solution 1

As suggested by Mark Ransom, I found the right encoding for that problem. The encoding was "ISO-8859-1", so replacing open("u.item", encoding="utf-8") with open('u.item', encoding = "ISO-8859-1") will solve the problem.

Solution 2

The following also worked for me. ISO 8859-1 is going to save a lot, mainly if using Speech Recognition APIs.

Example:

file = open('../Resources/' + filename, 'r', encoding="ISO-8859-1")

Solution 3

Your file doesn't actually contain UTF-8 encoded data; it contains some other encoding. Figure out what that encoding is and use it in the open call.

In Windows-1252 encoding, for example, the 0xe9 would be the character é.

Solution 4

Try this to read using Pandas:

pd.read_csv('u.item', sep='|', names=m_cols, encoding='latin-1')

Solution 5

This works:

open('filename', encoding='latin-1')

Or:

open('filename', encoding="ISO-8859-1")
Share:
743,196

Related videos on Youtube

SujitS
Author by

SujitS

Updated on March 17, 2022

Comments

  • SujitS
    SujitS about 2 years

    Here is my code,

    for line in open('u.item'):
    # Read each line
    

    Whenever I run this code it gives the following error:

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2892: invalid continuation byte

    I tried to solve this and add an extra parameter in open(). The code looks like:

    for line in open('u.item', encoding='utf-8'):
    # Read each line
    

    But again it gives the same error. What should I do then?

    • Andreas Jung
      Andreas Jung over 10 years
      Badly encoded data I would assume.
    • Mark Tolonen
      Mark Tolonen over 10 years
      Or just not UTF-8 data.
    • tripleee
      tripleee over 5 years
    • Jesse W. Collins
      Jesse W. Collins almost 5 years
      We had this error with msgpack when using python 3 instead of python 2.7. For us, the course of action was to work with python 2.7.
  • SujitS
    SujitS over 10 years
    So, How can I find out what encoding is it! I am using linux
  • RemcoGerlich
    RemcoGerlich over 10 years
    There is no way to do that that always works, but see the answer to this question: stackoverflow.com/questions/436220/…
  • 0 _
    0 _ almost 8 years
    Explicit is better than implicit (PEP 20).
  • SujitS
    SujitS about 7 years
    But this is version 3
  • Jeril
    Jeril about 7 years
    Yeah I know. I thought it might be helpful for the people using Python 2
  • fenkerbb
    fenkerbb over 6 years
    Worked for me in Python3 as well
  • RolfBly
    RolfBly over 6 years
    You may be correct that the OP is reading ISO 8859-1, as can be deduced from the 0xe9 (é) in the error message, but you should explain why your solution works. The reference to speech recognitions API's does not help.
  • Max Candocia
    Max Candocia over 6 years
    In case you want something easier to remember, 'ISO-8859-1' is also known as 'latin-1' or 'latin1'.
  • Kjeld Flarup
    Kjeld Flarup about 6 years
    The trick is that ISO-8859-1 or Latin_1 is 8 bit character sets, thus all garbage has a valid value. Perhaps not useable, but if you want to ignore!
  • Евгений Коптюбенко
    Евгений Коптюбенко over 5 years
    I had the same issue UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 32: invalid continuation byte. I used python 3.6.5 to install aws cli. And when I tried aws --version it failed with this error. So I had to edit /Library/Frameworks/Python.framework/Versions/3.6/lib/python‌​3.6/configparser.py and changed the code to the following def read(self, filenames, encoding="ISO-8859-1"):
  • JoseOrtiz3
    JoseOrtiz3 over 5 years
    Is there an automatic way of detecting encoding?
  • VertigoRay
    VertigoRay about 5 years
    @OrangeSherbet I implemented detection using chardet. Here's the one-liner (after import chardet): chardet.detect(open(in_file, 'rb').read())['encoding']. Check out this answer for details: stackoverflow.com/a/3323810/615422
  • JohnAndrews
    JohnAndrews over 4 years
    How do you get the encoding of a file?
  • Alastair McCormack
    Alastair McCormack over 4 years
    Not sure why you're suggesting Pandas. The solution is setting the correct encoding, which you've chanced upon here.
  • Mark Ransom
    Mark Ransom almost 4 years
    Note that 'ISO-8859-1' will always work even if it's not the right encoding, because each of the 256 byte values maps to a Unicode character. I believe it's the only encoding which does this.
  • MartenCatcher
    MartenCatcher almost 4 years
    This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post. - From Review
  • Billy
    Billy over 3 years
    I like @VertigoRay's suggestion of chardet in a script, but for something really quick to diagnose what's going on, a simple file helped me: % file list.log list.log: ISO-8859 text vs %file playlist.txt playlist.txt: UTF-8 Unicode text, with CRLF, LF line terminators
  • Silidrone
    Silidrone over 3 years
    @MartenCatcher yeah but it helps future visitors to the question, although more explanation put to the answer would make it much better, I believe it serves better purpose as an answer rather than as a comment
  • Peter Mortensen
    Peter Mortensen over 3 years
    'latin-1' is the same as 'ISO-8859-1'?
  • Peter Mortensen
    Peter Mortensen over 3 years
    What is the intent? Ignoring errors? What are the consequences?
  • Peter Mortensen
    Peter Mortensen over 3 years
    Notepad++ is Windows only. For example, it doesn't work on Linux.
  • Peter Mortensen
    Peter Mortensen over 3 years
    What is "Encodage"? What language?
  • Mark Ransom
    Mark Ransom about 3 years
    @PeterMortensen yes it is, Wikipedia confirms it. They both produce the same output when used with decode in Python as well.
  • Mark Ransom
    Mark Ransom about 3 years
    Depends on what you mean by "works". If you mean avoids exceptions that's true, because it's the only encoding that doesn't have invalid bytes or sequences. Doesn't mean you'll get the proper characters though.
  • Thierry Lathuille
    Thierry Lathuille over 2 years
    Please note that questions and answers on SO must be in English only - even if the problem you encountered may bite mainly programmers using cyrillic alphabet.
  • Nikita Axenov
    Nikita Axenov over 2 years
    @ThierryLathuille, is it a real problem? Could you please give me a link/referense to the comunity rule on that issue?
  • Thierry Lathuille
    Thierry Lathuille over 2 years
    This is considered a real problem - and is probably what caused your answer to get downvoted. Non-English content is not allowed on SO (see for example meta.stackoverflow.com/questions/297673/… ), and the rule is really strictly respected. For questions in Russian, you have ru.stackoverflow.com , though ;)
  • Anonymous
    Anonymous over 2 years
    @ThierryLathuille This applies to the English content, not problems with non-English symbols. And this doesn't necessarily have to be about other languages, it could be a different UTF-8 character (for example, a checkmark).
  • Pooya Estakhri
    Pooya Estakhri over 2 years
    I was looking for an answer and interestingly you've answered 7 hours ago to a question asked 8 years ago. interesting coincidence .
  • Fred Zimmerman
    Fred Zimmerman over 2 years
    @vertigoray answer should be the accepted one IMHO -- answers without encoding detection cannot reliably solve the question
  • Mark Ransom
    Mark Ransom over 2 years
    I don't get it, why would you use a 33-line program to avoid typing one line in the shell?
  • Mark Ransom
    Mark Ransom over 2 years
    How would opening a file that doesn't exist generate a UnicodeDecodeError? And in Python it's customary to use the EAFP principle over the LBYL that you're endorsing here.
  • Mark Ransom
    Mark Ransom over 2 years
    @OrangeSherbet there's no sure way unless you can find out from whoever produced the file. But it's possible to guess based on the file contents and some guessing methods are better than others. By coincidence I chanced on a new way to do it in Python the other day: charset-normalizer.
  • Arpad Horvath -- Слава Україні
    Arpad Horvath -- Слава Україні about 2 years
    Nobody said that the file in the question is a csv file.
  • JGaber
    JGaber about 2 years
    "Encodage" is "Encoding" if the menu is in French