"for line in..." results in UnicodeDecodeError: 'utf-8' codec can't decode byte

743,196

Solution 1

As suggested by Mark Ransom, I found the right encoding for that problem. The encoding was "ISO-8859-1", so replacing open("u.item", encoding="utf-8") with open('u.item', encoding = "ISO-8859-1") will solve the problem.

Solution 2

The following also worked for me. ISO 8859-1 is going to save a lot, mainly if using Speech Recognition APIs.

Example:

file = open('../Resources/' + filename, 'r', encoding="ISO-8859-1")

Solution 3

Your file doesn't actually contain UTF-8 encoded data; it contains some other encoding. Figure out what that encoding is and use it in the open call.

In Windows-1252 encoding, for example, the 0xe9 would be the character é.

Solution 4

Try this to read using Pandas:

pd.read_csv('u.item', sep='|', names=m_cols, encoding='latin-1')

Solution 5

This works:

open('filename', encoding='latin-1')

Or:

open('filename', encoding="ISO-8859-1")

View more solutions

743,196

SujitS

Updated on March 17, 2022

Comments

SujitS about 2 years
Here is my code,
```
for line in open('u.item'):
# Read each line
```
Whenever I run this code it gives the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2892: invalid continuation byte

I tried to solve this and add an extra parameter in open(). The code looks like:
```
for line in open('u.item', encoding='utf-8'):
# Read each line
```
But again it gives the same error. What should I do then?
- Andreas Jung over 10 years
  
  Badly encoded data I would assume.
- Mark Tolonen over 10 years
  
  Or just not UTF-8 data.
- tripleee over 5 years
  
  Possible duplicate of Python 3 UnicodeDecodeError - How do I debug UnicodeDecodeError?
- Jesse W. Collins almost 5 years
  
  We had this error with msgpack when using python 3 instead of python 2.7. For us, the course of action was to work with python 2.7.
SujitS over 10 years

So, How can I find out what encoding is it! I am using linux
RemcoGerlich over 10 years

There is no way to do that that always works, but see the answer to this question: stackoverflow.com/questions/436220/…
0 _ almost 8 years

Explicit is better than implicit (PEP 20).
SujitS about 7 years

But this is version 3
Jeril about 7 years

Yeah I know. I thought it might be helpful for the people using Python 2
fenkerbb over 6 years

Worked for me in Python3 as well
RolfBly over 6 years

You may be correct that the OP is reading ISO 8859-1, as can be deduced from the 0xe9 (é) in the error message, but you should explain why your solution works. The reference to speech recognitions API's does not help.
Max Candocia over 6 years

In case you want something easier to remember, 'ISO-8859-1' is also known as 'latin-1' or 'latin1'.
Kjeld Flarup about 6 years

The trick is that ISO-8859-1 or Latin_1 is 8 bit character sets, thus all garbage has a valid value. Perhaps not useable, but if you want to ignore!
Евгений Коптюбенко over 5 years

I had the same issue UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 32: invalid continuation byte. I used python 3.6.5 to install aws cli. And when I tried aws --version it failed with this error. So I had to edit /Library/Frameworks/Python.framework/Versions/3.6/lib/python‌3.6/configparser.py and changed the code to the following def read(self, filenames, encoding="ISO-8859-1"):
JoseOrtiz3 over 5 years

Is there an automatic way of detecting encoding?
VertigoRay about 5 years

@OrangeSherbet I implemented detection using chardet. Here's the one-liner (after import chardet): chardet.detect(open(in_file, 'rb').read())['encoding']. Check out this answer for details: stackoverflow.com/a/3323810/615422
JohnAndrews over 4 years

How do you get the encoding of a file?
Alastair McCormack over 4 years

Not sure why you're suggesting Pandas. The solution is setting the correct encoding, which you've chanced upon here.
Mark Ransom almost 4 years

Note that 'ISO-8859-1' will always work even if it's not the right encoding, because each of the 256 byte values maps to a Unicode character. I believe it's the only encoding which does this.
MartenCatcher almost 4 years

This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post. - From Review
Billy over 3 years

I like @VertigoRay's suggestion of chardet in a script, but for something really quick to diagnose what's going on, a simple file helped me: % file list.log list.log: ISO-8859 text vs %file playlist.txt playlist.txt: UTF-8 Unicode text, with CRLF, LF line terminators
Silidrone over 3 years

@MartenCatcher yeah but it helps future visitors to the question, although more explanation put to the answer would make it much better, I believe it serves better purpose as an answer rather than as a comment
Peter Mortensen over 3 years

'latin-1' is the same as 'ISO-8859-1'?
Peter Mortensen over 3 years

What is the intent? Ignoring errors? What are the consequences?
Peter Mortensen over 3 years

Notepad++ is Windows only. For example, it doesn't work on Linux.
Peter Mortensen over 3 years

What is "Encodage"? What language?
Mark Ransom about 3 years

@PeterMortensen yes it is, Wikipedia confirms it. They both produce the same output when used with decode in Python as well.
Mark Ransom about 3 years

Depends on what you mean by "works". If you mean avoids exceptions that's true, because it's the only encoding that doesn't have invalid bytes or sequences. Doesn't mean you'll get the proper characters though.
Thierry Lathuille over 2 years

Please note that questions and answers on SO must be in English only - even if the problem you encountered may bite mainly programmers using cyrillic alphabet.
Nikita Axenov over 2 years

@ThierryLathuille, is it a real problem? Could you please give me a link/referense to the comunity rule on that issue?
Thierry Lathuille over 2 years

This is considered a real problem - and is probably what caused your answer to get downvoted. Non-English content is not allowed on SO (see for example meta.stackoverflow.com/questions/297673/… ), and the rule is really strictly respected. For questions in Russian, you have ru.stackoverflow.com , though ;)
Anonymous over 2 years

@ThierryLathuille This applies to the English content, not problems with non-English symbols. And this doesn't necessarily have to be about other languages, it could be a different UTF-8 character (for example, a checkmark).
Pooya Estakhri over 2 years

I was looking for an answer and interestingly you've answered 7 hours ago to a question asked 8 years ago. interesting coincidence .
Fred Zimmerman over 2 years

@vertigoray answer should be the accepted one IMHO -- answers without encoding detection cannot reliably solve the question
Mark Ransom over 2 years

I don't get it, why would you use a 33-line program to avoid typing one line in the shell?
Mark Ransom over 2 years

How would opening a file that doesn't exist generate a UnicodeDecodeError? And in Python it's customary to use the EAFP principle over the LBYL that you're endorsing here.
Mark Ransom over 2 years

@OrangeSherbet there's no sure way unless you can find out from whoever produced the file. But it's possible to guess based on the file contents and some guessing methods are better than others. By coincidence I chanced on a new way to do it in Python the other day: charset-normalizer.
Arpad Horvath -- Слава Україні about 2 years

Nobody said that the file in the question is a csv file.
JGaber about 2 years

"Encodage" is "Encoding" if the menu is in French