UnicodeDecodeError on Python 3

17,063

It looks like the file is not valid UTF-8. Try reading it with the latin-1 encoding instead:

file = open('exampleFileName', 'r', encoding='latin-1') 
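
To see why this helps, here is a minimal sketch that reproduces the error on a small temporary file and then reads the same file as latin-1. The sample text and the temporary file are assumptions made up for illustration; they are not the asker's data.

```python
import os
import tempfile

# Hypothetical sample: "día" written as latin-1, so í is the single byte 0xed.
raw = "día\n".encode("latin-1")

with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Decoding as UTF-8 fails: 0xed announces a multi-byte sequence,
    # but the byte that follows is not a valid continuation byte.
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True

    # Decoding as latin-1 succeeds, because every byte value maps to a character.
    with open(path, "r", encoding="latin-1") as f:
        lines = [line.rstrip("\n") for line in f]
finally:
    os.remove(path)

print(utf8_failed, lines)
```

Note that latin-1 never raises a decode error, since all 256 byte values are defined. A successful read therefore does not prove that latin-1 is the file's real encoding, so it is worth checking that the decoded text looks right.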
Author by

EliteKaffee

Updated on June 26, 2022

Comments

  • EliteKaffee
    EliteKaffee over 1 year

    I'm currently trying to run some simple regexes on a very big .txt file (a couple of million lines of text). The simplest code that causes the problem:

    file = open("exampleFileName", "r")
    for line in file:
        pass
    

    The error message:

    Traceback (most recent call last):
      File "example.py", line 34, in <module>
        example()
      File "example.py", line 16, in example
        for line in file:
      File "/usr/lib/python3.4/codecs.py", line 319, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 7332: invalid continuation byte
    

    How can I fix this? Is UTF-8 the wrong encoding? And if it is, how do I know which one is right?

    Thanks and best regards!

    • Admin
      Admin about 7 years
      Post the output of file -bi [your_filename]. You'll get an encoding. After that, pass that encoding as the encoding argument to open().
    • Reihan_amn
      Reihan_amn over 5 years
      What does the file -bi command do?
  • chivorotkiv
    chivorotkiv almost 6 years
    Do you know how to do the same when reading from the command line? I use the input() function. Is there a way to configure its encoding, or is there some other configurable function?
  • Reihan_amn
    Reihan_amn over 5 years
    How did you figure out that latin-1 was the encoding to use?
  • mic4ael
    mic4ael over 5 years
    0xed is the character í in the latin-1 encoding
  • Reihan_amn
    Reihan_amn over 5 years
    So confused! Now that Unicode is here to cover all ~1.1 million code points, why is latin-1 encoding still around? Shouldn't latin-1 be a subset of UTF? Shouldn't every character defined in latin-1 now be part of UTF? If so, why can't UTF support it? (Sorry, I'm kind of new to this field.)
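
On the last question in the comments: Unicode does contain every latin-1 character, and its first 256 code points match latin-1 exactly. What differs is the byte representation. UTF-8 stores code points above 0x7F as two or more bytes, so a file written as latin-1 is usually not valid UTF-8. A small sketch of the difference, using í as the example (the helper name `decodes_as_utf8` is made up for illustration):

```python
ch = "í"

# Same character, same code point, two different byte representations.
latin1_bytes = ch.encode("latin-1")  # one byte: 0xed
utf8_bytes = ch.encode("utf-8")      # two bytes: 0xc3 0xad


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is valid UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(hex(ord(ch)))                   # code point U+00ED
print(latin1_bytes, utf8_bytes)
print(decodes_as_utf8(utf8_bytes))    # the UTF-8 bytes decode fine
print(decodes_as_utf8(latin1_bytes))  # the lone latin-1 byte does not
```

So the character is supported by UTF-8; it is the latin-1 bytes on disk that the UTF-8 decoder rejects.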