python: UnicodeDecodeError: 'utf8' codec can't decode byte 0xc0 in position 0: invalid start byte

27,444

Solution 1

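This is indeed invalid UTF-8. In UTF-8, only code points in the range U+0080 to U+07FF, inclusive, can be encoded using two bytes, so the lead byte of a two-byte sequence is always between 0xc2 and 0xdf. A two-byte sequence starting with 0xc0 or 0xc1 could only encode a code point below U+0080, which must use the one-byte form; such "overlong" encodings are forbidden. Read the Wikipedia article more closely, and you will see the same thing. As a result, the byte 0xc0 may not appear in UTF-8, ever. The same is true of 0xc1.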
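The range claim can be checked directly. The following is a minimal Python 3 sketch (the question itself uses Python 2); the names `lowest_two_byte`, `highest_two_byte`, `overlong_payload`, and `is_valid_utf8` are illustrative and not from the original post.

```python
# Encode the smallest and largest code points that need two bytes.
lowest_two_byte = "\u0080".encode("utf-8")   # b'\xc2\x80'
highest_two_byte = "\u07ff".encode("utf-8")  # b'\xdf\xbf'

# If C0 AF were accepted, its payload bits (5 from the lead byte,
# 6 from the continuation byte) would spell out U+002F, i.e. '/',
# which already has the one-byte encoding 0x2F.
overlong_payload = ((0xC0 & 0x1F) << 6) | (0xAF & 0x3F)  # 0x2F


def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(lowest_two_byte, highest_two_byte, hex(overlong_payload))
print(is_valid_utf8(b"\xc0\xaf"))  # lead byte 0xC0: rejected
print(is_valid_utf8(b"\xc2\xaf"))  # lead byte 0xC2: accepted
```

So the "110xxxxx 10xxxxxx" bit pattern is necessary for a two-byte sequence but not sufficient: the decoded value must also fall in U+0080 to U+07FF.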

Some UTF-8 decoders have erroneously decoded sequences like C0 AF as valid UTF-8, which has led to security vulnerabilities in the past.
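For the goal stated in the question (generating random, valid UTF-8), one way to avoid forbidden bytes altogether is to pick a random code point and let the encoder produce the bytes, rather than picking random bytes. This is a sketch in Python 3, not the asker's method; `RANGES`, `SURROGATES`, and `random_utf8` are hypothetical names introduced here.

```python
import random

# Code points U+D800..U+DFFF are surrogates and cannot be encoded in UTF-8.
SURROGATES = range(0xD800, 0xE000)

# (first, last) code point for each encoded length in bytes.
RANGES = {
    1: (0x0000, 0x007F),
    2: (0x0080, 0x07FF),
    3: (0x0800, 0xFFFF),
    4: (0x10000, 0x10FFFF),
}


def random_utf8(num_bytes=None, rng=random):
    """Return a random, valid UTF-8 byte string of the requested length."""
    if num_bytes is None:
        num_bytes = rng.randint(1, 4)
    low, high = RANGES[num_bytes]
    while True:
        code_point = rng.randint(low, high)  # randint includes both ends
        if code_point not in SURROGATES:
            return chr(code_point).encode("utf-8")


sample = random_utf8(2)
print(sample, sample.decode("utf-8") == sample.decode("utf-8"))
```

Because the encoder never emits overlong forms, a two-byte result always starts with a lead byte in 0xC2..0xDF, and a four-byte result never exceeds U+10FFFF.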

Solution 2

Found one encoding that actually accepts 0xc0: encoding="ISO-8859-1"
(from https://stackoverflow.com/a/27456542/4355695)

This only works if the rest of the file contains no multi-byte UTF-8 characters, since ISO-8859-1 would decode those as garbage, so it is not an exact answer to the question. It may still help people like me who had no Unicode characters in their file anyway and just wanted Python to load it, when both the utf-8 and ascii encodings were erroring out.

More on ISO-8859-1: What is the difference between UTF-8 and ISO-8859-1?
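To make the trade-off concrete, here is a small Python 3 sketch. The byte string `raw` is made-up sample data, and the variable names are illustrative. ISO-8859-1 maps every byte to a character, so decoding never fails, but bytes that were really UTF-8 come out as mojibake. The `errors="replace"` handler is shown as one alternative that keeps UTF-8 and marks the bad byte instead.

```python
raw = b"caf\xc0"  # hypothetical data containing the offending byte 0xC0

# ISO-8859-1 accepts any byte; 0xC0 becomes U+00C0 (A with grave accent).
as_latin1 = raw.decode("iso-8859-1")

# Strict UTF-8 would raise here; "replace" substitutes U+FFFD instead.
as_replaced = raw.decode("utf-8", errors="replace")

# The cost of ISO-8859-1: genuine UTF-8 text is silently misread.
mojibake = "\u00e9".encode("utf-8").decode("iso-8859-1")

print(repr(as_latin1), repr(as_replaced), repr(mojibake))
```

Which of these is appropriate depends on where the 0xc0 byte came from, which is the point made in the comments below.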


Updated on July 17, 2022

Comments

  • Admin
    Admin almost 2 years

    I'm trying to write a script that generates random unicode by creating random utf-8 encoded strings and then decoding those to unicode. It works fine for a single byte, but with two bytes it fails.

    For instance, if I run the following in a python shell:

    >>> a = str()
    >>> a += chr(0xc0) + chr(0xaf)
    >>> print a.decode('utf-8')
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xc0 in position 0: invalid start byte

    According to the UTF-8 scheme (https://en.wikipedia.org/wiki/UTF-8#Description), the byte sequence 0xc0 0xaf should be valid, as 0xc0 starts with 110 and 0xaf starts with 10.


    Here's my python script:

    import random

    class GenFuzz(object):
        def unicode(self):
            '''returns a random (astral) utf encoded byte string'''
            num_bytes = random.randint(1,4)
            if num_bytes == 1:
                return self.gen_utf8(num_bytes, 0x00, 0x7F)
            elif num_bytes == 2:
                return self.gen_utf8(num_bytes, 0xC0, 0xDF)
            elif num_bytes == 3:
                return self.gen_utf8(num_bytes, 0xE0, 0xEF)
            elif num_bytes == 4:
                return self.gen_utf8(num_bytes, 0xF0, 0xF7)

        def gen_utf8(self, num_bytes, start_val, end_val):
            byte_str = list()
            byte_str.append(random.randrange(start_val, end_val)) # start byte
            for i in range(0,num_bytes-1):
                byte_str.append(random.randrange(0x80,0xBF)) # trailing bytes
            a = str()
            for b in byte_str:
                a += chr(b)
            ret = a.decode('utf-8')
            return ret

    if __name__ == "__main__":
        g = GenFuzz()
        print g.gen_utf8(2,0xC0,0xDF)
    
  • Nikhil VJ
    Nikhil VJ about 6 years
    What then would be an encoding that tolerates 0xc0? Or, how do I zap this annoying character off my file? My pandas read_table function is getting stuck here.
  • Dietrich Epp
    Dietrich Epp about 6 years
    That's a difficult question to answer. It's like saying you have a hungry cat in your house. I don't know if you should feed the cat because it's yours, if you should call animal control because it's a stray cat, or if there's a tiger loose from the zoo and you have a serious problem. It's the same way with data. I don't know if you want to keep the 0xc0 because it's important, get rid of it because you are okay with approximate data, or whether the fact that you have 0xc0 in the first place indicates a serious problem somewhere else.
  • Nikhil VJ
    Nikhil VJ about 6 years
    Found a workaround : encoding="ISO-8859-1" from stackoverflow.com/a/27456542/4355695
  • Nikhil VJ
    Nikhil VJ about 6 years
    You don't need to worry so much about other people's data; let them bear the consequences if they killed the wrong cat :P
  • Dietrich Epp
    Dietrich Epp about 6 years
    @nikhilvj: If I just wanted to let other people bear the consequences for making uninformed decisions, I wouldn't be answering questions on this site.
  • Dietrich Epp
    Dietrich Epp about 6 years
    This answer should be attached to a different question.
  • Nikhil VJ
    Nikhil VJ about 6 years
    This (OP) is the question I came across as I was searching for this (my) answer.
  • Nikhil VJ
    Nikhil VJ about 6 years
    Ok created a separate question here: stackoverflow.com/questions/49845554/…