Python 3 CSV file giving UnicodeDecodeError: 'utf-8' codec can't decode byte error when I print

117,861

Solution 1

We know the file contains the byte b'\x96' since it is mentioned in the error message:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 7386: invalid start byte

Now we can write a little script to find out if there are any encodings where b'\x96' decodes to ñ:

import pkgutil
import encodings
import os

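def all_encodings():
    # Names of the codec modules shipped in the standard library's encodings package
    modnames = {modname for importer, modname, ispkg in pkgutil.walk_packages(
        path=[os.path.dirname(encodings.__file__)], prefix='')}
    # Canonical codec names that the alias table points to
    aliases = set(encodings.aliases.aliases.values())
    return modnames.union(aliases)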

text = b'\x96'
for enc in all_encodings():
    try:
        msg = text.decode(enc)
    except Exception:
        continue
    if msg == 'ñ':
        print('Decoding {t} with {enc} is {m}'.format(t=text, enc=enc, m=msg))

which yields

Decoding b'\x96' with mac_roman is ñ
Decoding b'\x96' with mac_farsi is ñ
Decoding b'\x96' with mac_croatian is ñ
Decoding b'\x96' with mac_arabic is ñ
Decoding b'\x96' with mac_romanian is ñ
Decoding b'\x96' with mac_iceland is ñ
Decoding b'\x96' with mac_turkish is ñ

Therefore, try changing

with open('my_file.csv', 'r', newline='') as csvfile:

so that it names one of those encodings, such as:

with open('my_file.csv', 'r', encoding='mac_roman', newline='') as csvfile:
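
To check this end to end, here is a minimal sketch that writes a small Mac Roman-encoded file and reads it back. The file name sample_mac_roman.csv and the row contents are made up for illustration (the original file is not available), and a temporary directory is used so nothing else on disk is touched.

```python
import csv
import os
import tempfile

# Hypothetical sample data; the contents of the original file are not known.
rows = [['id', 'name'], ['1', 'Muñoz']]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'sample_mac_roman.csv')

    # Write the file as Mac Roman, so 'ñ' is stored as the single byte 0x96.
    with open(path, 'w', encoding='mac_roman', newline='') as f:
        csv.writer(f).writerows(rows)

    # Keep the raw bytes so we can confirm that 0x96 is really in the file.
    with open(path, 'rb') as f:
        raw = f.read()

    # Reading with the matching encoding round-trips the text.
    with open(path, 'r', encoding='mac_roman', newline='') as f:
        read_back = list(csv.reader(f))

    # Reading the same bytes as UTF-8 fails, as in the question.
    try:
        with open(path, 'r', encoding='utf-8', newline='') as f:
            f.read()
        utf8_error = None
    except UnicodeDecodeError as exc:
        utf8_error = exc

print(b'\x96' in raw)
print(read_back)
print(utf8_error)
```

If the last line prints an "invalid start byte" message for 0x96, the sample reproduces the original error, and the mac_roman read shows the fix.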

Solution 2

with open('my_file.csv', 'r', newline='', encoding='ISO-8859-1') as csvfile:

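The ñ character can be represented in UTF-8; the error means that this file was not saved as UTF-8. To get past the error, you may use the ISO-8859-1 encoding instead. Note that ISO-8859-1 accepts every byte value, so it never raises a decoding error, but the result is only correct if the file was really saved in that encoding, so check that the ñ displays properly after reading. For more details about this encoding, you may refer to the link below: https://www.ic.unicamp.br/~stolfi/EXPORT/www/ISO-8859-1-Encoding.html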
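
Before settling on an encoding, it may help to look at what the byte from the error message turns into under a few single-byte encodings. This is only a sketch using bytes.decode and str.encode; including cp1252 as a third comparison point is my own choice, not something taken from the answers.

```python
raw = b'\x96'

# Decode the same byte under three single-byte encodings.
decoded = {enc: raw.decode(enc) for enc in ('iso-8859-1', 'cp1252', 'mac_roman')}

for enc, ch in decoded.items():
    # Show the resulting code point next to its repr.
    print('{:<11} U+{:04X} {!r}'.format(enc, ord(ch), ch))

# Where 'ñ' is stored in each encoding.
latin1_bytes = 'ñ'.encode('iso-8859-1')
mac_bytes = 'ñ'.encode('mac_roman')
utf8_bytes = 'ñ'.encode('utf-8')
print(latin1_bytes, mac_bytes, utf8_bytes)
```

For this particular byte, ISO-8859-1 gives the control character U+0096 and cp1252 gives an en dash; only mac_roman gives ñ, which matches the list in Solution 1. In ISO-8859-1 itself, ñ is the byte 0xF1.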

Solution 3

For others who hit the same error shown in the subject, watch out for the file encoding of your CSV file. It's possible that it is not utf-8. I just noticed that LibreOffice created a utf-16 encoded file for me today without prompting me, although I could not reproduce this.

If you try to open a utf-16 encoded document using open(... encoding='utf-8'), you will get the error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

To fix this, either specify the 'utf-16' encoding when opening the file or change the encoding of the CSV file.
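
A byte 0xff at position 0 is consistent with a UTF-16 byte order mark (BOM), so one way to check before opening is to look at the first few bytes of the file. The helper below is a rough sketch: the name guess_encoding_from_bom is hypothetical, and it only recognises files that actually start with a BOM. The BOM constants come from the standard codecs module.

```python
import codecs
import os
import tempfile

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]

def guess_encoding_from_bom(path):
    """Return an encoding name if the file starts with a known BOM, else None."""
    with open(path, 'rb') as f:
        head = f.read(4)
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None

# Demonstration with a throwaway file (hypothetical contents).
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'sample_utf16.csv')
    with open(path, 'w', encoding='utf-16', newline='') as f:
        f.write('id,name\r\n1,Muñoz\r\n')

    guessed = guess_encoding_from_bom(path)
    with open(path, 'r', encoding=guessed, newline='') as f:
        text = f.read()

print(guessed)
```

A return value of None only means that no BOM was found; the file could still be in any BOM-less encoding, so the search from Solution 1 is still useful in that case.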

Solution 4

I also faced this issue with Python 3, and it was resolved by using utf-16 as the encoding:

with open('data.csv', newline='', encoding='utf-16') as csvfile:

Solution 5

Easy: just open the file in Excel or OpenOffice Calc, use Text to Columns, select the comma (,) as the separator, and then save the file as .csv. It took me a day and several hours of searching on Google, but in the end I figured it out.
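
If a spreadsheet round trip is not convenient, the same kind of re-saving can be sketched in Python once you know the source encoding. The function name reencode_to_utf8, the file names, and the choice of mac_roman as the source are all assumptions for illustration.

```python
import os
import tempfile

def reencode_to_utf8(src_path, dst_path, src_encoding):
    """Copy a text file, decoding with src_encoding and writing UTF-8."""
    # newline='' on both sides keeps the original line endings untouched.
    with open(src_path, 'r', encoding=src_encoding, newline='') as src, \
         open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        for line in src:
            dst.write(line)

# Demonstration with throwaway files (hypothetical contents).
with tempfile.TemporaryDirectory() as tmpdir:
    src = os.path.join(tmpdir, 'legacy.csv')
    dst = os.path.join(tmpdir, 'legacy_utf8.csv')

    # Create a Mac Roman file in which 'ñ' is the single byte 0x96.
    with open(src, 'wb') as f:
        f.write('1,Muñoz\r\n'.encode('mac_roman'))

    reencode_to_utf8(src, dst, 'mac_roman')

    with open(dst, 'rb') as f:
        converted = f.read()

print(converted)
```

After the conversion, the file can be opened with the default utf-8 codec, provided the source encoding passed in was the right one.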

Author by

HLH

Updated on November 16, 2021

Comments

  • HLH
    HLH over 2 years

    I have the following code in Python 3, which is meant to print out each line in a csv file.

    import csv
    with open('my_file.csv', 'r', newline='') as csvfile:
        lines = csv.reader(csvfile, delimiter = ',', quotechar = '|')
        for line in lines:
            print(' '.join(line))
    

    But when I run it, it gives me this error:

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 7386: invalid start byte
    

    I looked through the csv file, and it turns out that if I take out a single ñ (little n with a tilde on top), every line prints out fine.

    My problem is that I've looked through a bunch of different solutions to similar problems, but I still have no idea how to fix this, what to decode/encode, etc. Simply taking out the ñ character in the data is NOT an option.

  • Wooble
    Wooble over 10 years
    That won't work, because the error message indicates it's already trying to use the UTF-8 codec.
  • ezdazuzena
    ezdazuzena almost 8 years
    ..though another encoding might work. In my case latin-1 did the job
  • ParisNakitaKejser
    ParisNakitaKejser over 6 years
    It works for me, but why mac_roman and not utf-8 as the encoding?
  • Tom
    Tom about 5 years
    I had this exact problem. After pulling my hair out, I found this suggestion. FWIW, if you are using Excel 2013+, save the file as "CSV (MS DOS)"
  • Shalini Baranwal
    Shalini Baranwal about 5 years
    Wonderful answer, I was able to solve the problem with the mac_roman encoding too.
  • Marcel
    Marcel almost 5 years
    I don't understand why this answer has downvotes. Setting the correct file encoding definitely solved the issue.
  • tgraybam
    tgraybam almost 4 years
    +1 Yes, this is a common gotcha. For CSV files, if Excel saved a file with some 'utf-16' encoding you didn't want then even when you've removed the offending unicode characters you want to make sure the file format is 'CSV UTF-8 (Comma delimited) (.csv)' when you save it (or 'save as' it).
  • p7adams
    p7adams over 3 years
    great explanation!