'utf-8' codec can't decode byte 0xa0 in position 4276: invalid start byte

Solution 1

The file appears to be encoded as 'windows-1252', not UTF-8. Pass the encoding explicitly:

open('txt.tsv', encoding='windows-1252')
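
Applied to the reading loop from the question, this could look like the sketch below. The helper name `iter_tsv_rows` is hypothetical, and 'windows-1252' is the guess made in this answer rather than anything the SEC documents. Passing `newline=""` follows the csv module's own recommendation for files handed to a reader.

```python
import csv


def iter_tsv_rows(path, encoding="windows-1252"):
    """Yield each row of a tab-delimited file as a dict keyed by the header."""
    # newline="" lets the csv module handle line endings itself.
    with open(path, encoding=encoding, newline="") as tsvfile:
        reader = csv.DictReader(tsvfile, dialect="excel-tab")
        yield from reader
```

Calling `for row in iter_tsv_rows('txt.tsv'): print(row)` then replaces the loop in the question. Because the function is a generator, rows are read lazily, which may matter for a file of this size.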

Solution 2

If you are working with Turkish data, I suggest this line instead:

df = pd.read_csv("text.txt", encoding='windows-1254')
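
To see why the choice between the two code pages matters, here is a minimal sketch. The word "ışık" is just an illustrative sample, not data from any file mentioned here. Both decodes succeed, but only one gives the intended text, because windows-1254 assigns Turkish letters to byte values that windows-1252 uses for Icelandic ones.

```python
# Bytes for the Turkish word "ışık" as they would be stored in a windows-1254 file.
raw = "ışık".encode("windows-1254")

as_turkish = raw.decode("windows-1254")  # the intended text
as_western = raw.decode("windows-1252")  # no error, but the wrong letters

print(as_turkish)
print(as_western)
```

The absence of an exception therefore does not prove that the encoding is right; the decoded text still has to be checked by eye.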

Solution 3

ds = pd.read_csv('/Dataset/test.csv', encoding='windows-1252') 

Works fine for me, thanks.

Solution 4

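I had the same error message with a .csv file, and this worked for me (note that Python only recognizes 'ANSI' on Windows, where it is an alias for the system code page codec 'mbcs'):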

     df = pd.read_csv('Text.csv',encoding='ANSI')

Solution 5

I ran into the same issue, and reading the file with the latin1 encoding worked for me. Try the sample below if the solutions above do not work. Note that `missing` is not defined in this snippet; it is presumably your own collection of NA markers.

df = pd.read_csv("../CSV_FILE.csv", na_values=missing, encoding='latin1')
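
The "try the next encoding if this one fails" approach running through these answers can be sketched as a small helper. The function name `first_decodable_encoding` and the candidate order are assumptions made for illustration. latin1 goes last because it maps every byte value to a character and so never fails.

```python
def first_decodable_encoding(path, candidates=("utf-8", "windows-1252", "latin1")):
    """Return the first candidate encoding that decodes the whole file without error."""
    # Reads the entire file into memory; fine for a sketch, costly for huge files.
    with open(path, "rb") as f:
        data = f.read()
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot represent some byte; try the next
        return name
    raise ValueError("none of the candidate encodings decoded the file")
```

The returned name can then be passed as `encoding=` to `open` or `pd.read_csv`. A successful decode only means the bytes are valid in that encoding, not that the resulting text is correct.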
Author by

Vital

Updated on February 09, 2022

Comments

  • Vital
    Vital about 2 years

    I am trying to read and print the following file: txt.tsv (from https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2017q3_notes.zip)

    According to the SEC the data set is provided in a single encoding, as follows:

    Tab Delimited Value (.txt): utf-8, tab-delimited, \n- terminated lines, with the first line containing the field names in lowercase.

    My current code:

    import csv
    
    with open('txt.tsv') as tsvfile:
        reader = csv.DictReader(tsvfile, dialect='excel-tab')
        for row in reader:
            print(row)
    

    All attempts ended with the following error message:

    'utf-8' codec can't decode byte 0xa0 in position 4276: invalid start byte

    I am a bit lost. Can anyone help me? Many thanks in advance.

  • Vital
    Vital over 6 years
    Thank you very much!! That works! May I ask you why it works with 'windows-1252' although the SEC states it is 'utf-8'?
  • ShadowRanger
    ShadowRanger over 6 years
    Are you sure it's cp1252? The file I downloaded appeared to be ASCII. If it's not UTF-8, and not ASCII, it could be literally any single-byte-per-character ASCII superset and you'd only be able to guess at the encoding heuristically (it would successfully decode under any of them, but the results might be garbage).
  • koPytok
    koPytok over 6 years
    @Vital Better to ask the SEC.
  • koPytok
    koPytok over 6 years
    @ShadowRanger An encoding detector reported cp1252, and the result looks legitimate.
  • tripleee
    tripleee over 6 years
    This has the potential of producing invalid results. CP-1252 will happily decode anything (audio data, core dumps, zip archives) and pretend it's all valid text.
  • tripleee
    tripleee over 6 years
    Casual inspection of my download of txt.tsv shows no 0xa0 character at the offset given in the question, but plenty of 0xa0 characters that apparently represent hard (non-breaking) spaces, 0xac characters in positions that suggest a currency indicator, and 0xae, which is apparently the ® symbol. This is almost consistent with CP1252 or ISO-8859-1 (which of course are very similar), but the 0xac doesn't fit with either. Maybe see also cdn.rawgit.com/tripleee/8bit/master/encodings.html#ac (cough.)
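
The kind of byte-level inspection described in the last comment can be approximated with the sketch below. Both helper names are hypothetical, and `sample` is made-up data standing in for the real file, which would be read with `open('txt.tsv', 'rb').read()`. The first helper counts every byte value of 0x80 or above; the second shows what one such byte becomes under each candidate encoding. Python's cp1252 codec leaves a few byte values undefined (0x81, for example), so the helper records `None` for those.

```python
from collections import Counter


def high_byte_counts(data):
    """Count each byte value >= 0x80 in a bytes object."""
    return Counter(b for b in data if b >= 0x80)


def decode_candidates(byte_value, encodings=("cp1252", "latin-1", "cp1254")):
    """Map one byte value to the character it decodes to under each encoding."""
    result = {}
    for name in encodings:
        try:
            result[name] = bytes([byte_value]).decode(name)
        except UnicodeDecodeError:
            result[name] = None  # byte is undefined in this encoding
    return result


# Made-up sample containing the three byte values mentioned in the comment.
sample = b"Total\xa0assets \xac 100 Acme\xae"
for value, count in sorted(high_byte_counts(sample).items()):
    print(hex(value), count, decode_candidates(value))
```

For these three bytes all the candidate encodings agree: 0xa0 is a non-breaking space, 0xae is ®, and 0xac is the "not" sign ¬, which matches the observation that 0xac does not fit a currency symbol in either CP1252 or ISO-8859-1.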