'utf-8' codec can't decode byte 0xa0 in position 4276: invalid start byte

python csv encoding utf-8

Solution 1

Encoding in the file is 'windows-1252'. Use:

open('txt.tsv', encoding='windows-1252')
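
Applied to the reader from the question, that might look like the sketch below. The helper name `read_tsv_rows` is made up for illustration, and `windows-1252` is a guess about this particular file rather than anything the SEC documents.

```python
import csv

def read_tsv_rows(path, encoding="windows-1252"):
    """Yield each row of a tab-delimited file as a dict (hypothetical helper)."""
    # newline="" is what the csv module docs recommend when opening files for csv.
    with open(path, encoding=encoding, newline="") as tsvfile:
        reader = csv.DictReader(tsvfile, dialect="excel-tab")
        for row in reader:
            yield row
```

Calling `for row in read_tsv_rows('txt.tsv'): print(row)` then replaces the loop in the question.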

Solution 2

If you are working with Turkish data, try windows-1254 instead:

df = pd.read_csv("text.txt", encoding='windows-1254')

Solution 3

ds = pd.read_csv('/Dataset/test.csv', encoding='windows-1252')

Works fine for me, thanks.

Solution 4

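I had the same error message for a .csv file, and this worked for me (note that in Python 'ANSI' is an alias for the Windows-only mbcs codec, so this will not work on other platforms):

df = pd.read_csv('Text.csv', encoding='ANSI')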

Solution 5

I ran into the same issue, and reading the file with the latin1 encoding worked. Give it a try if the solutions above don't work for you. In the sample below, `missing` is presumably the answerer's own list of NA markers; drop that argument if you don't need it.

df = pd.read_csv("../CSV_FILE.csv", na_values=missing, encoding='latin1')

Author by

Vital

Updated on February 09, 2022

Comments

Vital about 2 years
I am trying to read and print the following file: txt.tsv (https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2017q3_notes.zip)

According to the SEC the data set is provided in a single encoding, as follows:

Tab Delimited Value (.txt): utf-8, tab-delimited, \n- terminated lines, with the first line containing the field names in lowercase.

My current code:
```
import csv

with open('txt.tsv') as tsvfile:
    reader = csv.DictReader(tsvfile, dialect='excel-tab')
    for row in reader:
        print(row)
```
All attempts ended with the following error message:

'utf-8' codec can't decode byte 0xa0 in position 4276: invalid start byte

I am a bit lost. Can anyone help me? Many thanks in advance.
Vital over 6 years

Thank you very much!! That works! May I ask you why it works with 'windows-1252' although the SEC states it is 'utf-8'?
ShadowRanger over 6 years

Are you sure it's cp1252? The file I downloaded appeared to be ASCII. If it's not UTF-8, and not ASCII, it could be literally any single-byte-per-character ASCII superset and you'd only be able to guess at the encoding heuristically (it would successfully decode under any of them, but the results might be garbage).
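
To make that concrete, here is a small sketch using a made-up byte string (not data from the SEC file). UTF-8 rejects a lone 0xa0, while several single-byte encodings all accept it and simply disagree about which character it is. The list of candidates is an arbitrary choice for illustration.

```python
raw = b"price\xa0100"  # made-up sample containing a lone 0xa0 byte

# UTF-8 rejects it: 0xa0 is a continuation byte, not a valid start byte.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc.reason

# Each of these single-byte encodings decodes the same bytes without error,
# but they do not all map 0xa0 to the same character.
candidates = ["cp1252", "latin-1", "cp437", "koi8-r"]
decoded = {name: raw.decode(name) for name in candidates}

print("utf-8 error:", utf8_error)
for name, text in decoded.items():
    print(f"{name:8} {text!r}")
```

A successful decode therefore says nothing about whether the chosen encoding is the right one; only the plausibility of the resulting text does.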
koPytok over 6 years

@Vital Better ask SEC
koPytok over 6 years

@ShadowRanger encoding detector detected cp-1252 and the result seems to be legit
tripleee over 6 years

This has the potential of producing invalid results. CP-1252 will happily decode anything (audio data, core dumps, zip archives) and pretend it's all valid text.
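
A quick way to see how permissive these codecs are in Python is to try every byte value on its own, as in the sketch below (the function name `undecodable` is made up). In CPython, `latin-1` accepts all 256 byte values, while `cp1252` leaves a handful undefined and raises on them, so it will reject some binary data but still accepts the vast majority of byte sequences.

```python
def undecodable(encoding):
    """Return the byte values that `encoding` refuses to decode on their own."""
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            bad.append(value)
    return bad

latin1_bad = undecodable("latin-1")
cp1252_bad = undecodable("cp1252")

print("latin-1 rejects:", [hex(b) for b in latin1_bad])
print("cp1252 rejects: ", [hex(b) for b in cp1252_bad])
```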
tripleee over 6 years

Casual inspection of my download of txt.tsv indicates no 0xa0 character at the offset indicated in the question, but plenty of 0xa0 characters which are apparently representing hard spaces, and 0xac characters in a position which indicates a currency indicator as well as 0xae which apparently is the ® symbol. This is almost consistent with CP1252 or ISO-8859-1 (which of course are very similar), but the 0xac doesn't fit with either. Maybe see also cdn.rawgit.com/tripleee/8bit/master/encodings.html#ac (cough.)
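
This kind of inspection can be sketched roughly as follows, working on raw bytes so that no encoding has to be assumed up front. The helper names and the sample bytes are made up for illustration; they are not taken from the SEC file.

```python
from collections import Counter

def non_ascii_histogram(data):
    """Count how often each byte value >= 0x80 occurs in a bytes object."""
    return Counter(b for b in data if b >= 0x80)

def first_offset(data, value):
    """Return the offset of the first occurrence of a byte value, or -1."""
    return data.find(bytes([value]))

# Made-up sample bytes; for a real file use open(path, "rb").read() instead.
sample = b"Acme\xae Corp\xa0\xac100\xa0"
hist = non_ascii_histogram(sample)

for value, count in hist.most_common():
    print(hex(value), count)
print("first 0xa0 at offset", first_offset(sample, 0xA0))
```

Looking at which high bytes occur, and in what context, is usually more informative than the single offset reported in the error message.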