'utf-8' codec can't decode byte 0xa0 in position 4276: invalid start byte
Solution 1
The file is encoded as 'windows-1252', not UTF-8. Use:
open('txt.tsv', encoding='windows-1252')
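A minimal sketch of how that call fits into the csv.DictReader loop from the question, assuming the data really is windows-1252. The file name sample.tsv and its contents are made up so the snippet runs on its own; under windows-1252 the byte 0xa0 decodes to a no-break space.

```python
import csv
import os
import tempfile

# Hypothetical two-column TSV containing byte 0xa0, which cannot start a UTF-8 sequence.
raw = b"name\tvalue\nfoo\xa0bar\t1\n"

path = os.path.join(tempfile.mkdtemp(), "sample.tsv")
with open(path, "wb") as f:
    f.write(raw)

# newline="" is what the csv module documentation recommends for files passed to a reader.
with open(path, encoding="windows-1252", newline="") as tsvfile:
    reader = csv.DictReader(tsvfile, dialect="excel-tab")
    rows = list(reader)

for row in rows:
    print(row)
```

Opening the same file without the encoding argument on a system whose default is UTF-8 should raise the UnicodeDecodeError from the question.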
Solution 2
If you are working with Turkish data, I suggest this line instead:
df = pd.read_csv("text.txt", encoding='windows-1254')
Solution 3
ds = pd.read_csv('/Dataset/test.csv', encoding='windows-1252')
Works fine for me, thanks.
Solution 4
I had the same error message with a .csv file, and this worked for me (note that 'ANSI' is an alias for Python's 'mbcs' codec, which is only available on Windows):
df = pd.read_csv('Text.csv', encoding='ANSI')
Solution 5
I ran into the same issue, and it went away once I used the latin1 encoding. See the sample code below and adapt it to your codebase. Give it a try if the solutions above don't work.
df = pd.read_csv("../CSV_FILE.csv", na_values=missing, encoding='latin1')
Vital
Updated on February 09, 2022
Comments
-
Vital about 2 years
I am trying to read and print the following file: txt.tsv (from https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2017q3_notes.zip)
According to the SEC the data set is provided in a single encoding, as follows:
Tab Delimited Value (.txt): utf-8, tab-delimited, \n-terminated lines, with the first line containing the field names in lowercase.
My current code:
import csv

with open('txt.tsv') as tsvfile:
    reader = csv.DictReader(tsvfile, dialect='excel-tab')
    for row in reader:
        print(row)
All attempts ended with the following error message:
'utf-8' codec can't decode byte 0xa0 in position 4276: invalid start byte
I am a bit lost. Can anyone help me? Many thanks in advance.
-
Vital over 6 years
Thank you very much!! That works! May I ask why it works with 'windows-1252' although the SEC states it is 'utf-8'?
-
ShadowRanger over 6 years
Are you sure it's cp1252? The file I downloaded appeared to be ASCII. If it's not UTF-8, and not ASCII, it could be literally any single-byte-per-character ASCII superset and you'd only be able to guess at the encoding heuristically (it would successfully decode under any of them, but the results might be garbage).
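One simple way to do that kind of guessing is to try strict UTF-8 first and fall back to a single-byte encoding only when it fails. This is just a sketch: decode_with_fallback and its candidate list are made up for illustration, and a successful decode does not prove the guess is right.

```python
def decode_with_fallback(data, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes without error."""
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    # Unreachable with the default list, because latin-1 accepts every byte value.
    raise ValueError("none of the candidate encodings could decode the data")


text, used = decode_with_fallback(b"caf\xc3\xa9")       # valid UTF-8
print(used, repr(text))

text2, used2 = decode_with_fallback(b"hard\xa0space")   # 0xa0 alone is invalid UTF-8
print(used2, repr(text2))
```

The order matters: UTF-8 goes first because it is strict and rarely accepts text in another encoding by accident, while the single-byte codecs accept almost anything.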
-
koPytok over 6 years
@Vital Better ask the SEC
-
koPytok over 6 years
@ShadowRanger An encoding detector detected cp-1252, and the result seems to be legit
-
tripleee over 6 years
This has the potential of producing invalid results. CP-1252 will happily decode anything (audio data, core dumps, zip archives) and pretend it's all valid text.
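A small sketch of this point using Python's built-in codecs. latin-1 assigns a character to every byte value, so decoding can never fail; Python's cp1252 codec is nearly as permissive, leaving only a few byte values unassigned. Either way, a decode that succeeds says very little about whether the encoding was the right one.

```python
all_bytes = bytes(range(256))

# latin-1 maps every byte value to a code point, so this never raises.
as_latin1 = all_bytes.decode("latin-1")

# Collect the byte values that Python's cp1252 codec refuses to decode.
rejected = []
for b in range(256):
    try:
        bytes([b]).decode("cp1252")
    except UnicodeDecodeError:
        rejected.append(hex(b))

print(len(as_latin1))
print(rejected)
```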
-
tripleee over 6 years
Casual inspection of my download of txt.tsv indicates no 0xa0 character at the offset indicated in the question, but plenty of 0xa0 characters which are apparently representing hard spaces, and 0xac characters in a position which indicates a currency indicator as well as 0xae which apparently is the ® symbol. This is almost consistent with CP1252 or ISO-8859-1 (which of course are very similar), but the 0xac doesn't fit with either. Maybe see also cdn.rawgit.com/tripleee/8bit/master/encodings.html#ac (cough.)
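An inspection like this can be roughly reproduced by reading the file in binary mode and counting the bytes outside the ASCII range. The sketch below runs on a made-up sample rather than the real txt.tsv, and high_byte_counts is a hypothetical helper name.

```python
from collections import Counter


def high_byte_counts(data):
    """Count each byte value >= 0x80 in a bytes object."""
    return Counter(b for b in data if b >= 0x80)


# Made-up sample standing in for the contents of txt.tsv read in binary mode.
sample = b"Total\xa0assets\t\xac100\tAcme\xae\tNet\xa0income\n"

counts = high_byte_counts(sample)
for value, n in sorted(counts.items()):
    print(hex(value), n)
```

For the real file, something like open('txt.tsv', 'rb').read() would supply the data. The resulting byte values can then be compared against the candidate code pages.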