UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte, while reading csv file in pandas
Solution 1
It's still most likely gzipped data. gzip's magic number is 0x1f 0x8b, which is consistent with the UnicodeDecodeError you get: the byte 0x8b at position 1 is the second byte of the file.
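A quick way to confirm this before changing any pandas call is to look at the first two bytes of the file. The following is only a minimal sketch: the helper name `looks_gzipped` and the throwaway file names are made up for illustration, and the demo writes its own small files rather than touching the real dataset.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of a gzip stream


def looks_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic number."""
    with open(path, "rb") as fd:
        return fd.read(2) == GZIP_MAGIC


# Demo on temporary files; with real data you would pass 'destinations.csv'.
with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "plain.csv")
    gzipped_path = os.path.join(tmp, "mislabelled.csv")  # gzip data with a .csv name

    with open(plain_path, "wb") as fd:
        fd.write(b"a,b\n1,2\n")
    with gzip.open(gzipped_path, "wb") as fd:
        fd.write(b"a,b\n1,2\n")

    plain_result = looks_gzipped(plain_path)
    gzipped_result = looks_gzipped(gzipped_path)

print(plain_result, gzipped_result)
```

If the check returns True for a file named `.csv`, the extension is misleading and the file needs to be decompressed before parsing.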
You could try decompressing the data on the fly:
import gzip
import pandas as pd

with open('destinations.csv', 'rb') as fd:
    gzip_fd = gzip.GzipFile(fileobj=fd)
    destinations = pd.read_csv(gzip_fd)
Or use pandas' built-in gzip support:
destinations = pd.read_csv('destinations.csv', compression='gzip')
Solution 2
Try passing this encoding when reading the CSV file:
pd.read_csv('csv_file', encoding='ISO-8859-1')
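One caveat with this approach: Latin-1 (ISO-8859-1) maps every possible byte to a character, so decoding never fails, even when the input is not text at all. The sketch below uses a tiny made-up CSV compressed in memory to show that UTF-8 rejects gzip data at position 1 while Latin-1 silently accepts it. That would be consistent with the ParserError reported in the question after switching to `encoding='latin1'`.

```python
import gzip

# Hypothetical two-line CSV, compressed in memory to stand in for a gzipped file.
raw = gzip.compress(b"id,value\n0,1.5\n")

try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc  # UTF-8 refuses the second byte of the gzip header

# Latin-1 never raises, but the result is the compressed bytes read as characters.
latin1_text = raw.decode("latin-1")

print(utf8_error)
print(len(raw), len(latin1_text))
```

So a successful read with Latin-1 does not by itself mean the file was plain text; it is worth checking the first rows of the resulting DataFrame.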
Solution 3
You can try opening the file with codecs:
import codecs
import pandas as pd

with codecs.open("destinations.csv", "r", encoding='utf-8', errors='ignore') as file_data:
    destinations = pd.read_csv(file_data)
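Note that `errors='ignore'` does not repair anything: bytes that are not valid UTF-8 are dropped without warning. A small sketch with a made-up byte string shows the difference between `ignore` and `replace`:

```python
# "café,1" encoded as Latin-1, so the byte 0xe9 is not valid UTF-8 here.
data = b"caf\xe9,1\n"

ignored = data.decode("utf-8", errors="ignore")    # the offending byte is dropped
replaced = data.decode("utf-8", errors="replace")  # it becomes U+FFFD instead

print(repr(ignored))
print(repr(replaced))
```

On gzipped input this means the parser receives a stream with pieces silently missing, so the resulting table is unlikely to be meaningful even if no exception is raised.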
Comments
-
shubham_827 almost 2 years
I know similar questions have been asked already; I have seen all of them and tried their suggestions, but they were of little help. I am using OSX 10.11 El Capitan and Python 3.6 in a virtual environment (I also tried without it). I am using Jupyter Notebook and Spyder 3.
I am new to Python, but I know basic ML and am following a post to learn how to solve Kaggle challenges: Link to Blog, Link to Data Set.
I am stuck at the first few lines of code:
import pandas as pd
destinations = pd.read_csv("destinations.csv")
test = pd.read_csv("test.csv")
train = pd.read_csv("train.csv")
and it is giving me this error:
UnicodeDecodeError Traceback (most recent call last) <ipython-input-19-a928a98eb1ff> in <module>() 1 import pandas as pd ----> 2 df = pd.read_csv('destinations.csv', compression='infer',date_parser=True, usecols=([0,1,3])) 3 df.head() /usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision) 653 skip_blank_lines=skip_blank_lines) 654 --> 655 return _read(filepath_or_buffer, kwds) 656 657 parser_f.__name__ = name /usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds) 403 404 # Create the parser. 
--> 405 parser = TextFileReader(filepath_or_buffer, **kwds) 406 407 if chunksize or iterator: /usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds) 762 self.options['has_index_names'] = kwds['has_index_names'] 763 --> 764 self._make_engine(self.engine) 765 766 def close(self): /usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in _make_engine(self, engine) 983 def _make_engine(self, engine='c'): 984 if engine == 'c': --> 985 self._engine = CParserWrapper(self.f, **self.options) 986 else: 987 if engine == 'python': /usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds) 1603 kwds['allow_leading_cols'] = self.index_col is not False 1604 -> 1605 self._reader = parsers.TextReader(src, **kwds) 1606 1607 # XXX pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__ (pandas/_libs/parsers.c:6175)() pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._get_header (pandas/_libs/parsers.c:9691)() UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
Some answers on Stack Overflow suggested that this happens because the file is gzipped, but Chrome downloaded a .csv file; no .csv.gz was anywhere to be seen, and trying to read one returned a file-not-found error.
I then read somewhere to use
encoding='latin1'
, but after doing this I am getting parser error:--------------------------------------------------------------------------- ParserError Traceback (most recent call last) <ipython-input-21-f9c451f864a2> in <module>() 1 import pandas as pd 2 ----> 3 destinations = pd.read_csv("destinations.csv",encoding='latin1') 4 test = pd.read_csv("test.csv") 5 train = pd.read_csv("train.csv") /usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision) 653 skip_blank_lines=skip_blank_lines) 654 --> 655 return _read(filepath_or_buffer, kwds) 656 657 parser_f.__name__ = name /usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds) 409 410 try: --> 411 data = parser.read(nrows) 412 finally: 413 parser.close() /usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows) 1003 raise ValueError('skipfooter not supported for iteration') 1004 -> 1005 ret = self._engine.read(nrows) 1006 1007 if self.options.get('as_recarray'): /usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows) 1746 def read(self, nrows=None): 1747 try: -> 1748 data = self._reader.read(nrows) 1749 except StopIteration: 1750 if self._first_chunk: pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read (pandas/_libs/parsers.c:10862)() pandas/_libs/parsers.pyx in 
pandas._libs.parsers.TextReader._read_low_memory (pandas/_libs/parsers.c:11138)() pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows (pandas/_libs/parsers.c:11884)() pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows (pandas/_libs/parsers.c:11755)() pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error (pandas/_libs/parsers.c:28765)() ParserError: Error tokenizing data. C error: Expected 2 fields in line 11, saw 3
I have spent hours debugging this. I tried opening the CSV files in Atom (no other app could open them) and in online web apps (some crashed), but that did not help. I have also tried using the kernels of other people who have solved the problem, but that did not help either.
-
juanpa.arrivillaga almost 7 years
What's the separator?
-
shubham_827 almost 7 years
I don't know. I am new to all this. I just downloaded the dataset as given in the post and tried to execute the lines, but got an error. I don't know how to find out the separator; I have included the link at the top, so maybe you can find it. Thanks.
-
shubham_827 almost 7 years
I tried, but I am getting an error: ParserError: Error tokenizing data. C error: Expected 2 fields in line 11, saw 3
-
shubham_827 almost 7 years
Thanks! It worked. But I would like to know why this happened to me when other people who did the same thing didn't get the error. See, for example, this submission: kaggle.com/benjaminabel/pandas-version-of-most-popular-hotels
-
dorian almost 7 years
I would assume that submission only works on input files that are already unzipped.
-
shubham_827 almost 7 years
But my file shows a .csv extension, the same as what Chrome downloaded. So how is it possible for it to be zipped? Shouldn't it be .csv.gz?
-
dorian almost 7 years
Look, I don't know about the specifics of your or that other guy's browser. The thing that matters here is that if the file is gzipped, you need to decompress it before you feed it to pandas.
-
Goh Jia Yi about 2 years
Did not work in my case. Is there any explanation of the encoding value used?