UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte, while reading csv file in pandas


Solution 1

It's still most likely gzipped data. gzip's magic number is 0x1f 0x8b, which is consistent with the UnicodeDecodeError you get.

You could try decompressing the data on the fly:

import gzip
import pandas as pd

with open('destinations.csv', 'rb') as fd:
    gzip_fd = gzip.GzipFile(fileobj=fd)
    destinations = pd.read_csv(gzip_fd)

Or use pandas' built-in gzip support:

destinations = pd.read_csv('destinations.csv', compression='gzip')
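Before choosing either option, you can confirm the diagnosis by checking the first two bytes of the file against the gzip magic number mentioned above. This is a minimal sketch using only the standard library; the helper name `looks_gzipped` and the throwaway demo files are made up for illustration and are not part of the question's data.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic number."""
    with open(path, "rb") as fd:
        return fd.read(2) == GZIP_MAGIC


# Demo on two throwaway files: one gzipped despite its .csv name, one plain.
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "gzipped.csv")
plain_path = os.path.join(tmp_dir, "plain.csv")
with gzip.open(gz_path, "wb") as fd:
    fd.write(b"a,b\n1,2\n")
with open(plain_path, "wb") as fd:
    fd.write(b"a,b\n1,2\n")

print(looks_gzipped(gz_path))
print(looks_gzipped(plain_path))
```

If the check returns True for your file, passing compression='gzip' to pd.read_csv as shown above should be the right fix, regardless of the file's extension.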

Solution 2

Try passing this encoding when reading the CSV file:

pd.read_csv('csv_file', encoding='ISO-8859-1')
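A caveat on this approach: ISO-8859-1 (Latin-1) maps every possible byte to a character, so it never raises a decode error. If the file is actually gzipped, the error only moves from the decoder to the parser, which is consistent with the ParserError reported in the question. The sketch below illustrates this on the first bytes of a gzip header; the variable names are chosen for illustration only.

```python
# The first four bytes of a typical gzip stream: magic number 0x1f 0x8b,
# then the compression method (0x08 = deflate) and a flags byte.
header = b"\x1f\x8b\x08\x00"

# UTF-8 rejects 0x8b because it cannot start a character.
try:
    header.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

# Latin-1 accepts any byte, so the same data "decodes" into meaningless text.
latin1_text = header.decode("latin-1")

print(utf8_error)
print(repr(latin1_text))
```

In other words, a successful Latin-1 decode does not show that the file is text; it only shows that the decoder stopped complaining.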

Solution 3

You can try using codecs:

import codecs
import pandas as pd

with codecs.open("destinations.csv", "r", encoding='utf-8', errors='ignore') as file_data:
    destinations = pd.read_csv(file_data)

Author by

shubham_827

Curious Software Engineer

Updated on July 09, 2022

Comments

  • shubham_827
    shubham_827 almost 2 years

    I know similar questions have been asked already; I have seen all of them and tried their suggestions, but they were of little help. I am using OS X 10.11 El Capitan and Python 3.6 in a virtual environment (I also tried without one). I am using Jupyter Notebook and Spyder 3.

    I am new to Python, but I know basic ML and am following a post to learn how to solve Kaggle challenges: Link to Blog, Link to Data Set

    I am stuck at the first few lines of code:

    import pandas as pd
    
    destinations = pd.read_csv("destinations.csv")
    test = pd.read_csv("test.csv")
    train = pd.read_csv("train.csv")
    

    and it gives me this error:

    UnicodeDecodeError                        Traceback (most recent call last)
    <ipython-input-19-a928a98eb1ff> in <module>()
          1 import pandas as pd
    ----> 2 df = pd.read_csv('destinations.csv', compression='infer',date_parser=True, usecols=([0,1,3]))
          3 df.head()
    
    /usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
        653                     skip_blank_lines=skip_blank_lines)
        654 
    --> 655         return _read(filepath_or_buffer, kwds)
        656 
        657     parser_f.__name__ = name
    
    /usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
        403 
        404     # Create the parser.
    --> 405     parser = TextFileReader(filepath_or_buffer, **kwds)
        406 
        407     if chunksize or iterator:
    
    /usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
        762             self.options['has_index_names'] = kwds['has_index_names']
        763 
    --> 764         self._make_engine(self.engine)
        765 
        766     def close(self):
    
    /usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
        983     def _make_engine(self, engine='c'):
        984         if engine == 'c':
    --> 985             self._engine = CParserWrapper(self.f, **self.options)
        986         else:
        987             if engine == 'python':
    
    /usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
       1603         kwds['allow_leading_cols'] = self.index_col is not False
       1604 
    -> 1605         self._reader = parsers.TextReader(src, **kwds)
       1606 
       1607         # XXX
    
    pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__ (pandas/_libs/parsers.c:6175)()
    
    pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._get_header (pandas/_libs/parsers.c:9691)()
    
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
    

    Some answers on Stack Overflow suggested that this happens because the file is gzipped, but Chrome downloaded a .csv file; there was no .csv.gz anywhere, and trying to open one returned a file-not-found error.

    I then read somewhere to use encoding='latin1', but after doing this I am getting a parser error:

    ---------------------------------------------------------------------------
    ParserError                               Traceback (most recent call last)
    <ipython-input-21-f9c451f864a2> in <module>()
          1 import pandas as pd
          2 
    ----> 3 destinations = pd.read_csv("destinations.csv",encoding='latin1')
          4 test = pd.read_csv("test.csv")
          5 train = pd.read_csv("train.csv")
    
    /usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
        653                     skip_blank_lines=skip_blank_lines)
        654 
    --> 655         return _read(filepath_or_buffer, kwds)
        656 
        657     parser_f.__name__ = name
    
    /usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
        409 
        410     try:
    --> 411         data = parser.read(nrows)
        412     finally:
        413         parser.close()
    
    /usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows)
       1003                 raise ValueError('skipfooter not supported for iteration')
       1004 
    -> 1005         ret = self._engine.read(nrows)
       1006 
       1007         if self.options.get('as_recarray'):
    
    /usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows)
       1746     def read(self, nrows=None):
       1747         try:
    -> 1748             data = self._reader.read(nrows)
       1749         except StopIteration:
       1750             if self._first_chunk:
    
    pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read (pandas/_libs/parsers.c:10862)()
    
    pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory (pandas/_libs/parsers.c:11138)()
    
    pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows (pandas/_libs/parsers.c:11884)()
    
    pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows (pandas/_libs/parsers.c:11755)()
    
    pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error (pandas/_libs/parsers.c:28765)()
    
    ParserError: Error tokenizing data. C error: Expected 2 fields in line 11, saw 3
    

    I have spent hours debugging this. I tried to open the CSV files in Atom (no other app could open them) and in online web apps (some crashed), but nothing helped. I have also tried using the kernels of other people who have solved the problem, but that did not help either.

    • juanpa.arrivillaga
      juanpa.arrivillaga almost 7 years
      What's the separator?
    • shubham_827
      shubham_827 almost 7 years
      I don't know. I am new to all this. I just downloaded the dataset as given in the post and tried to execute the lines, but got an error. I don't know how to find out the separator; I have mentioned the link at the top, so maybe you can find it. Thanks
  • shubham_827
    shubham_827 almost 7 years
    I tried, but I am getting an error: ParserError: Error tokenizing data. C error: Expected 2 fields in line 11, saw 3
  • shubham_827
    shubham_827 almost 7 years
    Thanks! It worked. But I wanted to know one thing: why did it happen to me, when other people who did this didn't get the error? For example, see this submission: kaggle.com/benjaminabel/pandas-version-of-most-popular-hotels
  • dorian
    dorian almost 7 years
    I would assume that submission only works on input files that are already unzipped.
  • shubham_827
    shubham_827 almost 7 years
    But my file shows a .csv extension, the same as what Chrome downloaded. So how is it possible for it to be zipped? Shouldn't it be .csv.gz?
  • dorian
    dorian almost 7 years
    Look, I don't know about the specifics of your or that other guy's browser. The thing that matters here is that if the file is gzipped, you need to decompress it before you feed it to pandas.
  • Goh Jia Yi
    Goh Jia Yi about 2 years
    Did not work for my case. Any explanation on the encoding value used?
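As dorian's comment says, a gzipped file has to be decompressed before pandas can parse it, whatever its extension is. For a file path, pandas' compression='infer' goes by the file extension, which would explain why a gzipped file named .csv is not detected automatically. One option is to decompress the file once and write a plain CSV to disk. The following is a minimal sketch using only the standard library; the helper name `gunzip_to` and the demo file names are assumptions made for illustration.

```python
import gzip
import os
import shutil
import tempfile


def gunzip_to(src_path, dst_path):
    """Decompress the gzip file at src_path into a plain file at dst_path."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams in chunks, so large files are fine


# Demo: create a gzipped file that carries a misleading .csv name, then unpack it.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "misnamed.csv")
dst_path = os.path.join(tmp_dir, "plain.csv")
with gzip.open(src_path, "wb") as fd:
    fd.write(b"a,b\n1,2\n")

gunzip_to(src_path, dst_path)
with open(dst_path, "rb") as fd:
    plain_bytes = fd.read()
print(plain_bytes)
```

After this step, the decompressed file can be passed to pd.read_csv without any compression or encoding arguments, assuming the underlying data is UTF-8 text.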