Python decoding excel sheet without pandas


An .xlsx file is a ZIP archive of XML parts, so you need to unzip it before you can read its contents (assuming that is the format you are using).
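As a minimal sketch of that approach, assuming the file really is an .xlsx, the standard-library `zipfile` and `xml.etree.ElementTree` modules are enough to pull cell text out of one sheet. The function name `read_xlsx_rows` and the default part name `xl/worksheets/sheet1.xml` are assumptions made for illustration: that path is the usual location of the first sheet, but the format does not guarantee it.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the main SpreadsheetML parts inside an .xlsx archive.
MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NS = {"m": MAIN_NS}
T_TAG = "{%s}t" % MAIN_NS  # text-run element, in ElementTree's {uri}name form


def read_xlsx_rows(path, sheet_part="xl/worksheets/sheet1.xml"):
    """Return the rows of one worksheet as lists of strings.

    Hypothetical helper: `sheet_part` is assumed to be the first sheet.
    """
    with zipfile.ZipFile(path) as archive:
        # Text cells usually hold an index into a shared string table.
        shared = []
        if "xl/sharedStrings.xml" in archive.namelist():
            root = ET.fromstring(archive.read("xl/sharedStrings.xml"))
            for item in root.findall("m:si", NS):
                # One string item may be split into several <t> runs.
                shared.append("".join(t.text or "" for t in item.iter(T_TAG)))

        sheet = ET.fromstring(archive.read(sheet_part))
        rows = []
        for row in sheet.iterfind("m:sheetData/m:row", NS):
            values = []
            for cell in row.findall("m:c", NS):
                cell_type = cell.get("t")
                if cell_type == "inlineStr":
                    # Inline strings carry their text inside the cell itself.
                    values.append("".join(t.text or "" for t in cell.iter(T_TAG)))
                    continue
                v = cell.find("m:v", NS)
                text = v.text if v is not None and v.text is not None else ""
                if cell_type == "s" and text:
                    text = shared[int(text)]  # resolve shared-string index
                values.append(text)
            rows.append(values)
        return rows
```

This sketch returns every value as a raw string (numbers and date serials are not converted), and it skips cells that are absent from the XML, so sparse rows may not line up by column.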


Author: jake wong

Updated on June 04, 2022

Comments

  • jake wong
    jake wong over 1 year

    I am trying to read an Excel file in Python without using pandas or xlrd, and I have been trying to decode the resulting bytes as UTF-8 without any success.

    data from xls file

    colA    colB    colC
    spc     1D0     20190705
    spd     1D0     20190705
    spe     1D0     20190705
    ... (goes on for 500k lines)
    

    code

    with open(file, 'rb') as f:
        data = f.readlines(1)  # Just to check the first line that is printed out
        print(data[0].decode('utf-8'))
    

    The error I receive is UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte

    If I were to print data without decoding it, the result is: [b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00>\x00\x03\x00\xfe\xff\t\x00\x06\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x9e\x00\x00\x00\x9dN\x00\x00\x00\x00\x00\x00\x00\x10\x00\x00\xfe\xff\xff\xff\x00\x00\x00\x00\xfeM\x00\x00\x01\x00\x00\x00\xffM\x00\x00\x00N\x00\x00\x01N\x00\x00\x02N\x00\x00\x03N\x00\x00\x04N\x00\x00\x05N\x00\x00\x06N\x00\x00\x07N\x00\x00\x08N\x00\x00\tN\x00\x00\n']

    I have no particular reason for avoiding pandas or xlrd; I am just trying to parse the data using only the standard library, if that is possible.

    Any thoughts?

    • amanb
      amanb over 4 years
      The error tells you there is a specific character in the Excel file that cannot be decoded with 'utf-8'. Try using a different encoding, but it's still not known what sort of characters may be lurking around in the doc. Perhaps you should give pandas a try: pd.read_excel(file) and see what you get.
    • lenz
      lenz over 4 years
      Excel is a binary format, not plain-text. If you don't want to use xlrd or pd.read_excel, you'll have to reimplement what those libraries do.
    • John Y
      John Y over 4 years
      Even if you want to parse .xlsx files, which are considerably easier than .xls, you still have quite a bit of work to do. I guess you are doing this as a learning exercise? If so, then I think you should take a look at this question to find out where to read about the .xlsx specifications. If you are truly trying to learn about .xls files, I urge you to reconsider. There are plenty of other things you could be learning about that are more useful and less painful.
  • lenz
    lenz over 4 years
    Ideally, you should show some code for how to do this (e.g. using the standard-library zipfile module) and then how to proceed once the xlsx archive is unpacked (which file to process, how to access the data of a cell, etc.).
  • pygri
    pygri over 4 years
    It would probably be wise to wait for confirmation that xlsx is indeed the format the OP is trying to read before embarking on such an enterprise...
  • Eiríkr Útlendi
    Eiríkr Útlendi over 3 years
    See also this comment in another thread, presenting a solution to reading an `*.xlsx` Excel file using just standard library functionality.
  • George Crowther
    George Crowther almost 2 years
    From the description the OP has given (though they have not been specific), this does not appear to answer the question posed. Your solution is for a text-based file, whereas the OP appears to be struggling with an (assumed) .xls or .xlsx file.
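One way to settle which format is involved is to check the file's first bytes. The dump quoted in the question begins with `d0 cf 11 e0 a1 b1 1a e1`, which is the signature of the OLE2 compound file container used by legacy .xls workbooks, whereas an .xlsx file is a ZIP archive and normally begins with `PK\x03\x04`. The sketch below only inspects that signature; the function names `sniff_container` and `sniff_file` are hypothetical, and a match indicates the container type rather than proving the file is a workbook.

```python
# Signature of the OLE2 compound file container (legacy .xls, among others).
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")
# Local file header signature at the start of an ordinary ZIP archive (.xlsx).
ZIP_MAGIC = b"PK\x03\x04"


def sniff_container(header: bytes) -> str:
    """Classify a file by its leading bytes: 'ole2', 'zip' or 'unknown'."""
    if header.startswith(OLE2_MAGIC):
        return "ole2"  # likely a legacy binary .xls
    if header.startswith(ZIP_MAGIC):
        return "zip"  # likely an .xlsx (or some other ZIP-based file)
    return "unknown"  # possibly plain text such as CSV or TSV


def sniff_file(path) -> str:
    """Read only the first 8 bytes of `path` and classify them."""
    with open(path, "rb") as f:
        return sniff_container(f.read(8))
```

Applied to the bytes shown in the question, `sniff_container` reports `'ole2'`, which suggests a binary .xls that neither UTF-8 decoding nor unzipping will turn into readable text.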