Read in .xlsx with csv module in python

40,382

Solution 1

You cannot use Python's csv library for reading xlsx formatted files. You need to install and use a different library. For example, you could use openpyxl as follows:

import openpyxl

wb = openpyxl.load_workbook("filelocation.xlsx")
ws = wb.active

for row in ws.iter_rows(values_only=True):
    print(row)

This would display all of the rows in the file as lists of row values. The Python Excel website gives other possible examples.


Alternatively you could create a list of rows:

import openpyxl

wb = openpyxl.load_workbook("filelocation.xlsx")
ws = wb.active

data = list(ws.iter_rows(values_only=True))

print(data)

Note: If you are using the older Excel format .xls, you could instead use the xlrd library. This no longer supports the .xlsx format though.

import xlrd

workbook = xlrd.open_workbook("filelocation.xlsx")
sheet = workbook.sheet_by_index(0)
data = [sheet.row_values(rowx) for rowx in range(sheet.nrows)]

print(data)

Solution 2

Here's a very very rough implementation using just the standard library.

def xlsx(fname, sheet=1):
    import zipfile
    from xml.etree.ElementTree import iterparse
    z = zipfile.ZipFile(fname)
    strings = [el.text for e, el in iterparse(z.open('xl/sharedStrings.xml')) if el.tag.endswith('}t')]
    rows = []
    row = {}
    value = ''
    for e, el in iterparse(z.open('xl/worksheets/sheet%s.xml' % sheet)):
        if el.tag.endswith('}v'):  # <v>84</v>
            value = el.text
        if el.tag.endswith('}c'):  # <c r="A3" t="s"><v>84</v></c>
            if el.attrib.get('t') == 's':
                value = strings[int(value)]
            column_name = ''.join(x for x in el.attrib['r'] if not x.isdigit())  # AZ22
            row[column_name] = value
            value = ''
        if el.tag.endswith('}row'):
            rows.append(row)
            row = {}
    return rows

(This is copied from a deleted question: https://stackoverflow.com/questions/4371163/reading-xlsx-files-using-python )

Solution 3

Here's a very very rough implementation using just the standard library.

def xlsx(fname):
    import zipfile
    from xml.etree.ElementTree import iterparse
    z = zipfile.ZipFile(fname)
    strings = [el.text for e, el in iterparse(z.open('xl/sharedStrings.xml')) if el.tag.endswith('}t')]
    rows = []
    row = {}
    value = ''
    for e, el in iterparse(z.open('xl/worksheets/sheet1.xml')):
        if el.tag.endswith('}v'):  # <v>84</v>
            value = el.text
        if el.tag.endswith('}c'):  # <c r="A3" t="s"><v>84</v></c>
            if el.attrib.get('t') == 's':
                value = strings[int(value)]
            letter = el.attrib['r'] # AZ22
            while letter[-1].isdigit():
                letter = letter[:-1]
            row[letter] = value
            value = ''
        if el.tag.endswith('}row'):
            rows.append(row)
            row = {}
    return rows

This answer is copied from a deleted question: https://stackoverflow.com/a/22067980/131881

Share:
40,382
pHorseSpec
Author by

pHorseSpec

IT Student. Gymnast. Musician. Human

Updated on July 09, 2022

Comments

  • pHorseSpec
    pHorseSpec almost 2 years

    I'm trying to read in an excel file with .xlsx formatting with the csv module, but I'm not having any luck with it when using an excel file even with my dialect and encoding specified. Below, I show my different attempts and error results with the different encodings I tried. If anyone could point me into the correct coding, syntax or module I could use to read in a .xlsx file in Python, I'd appreciate it.

    With the below code, I get the following error: _csv.Error: line contains NULL byte

    #!/usr/bin/python
    
    import sys, csv
    
    with open('filelocation.xlsx', "r+", encoding="Latin1")  as inputFile:
        csvReader = csv.reader(inputFile, dialect='excel')
        for row in csvReader:
            print(row)
    

    With the below code, I get the following error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 16: invalid continuation byte

    #!/usr/bin/python
    
    import sys, csv
    
    with open('filelocation.xlsx', "r+", encoding="Latin1")  as inputFile:
        csvReader = csv.reader(inputFile, dialect='excel')
        for row in csvReader:
            print(row)
    

    When I use utf-16 in the encoding, I get the following error: UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 570-571: illegal UTF-16 surrogate

  • John Y
    John Y about 8 years
    I like that you recommended xlrd, as I believe it's the best Excel reader. But if you want a list of the values in a row, that is more easily (and probably more efficiently) accomplished by simply cols = sheet.row_values(row) instead of your list comprehension. Also, I recommend using some other name for the row index (I'm partial to rx; you'll see a lot of examples use rowx) because the name row so often refers to a row object.
  • Shep Sims
    Shep Sims about 3 years
    You can no longer use xlrd on xlsx files <a href="url">link text</a><br><br> To read xlsx files either <br><br> (not recommended unless you have an outstanding reason to stick with xlrd)<br> convert the file into xls manually before processing<br><br> (recommended)<br> use the newer openpyxl python module as recommended <a href="python-excel.org/">here</a> (also recommended)<br> use pandas excel_file module