Read in .xlsx with csv module in python
Solution 1
You cannot use Python's csv
library for reading xlsx
formatted files. You need to install and use a different library. For example, you could use openpyxl
as follows:
import openpyxl
wb = openpyxl.load_workbook("filelocation.xlsx")
ws = wb.active
for row in ws.iter_rows(values_only=True):
print(row)
This would display all of the rows in the file as lists of row values. The Python Excel website gives other possible examples.
Alternatively you could create a list of rows:
import openpyxl
wb = openpyxl.load_workbook("filelocation.xlsx")
ws = wb.active
data = list(ws.iter_rows(values_only=True))
print(data)
Note: If you are using the older Excel format .xls
, you could instead use the xlrd
library. This no longer supports the .xlsx
format though.
import xlrd
workbook = xlrd.open_workbook("filelocation.xlsx")
sheet = workbook.sheet_by_index(0)
data = [sheet.row_values(rowx) for rowx in range(sheet.nrows)]
print(data)
Solution 2
Here's a very very rough implementation using just the standard library.
def xlsx(fname, sheet=1):
import zipfile
from xml.etree.ElementTree import iterparse
z = zipfile.ZipFile(fname)
strings = [el.text for e, el in iterparse(z.open('xl/sharedStrings.xml')) if el.tag.endswith('}t')]
rows = []
row = {}
value = ''
for e, el in iterparse(z.open('xl/worksheets/sheet%s.xml' % sheet)):
if el.tag.endswith('}v'): # <v>84</v>
value = el.text
if el.tag.endswith('}c'): # <c r="A3" t="s"><v>84</v></c>
if el.attrib.get('t') == 's':
value = strings[int(value)]
column_name = ''.join(x for x in el.attrib['r'] if not x.isdigit()) # AZ22
row[column_name] = value
value = ''
if el.tag.endswith('}row'):
rows.append(row)
row = {}
return rows
(This is copied from a deleted question: https://stackoverflow.com/questions/4371163/reading-xlsx-files-using-python )
Solution 3
Here's a very very rough implementation using just the standard library.
def xlsx(fname):
import zipfile
from xml.etree.ElementTree import iterparse
z = zipfile.ZipFile(fname)
strings = [el.text for e, el in iterparse(z.open('xl/sharedStrings.xml')) if el.tag.endswith('}t')]
rows = []
row = {}
value = ''
for e, el in iterparse(z.open('xl/worksheets/sheet1.xml')):
if el.tag.endswith('}v'): # <v>84</v>
value = el.text
if el.tag.endswith('}c'): # <c r="A3" t="s"><v>84</v></c>
if el.attrib.get('t') == 's':
value = strings[int(value)]
letter = el.attrib['r'] # AZ22
while letter[-1].isdigit():
letter = letter[:-1]
row[letter] = value
value = ''
if el.tag.endswith('}row'):
rows.append(row)
row = {}
return rows
This answer is copied from a deleted question: https://stackoverflow.com/a/22067980/131881
Comments
-
pHorseSpec almost 2 years
I'm trying to read in an excel file with .xlsx formatting with the
csv
module, but I'm not having any luck with it when using an excel file even with my dialect and encoding specified. Below, I show my different attempts and error results with the different encodings I tried. If anyone could point me into the correct coding, syntax or module I could use to read in a .xlsx file in Python, I'd appreciate it.With the below code, I get the following error:
_csv.Error: line contains NULL byte
#!/usr/bin/python import sys, csv with open('filelocation.xlsx', "r+", encoding="Latin1") as inputFile: csvReader = csv.reader(inputFile, dialect='excel') for row in csvReader: print(row)
With the below code, I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 16: invalid continuation byte
#!/usr/bin/python import sys, csv with open('filelocation.xlsx', "r+", encoding="Latin1") as inputFile: csvReader = csv.reader(inputFile, dialect='excel') for row in csvReader: print(row)
When I use
utf-16
in theencoding
, I get the following error:UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 570-571: illegal UTF-16 surrogate
-
John Y about 8 yearsI like that you recommended
xlrd
, as I believe it's the best Excel reader. But if you want a list of the values in a row, that is more easily (and probably more efficiently) accomplished by simplycols = sheet.row_values(row)
instead of your list comprehension. Also, I recommend using some other name for the row index (I'm partial torx
; you'll see a lot of examples userowx
) because the namerow
so often refers to a row object. -
Shep Sims about 3 yearsYou can no longer use xlrd on xlsx files <a href="url">link text</a><br><br> To read xlsx files either <br><br> (not recommended unless you have an outstanding reason to stick with xlrd)<br> convert the file into xls manually before processing<br><br> (recommended)<br> use the newer openpyxl python module as recommended <a href="python-excel.org/">here</a> (also recommended)<br> use pandas excel_file module