Reading an Excel file is orders of magnitude slower using openpyxl compared to xlrd


You can just iterate over the sheet:

import openpyxl

def UseOpenpyxl(file_name):
    # read_only=True lets openpyxl stream the sheet row by row instead of loading it all
    wb = openpyxl.load_workbook(file_name, read_only=True)
    sheet = wb.active
    rows = sheet.rows
    first_row = [cell.value for cell in next(rows)]
    data = []
    for row in rows:
        record = {}
        for key, cell in zip(first_row, row):
            if cell.data_type == 's':
                record[key] = cell.value.strip()
            else:
                record[key] = cell.value
        data.append(record)
    return data

This should scale to large files. You may want to chunk your result if the data list gets too large.
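
A minimal sketch of such chunked reading (the generator name and the chunk_size of 10,000 rows are arbitrary choices, not part of the original answer):

import openpyxl

def iter_chunks(file_name, chunk_size=10000):
    wb = openpyxl.load_workbook(file_name, read_only=True)
    sheet = wb.active
    rows = sheet.rows
    first_row = [cell.value for cell in next(rows)]  # header row
    chunk = []
    for row in rows:
        chunk.append({key: cell.value for key, cell in zip(first_row, row)})
        if len(chunk) >= chunk_size:
            yield chunk      # hand back one batch of records
            chunk = []
    if chunk:
        yield chunk          # leftover records that don't fill a full chunk

Each yielded chunk can then be written to the database before the next one is read, so memory use stays bounded.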

Now the openpyxl version takes about twice as long as the xlrd one:

%timeit xlrd_results = UseXlrd('foo.xlsx')
1 loops, best of 3: 3.38 s per loop

%timeit openpyxl_results = UseOpenpyxl('foo.xlsx')
1 loops, best of 3: 6.87 s per loop

Note that xlrd and openpyxl might interpret what is an integer and what is a float slightly differently. For my test data, I needed to add float() to make the outputs comparable:

def UseOpenpyxl(file_name):
    wb = openpyxl.load_workbook(file_name, read_only=True)
    sheet = wb.active
    rows = sheet.rows
    first_row = [float(cell.value) for cell in next(rows)]
    data = []
    for row in rows:
        record = {}
        for key, cell in zip(first_row, row):
            if cell.data_type == 's':
                record[key] = cell.value.strip()
            else:
                record[key] = float(cell.value)
        data.append(record)
    return data

Now, both versions give the same results for my test data:

>>> xlrd_results == openpyxl_results
True

Author: Ron Johnson

Started my career as a developer, then moved to networking and systems administration. Never stopped looking for opportunities to use coding to help with menial tasks, but over the past 3 years it seems that I've come full circle and do more coding than anything now. Passionate about data analytics and visualization.

Updated on February 15, 2020

Comments

  • Ron Johnson over 4 years

    I have an Excel spreadsheet that I need to import into SQL Server on a daily basis. The spreadsheet will contain around 250,000 rows across around 50 columns. I have tested both openpyxl and xlrd using nearly identical code.

    Here's the code I'm using (minus debugging statements):

    import xlrd
    import openpyxl
    
    def UseXlrd(file_name):
        workbook = xlrd.open_workbook(file_name, on_demand=True)
        worksheet = workbook.sheet_by_index(0)
        first_row = []
        for col in range(worksheet.ncols):
            first_row.append(worksheet.cell_value(0,col))
        data = []
        for row in range(1, worksheet.nrows):
            record = {}
            for col in range(worksheet.ncols):
                if isinstance(worksheet.cell_value(row,col), str):
                    record[first_row[col]] = worksheet.cell_value(row,col).strip()
                else:
                    record[first_row[col]] = worksheet.cell_value(row,col)
            data.append(record)
        return data
    
    
    def UseOpenpyxl(file_name):
        wb = openpyxl.load_workbook(file_name, read_only=True)
        sheet = wb.active
        first_row = []
        for col in range(1,sheet.max_column+1):
            first_row.append(sheet.cell(row=1,column=col).value)
        data = []
        for r in range(2,sheet.max_row+1):
            record = {}
            for col in range(sheet.max_column):
                if isinstance(sheet.cell(row=r,column=col+1).value, str):
                    record[first_row[col]] = sheet.cell(row=r,column=col+1).value.strip()
                else:
                    record[first_row[col]] = sheet.cell(row=r,column=col+1).value
            data.append(record)
        return data
    
    xlrd_results = UseXlrd('foo.xlsx')
    openpyxl_results = UseOpenpyxl('foo.xlsx')
    

    Passing the same Excel file containing 3500 rows gives drastically different run times. Using xlrd I can read the entire file into a list of dictionaries in under 2 seconds. Using openpyxl I get the following results:

    Reading Excel File...
    Read 100 lines in 114.14509415626526 seconds
    Read 200 lines in 471.43183994293213 seconds
    Read 300 lines in 982.5288782119751 seconds
    Read 400 lines in 1729.3348784446716 seconds
    Read 500 lines in 2774.886833190918 seconds
    Read 600 lines in 4384.074863195419 seconds
    Read 700 lines in 6396.7723388671875 seconds
    Read 800 lines in 7998.775000572205 seconds
    Read 900 lines in 11018.460735321045 seconds
    

    While I can use xlrd in the final script, I will have to hard-code a lot of formatting because of various issues (e.g. an int reads as a float, a date reads as an int, a datetime reads as a float). Since I need to reuse this code for a few more imports, it doesn't make sense to hard-code specific columns just to format them properly and then maintain similar code across 4 different scripts.

    Any advice on how to proceed?
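
    For reference, the kind of per-column hard-coding described above looks roughly like this with xlrd (a minimal sketch; the helper name and the assumption that date cells are detected via XL_CELL_DATE are illustrative, not taken from the original code):

    import xlrd

    def convert_cell(workbook, worksheet, row, col):
        # Map xlrd's raw values back to the Python types you actually want.
        value = worksheet.cell_value(row, col)
        ctype = worksheet.cell_type(row, col)
        if ctype == xlrd.XL_CELL_DATE:
            # dates and datetimes arrive as floats; convert using the workbook's date mode
            return xlrd.xldate_as_datetime(value, workbook.datemode)
        if ctype == xlrd.XL_CELL_NUMBER and float(value).is_integer():
            # whole numbers arrive as floats; cast them back to int
            return int(value)
        if isinstance(value, str):
            return value.strip()
        return value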

    • Charlie Clark about 8 years
      Mike has already provided the solution but here's the reason for the poor performance: the way you're accessing cells is causing openpyxl to repeatedly parse the original spreadsheet. Read-only mode is optimised for row-by-row access.
    • MaxU - stop genocide of UA about 8 years
      When I read your description "I have an Excel spreadsheet that I need to import into SQL Server on a daily basis", it sounds like a perfect candidate for Pandas: read about the pandas.read_excel() and pandas.DataFrame.to_sql() functions. And AFAIK Pandas uses xlrd internally.
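
      A minimal sketch of that Pandas pipeline (the table name, the connection string, and the if_exists policy are placeholders, not details from the question):

      import pandas as pd
      from sqlalchemy import create_engine

      df = pd.read_excel('foo.xlsx')    # pandas picks an Excel engine internally
      engine = create_engine('mssql+pyodbc://user:password@my_dsn')    # placeholder SQL Server connection
      df.to_sql('imported_rows', engine, if_exists='append', index=False)
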
    • Brandon Kuczenski over 2 years
      To follow up on Charlie Clark's answer, the source of the behavior is the use of max_column, which is implemented in an inefficient way, inside a loop. See: foss.heptapod.net/openpyxl/openpyxl/-/issues/1587
  • Charlie Clark about 8 years
    Actually, you can just iterate over the sheet. Furthermore, openpyxl already does the type conversion so you can check the cell data_type.
  • Charlie Clark about 8 years
    Also, it's probably worth noting that xlrd must read a file into memory, whereas openpyxl in read-only mode will allow you to stream row-by-row.
  • Charlie Clark about 8 years
    Pandas uses xlrd internally and is pretty inflexible as a result. Note that this is of particular concern to the original poster.
  • Charlie Clark about 8 years
    You might also see some performance improvements if you were testing with v2.0. The last time I compared the two I found openpyxl to be only slightly slower than xlrd: it's doing more and in constant memory.
  • Mike Müller about 8 years
    I am working with 2.3.2 on Python 3.5. This is the latest version I can currently get via conda.
  • Charlie Clark about 8 years
    Okay. Only minor changes in 2.3.3.
  • Ibo over 6 years
    I agree with Charlie Clark. I just wanted to mention that openpyxl has support for Pandas DataFrames: you can use the DataFrame() constructor from the Pandas package to put the values of a sheet into a DataFrame (import pandas as pd and then df = pd.DataFrame(sheet.values)). Using both libraries to import and then work on the data is a better idea than trying to choose just one.
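
    A minimal sketch of that approach, keeping the first worksheet row as the column headers (the header handling is an assumption about the sheet layout, not something stated in the comment):

    import openpyxl
    import pandas as pd

    wb = openpyxl.load_workbook('foo.xlsx', read_only=True)
    sheet = wb.active
    values = sheet.values        # generator of plain value tuples, one per row
    columns = next(values)       # treat the first row as the header
    df = pd.DataFrame(values, columns=columns)
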
  • ohthepain about 5 years
    I switched to iterating and found it to be orders of magnitude faster than before.