How to stop reading a spreadsheet with xlrd at the first empty row?

24,300

To check if a single cell is empty, inspect its ctype attribute. To check an entire row, use the all function on a list comprehension:

workbook = xlrd.open_workbook(filepath)
sheet = workbook.sheets()[0]
rows = sheet.get_rows()
next(rows) # skip first row
for row in rows:
    if all([cell.ctype in (xlrd.XL_CELL_EMPTY, xlrd.XL_CELL_BLANK)
            for cell in row]):
        break
    # process this non-empty row here...
Share:
24,300
auslander
Author by

auslander

Geospatial analyst/devop with a preference for Esri but pragmatic enough to use open source where it's faster and easier.

Updated on September 11, 2020

Comments

  • auslander
    auslander over 3 years

    I'm using xlrd to sniff through a directory structure and pull out spreadsheets, reading the second row (row 1) down to "do stuff." The issue is that I don't know how to stop the reading/printing at the first empty row. I understand that rows are not "empty" objects, but I'd appreciate a little help showing how to check if all cells are empty. Here is the code I'm working with:

    import xlrd
    import os
    
    def excel_file_filter(filename, extensions=['.xls', '.xlsx']):
        return any(filename.endswith(e) for e in extensions)
    
    def get_filenames(root):
        filename_list = []
        for path, subdirs, files in os.walk(root):
            for filename in filter(excel_file_filter, files):
                filename_list.append(os.path.join(path, filename))
        return filename_list
    
    spreadsheets = get_filenames('C:\\Temp')
    for s in spreadsheets:
        with xlrd.open_workbook(s) as wb:
            cs = wb.sheet_by_index(0)
            num_cols = cs.ncols
            for row_index in range(1, cs.nrows):
                print('Row: {}'.format(row_index))
                for col_index in range(0, num_cols):
                    cell_object = cs.cell(row_index, col_index)
                    if cell_obj is not xlrd.empty_cell:
                        print('Col #: {} | Value: {}'.format(col_index, cell_obj))
    

    What ends up happening is that it prints all the way through almost 1000 rows, when only the first say, 25 rows have content in them. The amount of content between the spreadsheets varies, so a generic solution (that does not rely on other optional libraries) that helps me understand how to detect an empty row and then break, would be appreciated.