Programmatically extract data from an Excel spreadsheet

17,898

Solution 1

Maybe xlrd will do the Job (in Python)

edit: I should really learn to read questions. But writing csv shouldn't be a huge problem so maybe you can actually use it.

Solution 2

for ruby, the spreadsheet gem is excellent to read write modify, ...excell files

https://github.com/zdavatz/spreadsheet

Solution 3

You can use pyexcelerator in python.

This code (included in the examples folder of pyexcelerator as xls2csv.py) extracts all sheets from the spreadsheets and outputs them to stdout as CSV.

You can easily change the code to do what you want.

The cool thing about pyexcelerator is that you can also use it to write/create excel xls files, without having excel installed.

#!/usr/bin/env python
# -*- coding: windows-1251 -*-
# Copyright (C) 2005 Kiseliov Roman

__rev_id__ = """$Id: xls2csv.py,v 1.1 2005/05/19 09:27:42 rvk Exp $"""


from pyExcelerator import *
import sys

me, args = sys.argv[0], sys.argv[1:]


if args:
    for arg in args:
        print >>sys.stderr, 'extracting data from', arg
        for sheet_name, values in parse_xls(arg, 'cp1251'): # parse_xls(arg) -- default encoding
            matrix = [[]]
            print 'Sheet = "%s"' % sheet_name.encode('cp866', 'backslashreplace')
            print '----------------'
            for row_idx, col_idx in sorted(values.keys()):
                v = values[(row_idx, col_idx)]
                if isinstance(v, unicode):
                    v = v.encode('cp866', 'backslashreplace')
                else:
                    v = str(v)
                last_row, last_col = len(matrix), len(matrix[-1])
                while last_row < row_idx:
                    matrix.extend([[]])
                    last_row = len(matrix)

                while last_col < col_idx:
                    matrix[-1].extend([''])
                    last_col = len(matrix[-1])

                matrix[-1].extend([v])

            for row in matrix:
                csv_row = ','.join(row)
                print csv_row

else:
    print 'usage: %s (inputfile)+' % me

Solution 4

This is quite late to the game, but I thought I'd add another option via Ruby using the gem "roo":

    require 'rubygems'
    require 'roo'

    my_excel_file = Excelx.new("path/to/my_excel_file.xlsx")
    my_excel_file.default_sheet = my_excel_file.sheets[2]
    my_excel_file.to_csv("path/to/my_excel_file.csv")

Solution 5

In Ruby, here is the code I use: (requires the excellent parseexcel gem) require 'parseexcel'

def excelGetSheet(worksheet)
    sheet=Array.new
    worksheet.each { |row|
      if row != nil   # empty row?
        cells=Array.new
        j=0
        row.each { |cell|
          cells << cell.to_s('latin1')  unless cell == nil
          j=j+1
        }
        sheet << cells
      end
    }
    return sheet
end

workbook = Spreadsheet::ParseExcel.parse("MyExcelFile.xls")
sheet1 = excelGetSheet(workbook.worksheet(0))

puts sheet1.inspect
Share:
17,898
dreeves
Author by

dreeves

Startup: Beeminder.com Blog: MessyMatters.com Homepage: Dreev.es Twitter.com/dreev Favorite programming language: Mathematica Random fact: Dreeves is an ultra-marathon inline skater

Updated on June 17, 2022

Comments

  • dreeves
    dreeves almost 2 years

    Is there a simple way, using some common Unix scripting language (Perl/Python/Ruby) or command line utility, to convert an Excel Spreadsheet file to CSV? Specifically, this one:

    http://www.econ.yale.edu/~shiller/data/ie_data.xls

    And specifically the third sheet of that spreadsheet (the first two being charts).

  • John Machin
    John Machin over 14 years
    pyXLreader appears not to be have been maintained since 2005, has an astonishing outstanding bug in the sourceforge tracker (omits fractions, all numbers presented as integers) plus a few more problems I found in a few minutes of playing with it (crashes on large files, sometimes prints "FCUKDATE" (no kidding!) when there are dates in the data, time taken is at least 6 times that of xlrd and is possibly O(N**2), memory used is at least 7 times that of xlrd). Do you use it very much yourself?
  • John Machin
    John Machin over 14 years
    What are those 'cp1251' and 'cp866' doing in there? What will happen if there are dates in the data? Why does it use alist.extend([something]) instead of alist.append(something)? Why does it use ','.join(row) instead of using the csv module?? Are you serious???
  • nosklo
    nosklo over 14 years
    @John: I just copied the example of pyexcelerator. I think the point of the example is reading the xls file, not the csv generation part. It works. You can modify and improve it as needed.
  • John Machin
    John Machin over 14 years
    A review? I could't find such a thing using google. Please post the link.
  • dreeves
    dreeves over 14 years
    Right, outputting as CSV is no big deal. Thanks!
  • dreeves
    dreeves over 14 years
    Oh, I didn't mean it as a Language Wars question or anything. Just needed to get that task done and was happy to use whatever language made it easy.
  • poseid
    poseid about 14 years
    more explanations is also found here: codeweblog.com/parseexcel-ruby-used-to-read-excel
  • Alfabravo
    Alfabravo over 11 years
    And how to choose third sheet?
  • vikjam
    vikjam over 11 years
    Ah fair point. Before the "to_csv" line, set the default sheet to the third sheet, i.e. my_excel_file.default_sheet = my_excel_file.sheets[2]