How to open a huge excel file efficiently

22,253

Solution 1

Well, if your excel is going to be as simple as a CSV file like your example (https://drive.google.com/file/d/0B_CXvCTOo7_2UVZxbnpRaEVnaFk/view?usp=sharing), you can try to open the file as a zip file and read directly every xml:

Intel i5 4460, 12 GB RAM, SSD Samsung EVO PRO.

If you have a lot of memory ram: This code needs a lot of ram, but it takes 20~25 seconds. (You need the parameter -Xmx7g)

package com.devsaki.opensimpleexcel;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.ZipFile;

public class Multithread {

    public static final char CHAR_END = (char) -1;

    public static void main(String[] args) throws IOException, ExecutionException, InterruptedException {
        String excelFile = "C:/Downloads/BigSpreadsheetAllTypes.xlsx";
        ZipFile zipFile = new ZipFile(excelFile);
        long init = System.currentTimeMillis();
        ExecutorService executor = Executors.newFixedThreadPool(4);
        char[] sheet1 = readEntry(zipFile, "xl/worksheets/sheet1.xml").toCharArray();
        Future<Object[][]> futureSheet1 = executor.submit(() -> processSheet1(new CharReader(sheet1), executor));
        char[] sharedString = readEntry(zipFile, "xl/sharedStrings.xml").toCharArray();
        Future<String[]> futureWords = executor.submit(() -> processSharedStrings(new CharReader(sharedString)));

        Object[][] sheet = futureSheet1.get();
        String[] words = futureWords.get();

        executor.shutdown();

        long end = System.currentTimeMillis();
        System.out.println("only read: " + (end - init) / 1000);

        ///Doing somethin with the file::Saving as csv
        init = System.currentTimeMillis();
        try (PrintWriter writer = new PrintWriter(excelFile + ".csv", "UTF-8");) {
            for (Object[] rows : sheet) {
                for (Object cell : rows) {
                    if (cell != null) {
                        if (cell instanceof Integer) {
                            writer.append(words[(Integer) cell]);
                        } else if (cell instanceof String) {
                            writer.append(toDate(Double.parseDouble(cell.toString())));
                        } else {
                            writer.append(cell.toString()); //Probably a number
                        }
                    }
                    writer.append(";");
                }
                writer.append("\n");
            }
        }
        end = System.currentTimeMillis();
        System.out.println("Main saving to csv: " + (end - init) / 1000);
    }

    private static final DateTimeFormatter formatter = DateTimeFormatter.ISO_DATE_TIME;
    private static final LocalDateTime INIT_DATE = LocalDateTime.parse("1900-01-01T00:00:00+00:00", formatter).plusDays(-2);

    //The number in excel is from 1900-jan-1, so every number time that you get, you have to sum to that date
    public static String toDate(double s) {
        return formatter.format(INIT_DATE.plusSeconds((long) ((s*24*3600))));
    }

    public static String readEntry(ZipFile zipFile, String entry) throws IOException {
        System.out.println("Initialing readEntry " + entry);
        long init = System.currentTimeMillis();
        String result = null;

        try (BufferedReader br = new BufferedReader(new InputStreamReader(zipFile.getInputStream(zipFile.getEntry(entry)), Charset.forName("UTF-8")))) {
            br.readLine();
            result = br.readLine();
        }

        long end = System.currentTimeMillis();
        System.out.println("readEntry '" + entry + "': " + (end - init) / 1000);
        return result;
    }


    public static String[] processSharedStrings(CharReader br) throws IOException {
        System.out.println("Initialing processSharedStrings");
        long init = System.currentTimeMillis();
        String[] words = null;
        char[] wordCount = "Count=\"".toCharArray();
        char[] token = "<t>".toCharArray();
        String uniqueCount = extractNextValue(br, wordCount, '"');
        words = new String[Integer.parseInt(uniqueCount)];
        String nextWord;
        int currentIndex = 0;
        while ((nextWord = extractNextValue(br, token, '<')) != null) {
            words[currentIndex++] = nextWord;
            br.skip(11); //you can skip at least 11 chars "/t></si><si>"
        }
        long end = System.currentTimeMillis();
        System.out.println("SharedStrings: " + (end - init) / 1000);
        return words;
    }


    public static Object[][] processSheet1(CharReader br, ExecutorService executorService) throws IOException, ExecutionException, InterruptedException {
        System.out.println("Initialing processSheet1");
        long init = System.currentTimeMillis();
        char[] dimensionToken = "dimension ref=\"".toCharArray();
        String dimension = extractNextValue(br, dimensionToken, '"');
        int[] sizes = extractSizeFromDimention(dimension.split(":")[1]);
        br.skip(30); //Between dimension and next tag c exists more or less 30 chars
        Object[][] result = new Object[sizes[0]][sizes[1]];

        int parallelProcess = 8;
        int currentIndex = br.currentIndex;
        CharReader[] charReaders = new CharReader[parallelProcess];
        int totalChars = Math.round(br.chars.length / parallelProcess);
        for (int i = 0; i < parallelProcess; i++) {
            int endIndex = currentIndex + totalChars;
            charReaders[i] = new CharReader(br.chars, currentIndex, endIndex, i);
            currentIndex = endIndex;
        }
        Future[] futures = new Future[parallelProcess];
        for (int i = charReaders.length - 1; i >= 0; i--) {
            final int j = i;
            futures[i] = executorService.submit(() -> inParallelProcess(charReaders[j], j == 0 ? null : charReaders[j - 1], result));
        }
        for (Future future : futures) {
            future.get();
        }

        long end = System.currentTimeMillis();
        System.out.println("Sheet1: " + (end - init) / 1000);
        return result;
    }

    public static void inParallelProcess(CharReader br, CharReader back, Object[][] result) {
        System.out.println("Initialing inParallelProcess : " + br.identifier);

        char[] tokenOpenC = "<c r=\"".toCharArray();
        char[] tokenOpenV = "<v>".toCharArray();

        char[] tokenAttributS = " s=\"".toCharArray();
        char[] tokenAttributT = " t=\"".toCharArray();

        String v;
        int firstCurrentIndex = br.currentIndex;
        boolean first = true;

        while ((v = extractNextValue(br, tokenOpenC, '"')) != null) {
            if (first && back != null) {
                int sum = br.currentIndex - firstCurrentIndex - tokenOpenC.length - v.length() - 1;
                first = false;
                System.out.println("Adding to : " + back.identifier + " From : " + br.identifier);
                back.plusLength(sum);
            }
            int[] indexes = extractSizeFromDimention(v);

            int s = foundNextTokens(br, '>', tokenAttributS, tokenAttributT);
            char type = 's'; //3 types: number (n), string (s) and date (d)
            if (s == 0) { // Token S = number or date
                char read = br.read();
                if (read == '1') {
                    type = 'n';
                } else {
                    type = 'd';
                }
            } else if (s == -1) {
                type = 'n';
            }
            String c = extractNextValue(br, tokenOpenV, '<');
            Object value = null;
            switch (type) {
                case 'n':
                    value = Double.parseDouble(c);
                    break;
                case 's':
                    try {
                        value = Integer.parseInt(c);
                    } catch (Exception ex) {
                        System.out.println("Identifier Error : " + br.identifier);
                    }
                    break;
                case 'd':
                    value = c.toString();
                    break;
            }
            result[indexes[0] - 1][indexes[1] - 1] = value;
            br.skip(7); ///v></c>
        }
    }

    static class CharReader {
        char[] chars;
        int currentIndex;
        int length;

        int identifier;

        public CharReader(char[] chars) {
            this.chars = chars;
            this.length = chars.length;
        }

        public CharReader(char[] chars, int currentIndex, int length, int identifier) {
            this.chars = chars;
            this.currentIndex = currentIndex;
            if (length > chars.length) {
                this.length = chars.length;
            } else {
                this.length = length;
            }
            this.identifier = identifier;
        }

        public void plusLength(int n) {
            if (this.length + n <= chars.length) {
                this.length += n;
            }
        }

        public char read() {
            if (currentIndex >= length) {
                return CHAR_END;
            }
            return chars[currentIndex++];
        }

        public void skip(int n) {
            currentIndex += n;
        }
    }


    public static int[] extractSizeFromDimention(String dimention) {
        StringBuilder sb = new StringBuilder();
        int columns = 0;
        int rows = 0;
        for (char c : dimention.toCharArray()) {
            if (columns == 0) {
                if (Character.isDigit(c)) {
                    columns = convertExcelIndex(sb.toString());
                    sb = new StringBuilder();
                }
            }
            sb.append(c);
        }
        rows = Integer.parseInt(sb.toString());
        return new int[]{rows, columns};
    }

    public static int foundNextTokens(CharReader br, char until, char[]... tokens) {
        char character;
        int[] indexes = new int[tokens.length];
        while ((character = br.read()) != CHAR_END) {
            if (character == until) {
                break;
            }
            for (int i = 0; i < indexes.length; i++) {
                if (tokens[i][indexes[i]] == character) {
                    indexes[i]++;
                    if (indexes[i] == tokens[i].length) {
                        return i;
                    }
                } else {
                    indexes[i] = 0;
                }
            }
        }

        return -1;
    }

    public static String extractNextValue(CharReader br, char[] token, char until) {
        char character;
        StringBuilder sb = new StringBuilder();
        int index = 0;

        while ((character = br.read()) != CHAR_END) {
            if (index == token.length) {
                if (character == until) {
                    return sb.toString();
                } else {
                    sb.append(character);
                }
            } else {
                if (token[index] == character) {
                    index++;
                } else {
                    index = 0;
                }
            }
        }
        return null;
    }

    public static int convertExcelIndex(String index) {
        int result = 0;
        for (char c : index.toCharArray()) {
            result = result * 26 + ((int) c - (int) 'A' + 1);
        }
        return result;
    }
}

Old answer (Not need the parameter Xms7g, so take less memory): It takes to open and read the example file about 35 seconds (200MB) with an HDD, with SDD takes a little less (30 seconds).

Here the code: https://github.com/csaki/OpenSimpleExcelFast.git

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.ZipFile;

public class Launcher {

    public static final char CHAR_END = (char) -1;

    public static void main(String[] args) throws IOException, ExecutionException, InterruptedException {
        long init = System.currentTimeMillis();
        String excelFile = "D:/Downloads/BigSpreadsheet.xlsx";
        ZipFile zipFile = new ZipFile(excelFile);

        ExecutorService executor = Executors.newFixedThreadPool(4);
        Future<String[]> futureWords = executor.submit(() -> processSharedStrings(zipFile));
        Future<Object[][]> futureSheet1 = executor.submit(() -> processSheet1(zipFile));
        String[] words = futureWords.get();
        Object[][] sheet1 = futureSheet1.get();
        executor.shutdown();

        long end = System.currentTimeMillis();
        System.out.println("Main only open and read: " + (end - init) / 1000);


        ///Doing somethin with the file::Saving as csv
        init = System.currentTimeMillis();
        try (PrintWriter writer = new PrintWriter(excelFile + ".csv", "UTF-8");) {
            for (Object[] rows : sheet1) {
                for (Object cell : rows) {
                    if (cell != null) {
                        if (cell instanceof Integer) {
                            writer.append(words[(Integer) cell]);
                        } else if (cell instanceof String) {
                            writer.append(toDate(Double.parseDouble(cell.toString())));
                        } else {
                            writer.append(cell.toString()); //Probably a number
                        }
                    }
                    writer.append(";");
                }
                writer.append("\n");
            }
        }
        end = System.currentTimeMillis();
        System.out.println("Main saving to csv: " + (end - init) / 1000);
    }

    private static final DateTimeFormatter formatter = DateTimeFormatter.ISO_DATE_TIME;
    private static final LocalDateTime INIT_DATE = LocalDateTime.parse("1900-01-01T00:00:00+00:00", formatter).plusDays(-2);

    //The number in excel is from 1900-jan-1, so every number time that you get, you have to sum to that date
    public static String toDate(double s) {
        return formatter.format(INIT_DATE.plusSeconds((long) ((s*24*3600))));
    }

    public static Object[][] processSheet1(ZipFile zipFile) throws IOException {
        String entry = "xl/worksheets/sheet1.xml";
        Object[][] result = null;
        char[] dimensionToken = "dimension ref=\"".toCharArray();
        char[] tokenOpenC = "<c r=\"".toCharArray();
        char[] tokenOpenV = "<v>".toCharArray();

        char[] tokenAttributS = " s=\"".toCharArray();
        char[] tokenAttributT = " t=\"".toCharArray();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(zipFile.getInputStream(zipFile.getEntry(entry)), Charset.forName("UTF-8")))) {
            String dimension = extractNextValue(br, dimensionToken, '"');
            int[] sizes = extractSizeFromDimention(dimension.split(":")[1]);
            br.skip(30); //Between dimension and next tag c exists more or less 30 chars
            result = new Object[sizes[0]][sizes[1]];
            String v;
            while ((v = extractNextValue(br, tokenOpenC, '"')) != null) {
                int[] indexes = extractSizeFromDimention(v);

                int s = foundNextTokens(br, '>', tokenAttributS, tokenAttributT);
                char type = 's'; //3 types: number (n), string (s) and date (d)
                if (s == 0) { // Token S = number or date
                    char read = (char) br.read();
                    if (read == '1') {
                        type = 'n';
                    } else {
                        type = 'd';
                    }
                } else if (s == -1) {
                    type = 'n';
                }
                String c = extractNextValue(br, tokenOpenV, '<');
                Object value = null;
                switch (type) {
                    case 'n':
                        value = Double.parseDouble(c);
                        break;
                    case 's':
                        value = Integer.parseInt(c);
                        break;
                    case 'd':
                        value = c.toString();
                        break;
                }
                result[indexes[0] - 1][indexes[1] - 1] = value;
                br.skip(7); ///v></c>
            }
        }
        return result;
    }

    public static int[] extractSizeFromDimention(String dimention) {
        StringBuilder sb = new StringBuilder();
        int columns = 0;
        int rows = 0;
        for (char c : dimention.toCharArray()) {
            if (columns == 0) {
                if (Character.isDigit(c)) {
                    columns = convertExcelIndex(sb.toString());
                    sb = new StringBuilder();
                }
            }
            sb.append(c);
        }
        rows = Integer.parseInt(sb.toString());
        return new int[]{rows, columns};
    }

    public static String[] processSharedStrings(ZipFile zipFile) throws IOException {
        String entry = "xl/sharedStrings.xml";
        String[] words = null;
        char[] wordCount = "Count=\"".toCharArray();
        char[] token = "<t>".toCharArray();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(zipFile.getInputStream(zipFile.getEntry(entry)), Charset.forName("UTF-8")))) {
            String uniqueCount = extractNextValue(br, wordCount, '"');
            words = new String[Integer.parseInt(uniqueCount)];
            String nextWord;
            int currentIndex = 0;
            while ((nextWord = extractNextValue(br, token, '<')) != null) {
                words[currentIndex++] = nextWord;
                br.skip(11); //you can skip at least 11 chars "/t></si><si>"
            }
        }
        return words;
    }

    public static int foundNextTokens(BufferedReader br, char until, char[]... tokens) throws IOException {
        char character;
        int[] indexes = new int[tokens.length];
        while ((character = (char) br.read()) != CHAR_END) {
            if (character == until) {
                break;
            }
            for (int i = 0; i < indexes.length; i++) {
                if (tokens[i][indexes[i]] == character) {
                    indexes[i]++;
                    if (indexes[i] == tokens[i].length) {
                        return i;
                    }
                } else {
                    indexes[i] = 0;
                }
            }
        }

        return -1;
    }

    public static String extractNextValue(BufferedReader br, char[] token, char until) throws IOException {
        char character;
        StringBuilder sb = new StringBuilder();
        int index = 0;

        while ((character = (char) br.read()) != CHAR_END) {
            if (index == token.length) {
                if (character == until) {
                    return sb.toString();
                } else {
                    sb.append(character);
                }
            } else {
                if (token[index] == character) {
                    index++;
                } else {
                    index = 0;
                }
            }
        }
        return null;
    }

    public static int convertExcelIndex(String index) {
        int result = 0;
        for (char c : index.toCharArray()) {
            result = result * 26 + ((int) c - (int) 'A' + 1);
        }
        return result;
    }

}

Solution 2

Most programming languages that work with Office products have some middle layer and this is usually where the bottleneck is, a good example is using PIA's/Interop or Open XML SDK.

One way to get the data at a lower level (bypassing the middle layer) is using a Driver.

150MB one-sheet excel file that takes about 7 minutes.

The best I could do is a 130MB file in 135 seconds, roughly 3 times faster:

Stopwatch sw = new Stopwatch();
sw.Start();

DataSet excelDataSet = new DataSet();

string filePath = @"c:\temp\BigBook.xlsx";

// For .XLSXs we use =Microsoft.ACE.OLEDB.12.0;, for .XLS we'd use Microsoft.Jet.OLEDB.4.0; with  "';Extended Properties=\"Excel 8.0;HDR=YES;\"";
string connectionString = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source='" + filePath + "';Extended Properties=\"Excel 12.0;HDR=YES;\"";

using (OleDbConnection conn = new OleDbConnection(connectionString))
{
    conn.Open();
    OleDbDataAdapter objDA = new System.Data.OleDb.OleDbDataAdapter
    ("select * from [Sheet1$]", conn);
    objDA.Fill(excelDataSet);
    //dataGridView1.DataSource = excelDataSet.Tables[0];
}
sw.Stop();
Debug.Print("Load XLSX tool: " + sw.ElapsedMilliseconds + " millisecs. Records = "  + excelDataSet.Tables[0].Rows.Count);

enter image description here

Win 7x64, Intel i5, 2.3ghz, 8GB ram, SSD250GB.

If I could recommend a hardware solution as well, try to resolve it with an SSD if you're using standard HDD's.

Note: I cant download your Excel spreadsheet example as I'm behind a corporate firewall.

PS. See MSDN - Fastest Way to import xlsx files with 200 MB of Data, the consensus being OleDB is the fastest.

PS 2. Here's how you can do it with python: http://code.activestate.com/recipes/440661-read-tabular-data-from-excel-spreadsheets-the-fast/

Solution 3

I managed to read the file in about 30 seconds using .NET core and the Open XML SDK.

The following example returns a list of objects containing all rows and cells with the matching types, it supports date, numeric and text cells. The project is available here: https://github.com/xferaa/BigSpreadSheetExample/ (Should work on Windows, Linux and Mac OS and does not require Excel or any Excel component to be installed).

public List<List<object>> ParseSpreadSheet()
{
    List<List<object>> rows = new List<List<object>>();

    using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(filePath, false))
    {
        WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
        WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();

        OpenXmlReader reader = OpenXmlReader.Create(worksheetPart);

        Dictionary<int, string> sharedStringCache = new Dictionary<int, string>();

        int i = 0;
        foreach (var el in workbookPart.SharedStringTablePart.SharedStringTable.ChildElements)
        {
            sharedStringCache.Add(i++, el.InnerText);
        }

        while (reader.Read())
        {
            if(reader.ElementType == typeof(Row))
            {
                reader.ReadFirstChild();

                List<object> cells = new List<object>();

                do
                {
                    if (reader.ElementType == typeof(Cell))
                    {
                        Cell c = (Cell)reader.LoadCurrentElement();

                        if (c == null || c.DataType == null || !c.DataType.HasValue)
                            continue;

                        object value;

                        switch(c.DataType.Value)
                        {
                            case CellValues.Boolean:
                                value = bool.Parse(c.CellValue.InnerText);
                                break;
                            case CellValues.Date:
                                value = DateTime.Parse(c.CellValue.InnerText);
                                break;
                            case CellValues.Number:
                                value = double.Parse(c.CellValue.InnerText);
                                break;
                            case CellValues.InlineString:
                            case CellValues.String:
                                value = c.CellValue.InnerText;
                                break;
                            case CellValues.SharedString:
                                value = sharedStringCache[int.Parse(c.CellValue.InnerText)];
                                break;
                            default:
                                continue;
                        }

                        if (value != null)
                            cells.Add(value);
                    }

                } while (reader.ReadNextSibling());

                if (cells.Any())
                    rows.Add(cells);
            }
        }
    }

    return rows;
}

I ran the program in a three year old Laptop with a SSD drive, 8GB of RAM and an Intel Core i7-4710 CPU @ 2.50GHz (two cores) on Windows 10 64 bits.

Note that although opening and parsing the whole file as strings takes a bit less than 30 seconds, when using objects as in the example of my last edit, the time goes up to almost 50 seconds with my crappy laptop. You will probably get closer to 30 seconds in your server with Linux.

The trick was to use the SAX approach as explained here:

https://msdn.microsoft.com/en-us/library/office/gg575571.aspx

Solution 4

Python's Pandas library could be used to hold and process your data, but using it to directly load the .xlsx file will be quite slow, e.g. using read_excel().

One approach would be to use Python to automate the conversion of your file into CSV using Excel itself and to then use Pandas to load the resulting CSV file using read_csv(). This will give you a good speed up, but not under 30 seconds:

import win32com.client as win32        
import pandas as pd    
from datetime import datetime    

print ("Starting")
start = datetime.now()

# Use Excel to load the xlsx file and save it in csv format
excel = win32.gencache.EnsureDispatch('Excel.Application')
wb = excel.Workbooks.Open(r'c:\full path\BigSpreadsheet.xlsx')
excel.DisplayAlerts = False
wb.DoNotPromptForConvert = True
wb.CheckCompatibility = False

print('Saving')
wb.SaveAs(r'c:\full path\temp.csv', FileFormat=6, ConflictResolution=2) 
excel.Application.Quit()

# Use Pandas to load the resulting CSV file
print('Loading CSV')
df = pd.read_csv(r'c:\full path\temp.csv', dtype=str)

print(df.shape)
print("Done", datetime.now() - start)

Column types
The types for your columns can be specified by passing dtype and converters and parse_dates:

df = pd.read_csv(r'c:\full path\temp.csv', dtype=str, converters={10:int}, parse_dates=[8], infer_datetime_format=True)

You should also specify infer_datetime_format=True, as this will greatly speed up the date conversion.

nfer_datetime_format : boolean, default False

If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by 5-10x.

Also add dayfirst=True if dates are in the form DD/MM/YYYY.

Selective columns
If you actually only need to work on columns 1 9 11, then you could further reduce resources by specifying usecols=[0, 8, 10] as follows:

df = pd.read_csv(r'c:\full path\temp.csv', dtype=str, converters={10:int}, parse_dates=[1], dayfirst=True, infer_datetime_format=True, usecols=[0, 8, 10])

The resulting dataframe would then only contain those 3 columns of data.

RAM drive
Using a RAM drive to store the temporary CSV file to would further speed up the load time.

Note: This does assume you are using a Windows PC with Excel available.

Solution 5

I have created an sample Java program which is able to load the file in ~40 seconds my laptop ( Intel i7 4 core, 16 GB RAM).

https://github.com/skadyan/largefile

This program uses the Apache POI library to load the .xlsx file using the XSSF SAX API.

The callback interface com.stackoverlfow.largefile.RecordHandler implementation can be used to process the data loaded from the excel. This interface define only one method which take three arguments

  • sheetname : String, excel sheet name
  • row number: int, row number of data
  • and data map: Map: excel cell reference and excel formatted cell value

The class com.stackoverlfow.largefile.Main demonstrate one basic implementation of this interface which just print the row number on console.

Update

woodstox parser seems have better performance than standard SAXReader. (code updated in repo).

Also in order to meet the desired performance requirement, you may consider to re-implement the org.apache.poi...XSSFSheetXMLHandler. In the implementation, more optimized string/text value handling can be implemented and unnecessary text formatting operation may be skipped.

Share:
22,253
David542
Author by

David542

Updated on October 03, 2020

Comments

  • David542
    David542 over 3 years

    I have a 150MB one-sheet excel file that takes about 7 minutes to open on a very powerful machine using the following:

    # using python
    import xlrd
    wb = xlrd.open_workbook(file)
    sh = wb.sheet_by_index(0)
    

    Is there any way to open the excel file quicker? I'm open to even very outlandish suggestions (such as hadoop, spark, c, java, etc.). Ideally I'm looking for a way to open the file in under 30 seconds if that's not a pipe dream. Also, the above example is using python, but it doesn't have to be python.


    Note: this is an Excel file from a client. It cannot be converted into any other format before we receive it. It is not our file


    UPDATE: Answer with a working example of code that will open the following 200MB excel file in under 30 seconds will be rewarded with bounty: https://drive.google.com/file/d/0B_CXvCTOo7_2VW9id2VXRWZrbzQ/view?usp=sharing. This file should have string (col 1), date (col 9), and number (col 11).

  • David542
    David542 over 6 years
    This seems to only load the necessary sheets, but in the above case there is only one sheet to load, so the on_demand saves no time (I tested it).
  • David542
    David542 over 6 years
    Thanks, I'll test this out -- want to put the GitHub code for the main file, Launcher.java, in the answer itself? I think that might be more helpful for future people that may want to view it.
  • David542
    David542 over 6 years
    Thanks for this answer -- could you time it with the sample excel file and see how it performs?
  • Martin Evans
    Martin Evans over 6 years
    Running it on my machine gives ~60s. In comparison, I get 7m30s using just xlrd.
  • devsaki
    devsaki over 6 years
    Well, I check what happens when the cell is not a string and the code is not going to work, but implementing that features is not hard (only check what the attribute t is in c tag. "s" => string). The cell styles shouldn't break the code.
  • David542
    David542 over 6 years
    got it -- I would think the three types we'll need are (1) string; (2) number; and (3) date. I know Excel stores a date as a number but we'd need to be able to know that a cell is a date somehow. What does the timing take allowing (2) and (3)?
  • devsaki
    devsaki over 6 years
    Ok, I added suppport to date and numbers
  • David542
    David542 over 6 years
    great, I'll also update the question/file so that it includes numeric and date fields so that it can be tested easily.
  • devsaki
    devsaki over 6 years
    The updated file doesn't have dates. But I already updated my answer supporting numbers and date types, didn't work? I tested with your new file and it's working here.
  • ian0411
    ian0411 over 6 years
    Use Python Pandas is a superb way to convert Excel to CSV. However, I was able to save more than 30 extra seconds with my slow computer (148s down to 113s, i.stack.imgur.com/EzbvR.jpg) with running VBScript in Python. But OP may not like it so just leave this in the comment.
  • David542
    David542 over 6 years
    very neat approach, thank you. What do you mean by using objects ? Is this a non-string, or what?
  • Isma
    Isma over 6 years
    No problem , this is fun stuff ;) The value from each cell of the Excel file (which is stored as XML text) is converted to an object according the data type of the cell so you end up having a list of a list of objects. The first list contains each row and each row contains a second list with each cell. This is a bit slower but it will pay off when you want to do something with the data and save it to the database. For example, you could use entity framework bulk import to import all objects at once.
  • David542
    David542 over 6 years
    thanks for this. So what would be the whole code to 'read' the xls(x) file and get it in memory from the above xls file?
  • obgnaw
    obgnaw over 6 years
  • MineR
    MineR over 6 years
    This is likely the fastest possible, without unzipping and parsing the XML manually, which will be a lot of code, and difficult to maintain.
  • Isma
    Isma over 6 years
    This is still using ADO internally so the results won't be much faster than the ADO.NET version which by the way is getting most of the votes but it is not the fastest / cleanest solution.
  • Isma
    Isma over 6 years
    I think so too, yet people seem to be leaning towards the OleDb solution which is at least 50% slower and IMHO less clean.
  • obgnaw
    obgnaw over 6 years
    @Isma There are two bottleneck, middle layer and file IO.The quickest version need assumption.and i think OP finally will need RAM Drive to reduce time of file IO.
  • Kul-Tigin
    Kul-Tigin over 6 years
    @Isma case CellValues.Date: is Office 2010 only data type. Casual dates stored as shared string with a correspoing double value to convert an OADate with 1900 or 1904 epoch, no data type present for cells. Phew. You'll need to check StyleIndex of the cells to format date or number (or percent etc..) So there's a lot work in fact.
  • Kul-Tigin
    Kul-Tigin over 6 years
    @Isma one last thing. Empty cells are not present in the xml document, you also need to handle this. Looking cell references in the loop would help to detect missing ones (jumping cell A1 to D1 gives a clue), so you can add empty items to your cells list. Eventually this is not a format easy to handle. There are too many tricky parts like these.
  • David542
    David542 over 6 years
    actually this is the exact implementation we are currently using to parse the file. So looking for a solution that's faster than this.
  • skadya
    skadya over 6 years
    To improve the performance what I can suggest you to try to use different Xml parser (e.g. github.com/FasterXML/woodstox). I observe slightly better results with this parser. see updated answer.
  • devsaki
    devsaki over 6 years
    I added another solution, it takes less time, but it needs a lot of heap memory.