Low memory writing/reading with Apache POI

11,813

Solution 1

To handle large data with low memory, the best and I think the only option is SXSSF api-s. If you need to read some data of the existing cells, I assume you do not need the entire 4M+ at the same time. In such a case based on your application requirement, you can handle the window size yourself and keep in memory only the amount of data you need at a particular time. You can start by looking at the example at : http://poi.apache.org/spreadsheet/how-to.html#sxssf

Something as

SXSSFWorkbook wb = new SXSSFWorkbook(-1); // turn off auto-flushing and accumulate all rows in memory
// manually control how rows are flushed to disk 
if(rownum % NOR == 0) {
((SXSSFSheet)sh).flushRows(NOR); // retain NOR last rows and flush all others

Hope this helps.

Solution 2

I used SAX parser to process events of the XML document presentation. This is

import com.sun.org.apache.xerces.internal.parsers.SAXParser;
import org.apache.poi.openxml4j.opc.PackageAccess;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.model.SharedStringsTable;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xssf.usermodel.XSSFRichTextString;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

import java.io.BufferedInputStream;
import java.io.InputStream;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;


public class LowMemoryExcelFileReader {

    private String file;

    public LowMemoryExcelFileReader(String file) {
        this.file = file;
    }

    public List<String[]> read() {
        try {
            return processFirstSheet(file);
        } catch (Exception e) {
           throw new RuntimeException(e);
        }
    }

    private List<String []> readSheet(Sheet sheet) {
        List<String []> res = new LinkedList<>();
        Iterator<Row> rowIterator = sheet.rowIterator();

        while (rowIterator.hasNext()) {
            Row row = rowIterator.next();
            int cellsNumber = row.getLastCellNum();
            String [] cellsValues = new String[cellsNumber];

            Iterator<Cell> cellIterator = row.cellIterator();
            int cellIndex = 0;

            while (cellIterator.hasNext()) {
                Cell cell = cellIterator.next();
                cellsValues[cellIndex++] = cell.getStringCellValue();
            }

            res.add(cellsValues);
        }
        return res;
    }

    public String getFile() {
        return file;
    }

    public void setFile(String file) {
        this.file = file;
    }

    private List<String []> processFirstSheet(String filename) throws Exception {
        OPCPackage pkg = OPCPackage.open(filename, PackageAccess.READ);
        XSSFReader r = new XSSFReader(pkg);
        SharedStringsTable sst = r.getSharedStringsTable();

        SheetHandler handler = new SheetHandler(sst);
        XMLReader parser = fetchSheetParser(handler);
        Iterator<InputStream> sheetIterator = r.getSheetsData();

        if (!sheetIterator.hasNext()) {
            return Collections.emptyList();
        }

        InputStream sheetInputStream = sheetIterator.next();
        BufferedInputStream bisSheet = new BufferedInputStream(sheetInputStream);
        InputSource sheetSource = new InputSource(bisSheet);
        parser.parse(sheetSource);
        List<String []> res = handler.getRowCache();
        bisSheet.close();
        return res;
    }

    public XMLReader fetchSheetParser(ContentHandler handler) throws SAXException {
        XMLReader parser = new SAXParser();
        parser.setContentHandler(handler);
        return parser;
    }

    /**
     * See org.xml.sax.helpers.DefaultHandler javadocs
     */
    private static class SheetHandler extends DefaultHandler {

        private static final String ROW_EVENT = "row";
        private static final String CELL_EVENT = "c";

        private SharedStringsTable sst;
        private String lastContents;
        private boolean nextIsString;

        private List<String> cellCache = new LinkedList<>();
        private List<String[]> rowCache = new LinkedList<>();

        private SheetHandler(SharedStringsTable sst) {
            this.sst = sst;
        }

        public void startElement(String uri, String localName, String name,
                                 Attributes attributes) throws SAXException {
            // c => cell
            if (CELL_EVENT.equals(name)) {
                String cellType = attributes.getValue("t");
                if(cellType != null && cellType.equals("s")) {
                    nextIsString = true;
                } else {
                    nextIsString = false;
                }
            } else if (ROW_EVENT.equals(name)) {
                if (!cellCache.isEmpty()) {
                    rowCache.add(cellCache.toArray(new String[cellCache.size()]));
                }
                cellCache.clear();
            }

            // Clear contents cache
            lastContents = "";
        }

        public void endElement(String uri, String localName, String name)
                throws SAXException {
            // Process the last contents as required.
            // Do now, as characters() may be called more than once
            if(nextIsString) {
                int idx = Integer.parseInt(lastContents);
                lastContents = new XSSFRichTextString(sst.getEntryAt(idx)).toString();
                nextIsString = false;
            }

            // v => contents of a cell
            // Output after we've seen the string contents
            if(name.equals("v")) {
                cellCache.add(lastContents);
            }
        }

        public void characters(char[] ch, int start, int length)
                throws SAXException {
            lastContents += new String(ch, start, length);
        }

        public List<String[]> getRowCache() {
            return rowCache;
        }
    }
}
Share:
11,813

Related videos on Youtube

Nicklas A.
Author by

Nicklas A.

Updated on July 08, 2022

Comments

  • Nicklas A.
    Nicklas A. almost 2 years

    I'm trying to write a pretty large XLSX file (4M+ cells) and I'm having some memory issues.

    I can't use SXSSF since I also need to read the existing cells in the template.

    Is there anything I can do to reduce the memory footprint?
    Perhaps combine streaming reading and streaming writing?

  • Nicklas A.
    Nicklas A. over 10 years
    The problem with SXSSF is that you can't read stuff at all. You can access stuff you've already written but you can't access stuff that's already in the workbook.
  • ParoTech
    ParoTech over 10 years
    For reading use the Event Model, a good sample is given at poi.apache.org/spreadsheet/how-to.html - XSSF and SAX (Event API) I have used it myself, it is a bit complicated but serves the purpose of reading huge data without running out of memory. You can also look at the small samples at poi.apache.org/spreadsheet/quick-guide.html to get an idea.
  • Nicklas A.
    Nicklas A. over 10 years
    But the thing is that we need to read a cell and then write that same cell again which will be tricky.
  • Paolo Forgia
    Paolo Forgia almost 6 years
    I tried your code and I happened to notice that it doesn't read the last row, are you aware of this?
  • DaviM
    DaviM about 3 years
    Looks like to handle empty cells, a cell must have a style. If you check generated xml file from excel. If it does contain a style, it appears inside xml file. Empty cell without data tag (enough to use the code above)
  • Yan Khonski
    Yan Khonski about 3 years
    Feel free to update the code, it was long time when I used it. Maybe the latest version of the library can solve this issue or provides an alternative way to solve this problem.