Low memory writing/reading with Apache POI
Solution 1
To handle large data with low memory, the best and I think the only option is SXSSF api-s. If you need to read some data of the existing cells, I assume you do not need the entire 4M+ at the same time. In such a case based on your application requirement, you can handle the window size yourself and keep in memory only the amount of data you need at a particular time. You can start by looking at the example at : http://poi.apache.org/spreadsheet/how-to.html#sxssf
Something as
SXSSFWorkbook wb = new SXSSFWorkbook(-1); // turn off auto-flushing and accumulate all rows in memory
// manually control how rows are flushed to disk
if(rownum % NOR == 0) {
((SXSSFSheet)sh).flushRows(NOR); // retain NOR last rows and flush all others
Hope this helps.
Solution 2
I used SAX parser to process events of the XML document presentation. This is
import com.sun.org.apache.xerces.internal.parsers.SAXParser;
import org.apache.poi.openxml4j.opc.PackageAccess;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.model.SharedStringsTable;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xssf.usermodel.XSSFRichTextString;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import java.io.BufferedInputStream;
import java.io.InputStream;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
public class LowMemoryExcelFileReader {
private String file;
public LowMemoryExcelFileReader(String file) {
this.file = file;
}
public List<String[]> read() {
try {
return processFirstSheet(file);
} catch (Exception e) {
throw new RuntimeException(e);
}
}
private List<String []> readSheet(Sheet sheet) {
List<String []> res = new LinkedList<>();
Iterator<Row> rowIterator = sheet.rowIterator();
while (rowIterator.hasNext()) {
Row row = rowIterator.next();
int cellsNumber = row.getLastCellNum();
String [] cellsValues = new String[cellsNumber];
Iterator<Cell> cellIterator = row.cellIterator();
int cellIndex = 0;
while (cellIterator.hasNext()) {
Cell cell = cellIterator.next();
cellsValues[cellIndex++] = cell.getStringCellValue();
}
res.add(cellsValues);
}
return res;
}
public String getFile() {
return file;
}
public void setFile(String file) {
this.file = file;
}
private List<String []> processFirstSheet(String filename) throws Exception {
OPCPackage pkg = OPCPackage.open(filename, PackageAccess.READ);
XSSFReader r = new XSSFReader(pkg);
SharedStringsTable sst = r.getSharedStringsTable();
SheetHandler handler = new SheetHandler(sst);
XMLReader parser = fetchSheetParser(handler);
Iterator<InputStream> sheetIterator = r.getSheetsData();
if (!sheetIterator.hasNext()) {
return Collections.emptyList();
}
InputStream sheetInputStream = sheetIterator.next();
BufferedInputStream bisSheet = new BufferedInputStream(sheetInputStream);
InputSource sheetSource = new InputSource(bisSheet);
parser.parse(sheetSource);
List<String []> res = handler.getRowCache();
bisSheet.close();
return res;
}
public XMLReader fetchSheetParser(ContentHandler handler) throws SAXException {
XMLReader parser = new SAXParser();
parser.setContentHandler(handler);
return parser;
}
/**
* See org.xml.sax.helpers.DefaultHandler javadocs
*/
private static class SheetHandler extends DefaultHandler {
private static final String ROW_EVENT = "row";
private static final String CELL_EVENT = "c";
private SharedStringsTable sst;
private String lastContents;
private boolean nextIsString;
private List<String> cellCache = new LinkedList<>();
private List<String[]> rowCache = new LinkedList<>();
private SheetHandler(SharedStringsTable sst) {
this.sst = sst;
}
public void startElement(String uri, String localName, String name,
Attributes attributes) throws SAXException {
// c => cell
if (CELL_EVENT.equals(name)) {
String cellType = attributes.getValue("t");
if(cellType != null && cellType.equals("s")) {
nextIsString = true;
} else {
nextIsString = false;
}
} else if (ROW_EVENT.equals(name)) {
if (!cellCache.isEmpty()) {
rowCache.add(cellCache.toArray(new String[cellCache.size()]));
}
cellCache.clear();
}
// Clear contents cache
lastContents = "";
}
public void endElement(String uri, String localName, String name)
throws SAXException {
// Process the last contents as required.
// Do now, as characters() may be called more than once
if(nextIsString) {
int idx = Integer.parseInt(lastContents);
lastContents = new XSSFRichTextString(sst.getEntryAt(idx)).toString();
nextIsString = false;
}
// v => contents of a cell
// Output after we've seen the string contents
if(name.equals("v")) {
cellCache.add(lastContents);
}
}
public void characters(char[] ch, int start, int length)
throws SAXException {
lastContents += new String(ch, start, length);
}
public List<String[]> getRowCache() {
return rowCache;
}
}
}
Related videos on Youtube
Nicklas A.
Updated on July 08, 2022Comments
-
Nicklas A. almost 2 years
I'm trying to write a pretty large XLSX file (4M+ cells) and I'm having some memory issues.
I can't use SXSSF since I also need to read the existing cells in the template.
Is there anything I can do to reduce the memory footprint?
Perhaps combine streaming reading and streaming writing? -
Nicklas A. over 10 yearsThe problem with SXSSF is that you can't read stuff at all. You can access stuff you've already written but you can't access stuff that's already in the workbook.
-
ParoTech over 10 yearsFor reading use the Event Model, a good sample is given at poi.apache.org/spreadsheet/how-to.html - XSSF and SAX (Event API) I have used it myself, it is a bit complicated but serves the purpose of reading huge data without running out of memory. You can also look at the small samples at poi.apache.org/spreadsheet/quick-guide.html to get an idea.
-
Nicklas A. over 10 yearsBut the thing is that we need to read a cell and then write that same cell again which will be tricky.
-
Paolo Forgia almost 6 yearsI tried your code and I happened to notice that it doesn't read the last row, are you aware of this?
-
DaviM about 3 yearsLooks like to handle empty cells, a cell must have a style. If you check generated xml file from excel. If it does contain a style, it appears inside xml file. Empty cell without data tag (enough to use the code above)
-
Yan Khonski about 3 yearsFeel free to update the code, it was long time when I used it. Maybe the latest version of the library can solve this issue or provides an alternative way to solve this problem.