Importing a big xlsx file into R?

Solution 1

I stumbled on this question when someone sent me (yet another) Excel file to analyze. This one isn't even that big but for whatever reason I was running into a similar error:

java.lang.OutOfMemoryError: GC overhead limit exceeded

Based on a comment by @DirkEddelbuettel on a previous answer, I installed the openxlsx package (http://cran.r-project.org/web/packages/openxlsx/) and then ran:

library("openxlsx")
mydf <- read.xlsx("BigExcelFile.xlsx", sheet = 1, startRow = 2, colNames = TRUE)

It was just what I was looking for. Easy to use and wicked fast. It's my new BFF. Thanks for the tip @DirkEddelbuettel!

Solution 2

options(java.parameters = "-Xmx2048m")  ## memory set to 2 GB
library(XLConnect)

Allow for more memory using options() before any Java component is loaded, then load the XLConnect library (it uses Java).

That's it. Start reading in data with readWorksheet(), and so on. :)
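
For reference, a minimal sketch of the full sequence (the file name here is the one from the original question; adjust the sheet index as needed):

options(java.parameters = "-Xmx2048m")  # as above: must run before any Java-backed package loads
library(XLConnect)

wb <- loadWorkbook("MyBigFile.xlsx")    # open the workbook
mydata <- readWorksheet(wb, sheet = 1)  # read the first sheet into a data frame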

Solution 3

I agree with @orville jackson's response; it really helped me too.

In line with the answer provided by @orville jackson, here is a detailed description of how you can use openxlsx for reading and writing big files.

When the data size is small, R has many packages and functions that can be used as per your requirement.

write.xlsx, write.xlsx2, and XLConnect also do the job, but they are sometimes slow compared to openxlsx.

So, if you are dealing with large data sets and run into Java errors, I suggest having a look at openxlsx, which is really awesome and cut the time to about 1/12th for me.

I've tested them all, and in the end I was really impressed with the performance of openxlsx.

Here are the steps for writing multiple datasets into multiple sheets.

install.packages("openxlsx")
library("openxlsx")

start.time <- Sys.time()

# Creating large data frames
x <- as.data.frame(matrix(1:4000000,200000,20))
y <- as.data.frame(matrix(1:4000000,200000,20))
z <- as.data.frame(matrix(1:4000000,200000,20))

# Creating a workbook (the target file name is supplied later, in saveWorkbook)
wb <- createWorkbook()
Sys.setenv("R_ZIPCMD" = "C:/Rtools/bin/zip.exe") ## path to zip.exe

The Sys.setenv("R_ZIPCMD" = "C:/Rtools/bin/zip.exe") path has to stay fixed, because openxlsx relies on the zip utility shipped with Rtools to build the xlsx archive.

Note: In case Rtools is not installed on your system, please install it first for a smooth experience. Here is the link for your reference (choose the appropriate version): https://cran.r-project.org/bin/windows/Rtools/

Check the options as per the link below (you need to select all the checkboxes during installation): https://cloud.githubusercontent.com/assets/7400673/12230758/99fb2202-b8a6-11e5-82e6-836159440831.png

# Adding worksheets: the parameters for addWorksheet are 1. workbook 2. sheet name

addWorksheet(wb, "Sheet 1")
addWorksheet(wb, "Sheet 2")
addWorksheet(wb, "Sheet 3")

# Writing data into the respective sheets: the parameters for writeData are 1. workbook 2. sheet index/sheet name 3. data frame name

writeData(wb, 1, x)

# In case you would like the sheet to have a filter available for ease of access, pass the parameter withFilter = TRUE in the writeData function.
writeData(wb, 2, x = y, withFilter = TRUE)

## Similarly, writeDataTable is another way to represent your data with table formatting:

writeDataTable(wb, 3, z)

saveWorkbook(wb, file = "Example.xlsx", overwrite = TRUE)

end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

The openxlsx package is really good for reading and writing huge data sets from/to Excel files and has lots of options for custom formatting within Excel.

The interesting fact is that we don't have to bother about Java heap memory here.
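
Since the walkthrough above covers writing, here is a minimal sketch of reading the data back with openxlsx's read.xlsx (assuming the Example.xlsx workbook created above):

library("openxlsx")

# Read a sheet back into a data frame; sheets can be addressed
# by index or by name
x2 <- read.xlsx("Example.xlsx", sheet = 1, colNames = TRUE)
z2 <- read.xlsx("Example.xlsx", sheet = "Sheet 3")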

Solution 4

I know this question is a bit old, but there is a good solution for it nowadays: readxl. It is the package RStudio uses by default when you import an Excel file through the GUI, and it works well in my situation.

library(readxl)

data <- read_excel("MyBigFile.xlsx")  # readxl has no Java dependency
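
If you need a particular sheet or have to skip leading rows, read_excel covers that too; a sketch, again assuming the file name from the original question:

library(readxl)

excel_sheets("MyBigFile.xlsx")                             # list the sheet names first
data <- read_excel("MyBigFile.xlsx", sheet = 1, skip = 1)  # sheet 1, skipping the first row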

Solution 5

As mentioned in the canonical Excel->R question, a more recent alternative is the readxl package, which I've found to be quite fast compared with, e.g., openxlsx and xlsx.

That said, there's a definite limit of spreadsheet size past which you're probably better off just saving the thing as a .csv and using data.table::fread.
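
For example, a sketch assuming the spreadsheet has already been exported to BigFile.csv:

library(data.table)

# fread auto-detects the separator and column types, and is very
# fast on large flat files
DT <- fread("BigFile.csv")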

Comments

  • user2722443
    user2722443 almost 2 years

    I'm wondering if anyone knows of a way to import data from a "big" xlsx file (~20 MB). I tried to use the xlsx and XLConnect libraries. Unfortunately, both use rJava and I always obtain the same error:

    > library(XLConnect)
    > wb <- loadWorkbook("MyBigFile.xlsx")
    Error: OutOfMemoryError (Java): Java heap space
    

    or

    > library(xlsx)
    > mydata <- read.xlsx2(file="MyBigFile.xlsx")
    Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl,  : 
       java.lang.OutOfMemoryError: Java heap space
    

    I also tried to modify the java.parameters before loading rJava:

    > options( java.parameters = "-Xmx2500m")
    > library(xlsx) # load rJava
    > mydata <- read.xlsx2(file="MyBigFile.xlsx")
    Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl,  : 
       java.lang.OutOfMemoryError: Java heap space
    

    or after loading rJava (this is a bit stupid, I think):

    > library(xlsx) # load rJava
    > options( java.parameters = "-Xmx2500m")
    > mydata <- read.xlsx2(file="MyBigFile.xlsx")
    Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl,  : 
       java.lang.OutOfMemoryError: Java heap space
    

    But nothing works. Does anyone have an idea?

    • flodel
      flodel over 10 years
      Have you considered saving your data into a more universal format, e.g. csv?
    • Ricardo Saporta
      Ricardo Saporta over 10 years
      gdata is another option. I believe it is not java based, but I could be mistaken.
    • Ben
      Ben over 10 years
      That's right, gdata uses Perl
    • Spacedman
      Spacedman over 10 years
      Why is it that big? Lots of rows (do you need them all?), lots of columns (do you need them all?), lots of individual sheets (do you need them all?), one high-resolution embedded image (you don't need that...)? For spreadsheet and other binary files the size of the file in bytes is often not a useful measure of how big the data in it really is.
    • user2722443
      user2722443 over 10 years
      gdata works... very slowly, about 7 min per sheet, but it works.
    • user2722443
      user2722443 over 10 years
      @flodel: you are right about csv, usually I do that. Unfortunately, in my case I have no choice because my inputs are several xlsx files with 5 sheets each (10000 rows x 80 columns). I could manually open each with Excel and export to csv (or write some VBA code to do that), but I'd rather do it entirely in R.
    • user2722443
      user2722443 over 10 years
      @Spacedman: My xlsx file only contains "raw data" (numeric and some factors).
    • Matt Parker
      Matt Parker almost 10 years
      I've been working on importing a colleague's monstrous, formula-laden Excel file (150 MB), and gdata was the only Excel package that could pull it off. As here, Java-based packages ran out of memory; openxlsx segfaulted. gdata took 30 minutes per sheet, but it got the job done.
    • Oeufcoque Penteano
      Oeufcoque Penteano over 8 years
      +1 gdata, had to load 12 excel tables mid-sized and xlsx took an horrendous amount of time. gdata made it a breeze.
    • HNSKD
      HNSKD over 7 years
      gdata requires Perl. Does anyone know what that is?
  • user2722443
    user2722443 over 10 years
    Unfortunately, the loadWorkbook command generates an "OutOfMemoryError". With the same idea, I tried mydata.chunk = read.xlsx2(file="MyBigFile.xlsx", sheetIndex=1, startRow=1, endRow=10), but it's still the same error.
  • Ricardo Saporta
    Ricardo Saporta over 10 years
    @user2722443, are you saving the portions you've read in, then removing them from memory? Also try running gc() in each for loop (see the sketch at the end of this page). It will slow you down, but clear out some memory. Incidentally, are you sure that converting to CSV is out of the question?
  • user2722443
    user2722443 over 10 years
    @{Ricardo Saporta} in fact even mydata.chunk = read.xlsx2(file="MyBigFile.xlsx", sheetIndex=1, startRow=1, endRow=10) generates an "OutOfMemoryError", so I can't remove anything. Concerning the CSV conversion, it's not totally out of the question, but it's an external operation (before loading into R).
  • Dirk Eddelbuettel
    Dirk Eddelbuettel almost 10 years
    ... which is what has been available for a decade in the gdata package for R (but using Perl behind the scenes).
  • aaron
    aaron almost 10 years
    When I worked on the problem using gdata, it was unacceptably slow. This Python script converts large xlsx files extremely quickly.
  • mlt
    mlt almost 10 years
    How is this answer different from @flodel's suggestion mentioned in another answer? IMHO RODBC has few advantages over intermediate CSV format.
  • Dirk Eddelbuettel
    Dirk Eddelbuettel almost 10 years
    There is also a new kid on the block: openxlsx which uses just Rcpp and nothing but C++ code--and claims to be very fast. Not sure how refined it is.
  • nasia jaffri
    nasia jaffri over 9 years
    I tried so many methods to read a big .xlsx file, but nothing seemed to work for me. I was getting an error when using Schaun Wheeler's function on GitHub, and could not figure out how to use the perl command in gdata on my computer. "openxlsx" is such a life saver for me. Thanks @Dirk Eddelbuettel and Orville Jackson.
  • user124123
    user124123 over 9 years
    Do you know of another solution? I can't find a way to open .xls files with openxlsx
  • orville jackson
    orville jackson over 9 years
    You could try the read.xls function in the gdata package. Never used it myself but worth a shot.
  • agenis
    agenis about 8 years
      openxlsx is the only library that worked for my Excel file (70 MB), but I first had to convert it from .xls to .xlsx.
  • pbnelson
    pbnelson about 7 years
    Thanks for the tip. Important to note: I had to issue the options(java.parameters = "-Xmx2048m") before issuing require('rJava') when using this within R-Studio. Unfortunately I'm getting a new error, now: "java.lang.OutOfMemoryError: GC overhead limit exceeded", but that's a different problem, I'm sure.
  • MattE
    MattE almost 7 years
    why wouldn't you just open it in excel and export to CSV?
  • Ali
    Ali almost 7 years
      Tested read.xlsx2, XLConnect, readxl, and openxlsx; openxlsx is multiple times faster than the others.
  • peer
    peer almost 6 years
    OpenXLSX has the disadvantage that it does not recognize dates. To me, read_excel from the package readxl seems like the way to go.
  • Abhishek
    Abhishek over 5 years
      If openxlsx also leads to the same error, increase the RAM size if you are working on data lakes that offer the option to change the configuration.
  • Simon Woodward
    Simon Woodward about 5 years
    This worked for me, but I also had to make sure my R version matched my Java version (e.g. both 64-bit), and set the Java path correctly: options(java.parameters="-Xmx4g") # increase java memory, Sys.setenv(JAVA_HOME='C:\\Program Files\\Java\\jdk-11.0.2') # for 64-bit version, library(rJava) # check it works