Read multiple xlsx files with multiple sheets into one R data frame
Solution 1
I would use a nested loop like this to go through each sheet of each file. It might not be the fastest solution but it is the simplest.
require(xlsx)
file.list <- list.files(recursive=T,pattern='*.xlsx') #get files list from folder
for (i in 1:length(files.list)){
wb <- loadWorkbook(files.list[i]) #select a file & load workbook
sheet <- getSheets(wb) #get sheet list
for (j in 1:length(sheet)){
tmp<-read.xlsx(files.list[i], sheetIndex=j, colIndex= c(1:6,8:10,12:17,19),
sheetName=NULL, startRow=4, endRow=NULL,
as.data.frame=TRUE, header=F)
if (i==1&j==1) dataset<-tmp else dataset<-rbind(dataset,tmp) #happend to previous
}
}
You can clean NA
values after the loading phase.
Solution 2
openxlsx solution:
filename <-"myFilePath"
sheets <- openxlsx::getSheetNames(filename)
SheetList <- lapply(sheets,openxlsx::read.xlsx,xlsxFile=filename)
names(SheetList) <- sheets
Solution 3
Here's a tidyverse
and readxl
driven option that returns a data frame with columns for file and sheet names for each file.
In this example, not every file has the same sheets or columns; test2.xlsx has only one sheet and test3.xlsx sheet1 does not have col3.
library(tidyverse)
library(readxl)
dir_path <- "~/test_dir/" # target directory path where the xlsx files are located.
re_file <- "^test[0-9]\\.xlsx" # regex pattern to match the file name format, in this case 'test1.xlsx', 'test2.xlsx' etc, but could simply be 'xlsx'.
read_sheets <- function(dir_path, file){
xlsx_file <- paste0(dir_path, file)
xlsx_file %>%
excel_sheets() %>%
set_names() %>%
map_df(read_excel, path = xlsx_file, .id = 'sheet_name') %>%
mutate(file_name = file) %>%
select(file_name, sheet_name, everything())
}
df <- list.files(dir_path, re_file) %>%
map_df(~ read_sheets(dir_path, .))
# A tibble: 15 x 5
file_name sheet_name col1 col2 col3
<chr> <chr> <dbl> <dbl> <dbl>
1 test1.xlsx Sheet1 1 2 4
2 test1.xlsx Sheet1 3 2 3
3 test1.xlsx Sheet1 2 4 4
4 test1.xlsx Sheet2 3 3 1
5 test1.xlsx Sheet2 2 2 2
6 test1.xlsx Sheet2 4 3 4
7 test2.xlsx Sheet1 1 3 5
8 test2.xlsx Sheet1 4 4 3
9 test2.xlsx Sheet1 1 2 2
10 test3.xlsx Sheet1 3 9 NA
11 test3.xlsx Sheet1 4 7 NA
12 test3.xlsx Sheet1 5 3 NA
13 test3.xlsx Sheet2 1 3 4
14 test3.xlsx Sheet2 2 5 9
15 test3.xlsx Sheet2 4 3 1
Elisah
Updated on June 12, 2022Comments
-
Elisah almost 2 years
I have been reading up on how to read and combine multiple xlsx files into one R data frame and have come across some very good suggestions like, How to read multiple xlsx file in R using loop with specific rows and columns, but non fits my data set so far.
I would like R to read in multiple xlsx files with that have multiple sheets. All sheets and files have the same columns but not the same length and NA's should be excluded. I want to skip the first 3 rows and only take in columns 1:6, 8:10, 12:17, 19.
So far I tried:
file.list <- list.files(recursive=T,pattern='*.xlsx') dat = lapply(file.list, function(i){ x = read.xlsx(i, sheetIndex=1, sheetName=NULL, startRow=4, endRow=NULL, as.data.frame=TRUE, header=F) # Column select x = x[, c(1:6,8:10,12:17,19)] # Create column with file name x$file = i # Return data x }) dat = do.call("rbind.data.frame", dat)
But this only takes all the first sheet of every file
Does anyone know how to get all the sheets and files together in one R data frame?
Also, what packages would you recommend for large sets of data? So far I tried readxl and XLConnect.
-
Elisah almost 8 yearsThanks very much! It's doesn't work for me quit well yet. I'm getting a warning: Error in charToDate(x) : character string is not in a standard unambiguous format
-
GPierre almost 8 yearsThis sounds like an error you would get after loading everything, when formatting your date variable. With that error, you usually have to manipulate your date variable to get it to a standard format. See this post.
-
GPierre almost 8 yearsIt could also come from R trying to guess the class of your columns and wrongly assume it is a date (see
? read.xlsx
). In that case, I would suggest to use thecolClasses
argument to force all columns to be loaded as strings. And clean the data after everything is loaded (give correct classes, NA..) . -
Roman almost 8 yearsInstead of using two loops you could replace the second by a
sapply
function such asdo.call("rbind", sapply(sheet, function(x, y) read.xlsx(y, x), files.list[i]))
-
SeanM over 6 yearsI think this is a faster and more elegant solution!! Thanks!
-
coip about 4 yearsShouldn't
files.list
in your for-loop byfile.list
instead?