How can I prevent rbind() from getting really slow as the data frame grows larger?
Solution 1
You are in the 2nd circle of hell, namely failing to pre-allocate data structures.
Growing objects in this fashion is a Very Very Bad Thing in R. Either pre-allocate and insert:
df <- data.frame(x = rep(NA,20000),y = rep(NA,20000))
or restructure your code to avoid this sort of incremental addition of rows. As discussed at the link I cite, the reason for the slowness is that each time you add a row, R needs to find a new contiguous block of memory to fit the data frame in. Lots 'o copying.
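For instance, a minimal sketch of the pre-allocate-and-fill pattern (the x/y columns and the filled-in values are illustrative, not from the question):

```r
N <- 1000
## pre-allocate the full data frame once ...
df <- data.frame(x = rep(NA_real_, N), y = rep(NA_real_, N))
## ... then fill rows in place instead of growing with rbind()
for (i in 1:N) {
  df[i, ] <- c(i, i^2)
}
```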
Solution 2
I tried an example. For what it's worth, it agrees with the user's assertion that inserting rows into the data frame is also really slow. I don't quite understand what's going on, as I would have expected the allocation problem to trump the speed of copying. Can anyone either replicate this, or explain why the results below (rbind < appending < insertion) would be true in general, or explain why this is not a representative example (e.g. data frame too small)?
edit: the first time around I forgot to initialize the object in hell2fun to a data frame, so the code was doing matrix operations rather than data frame operations, which are much faster. If I get a chance I'll extend the comparison to data frame vs. matrix. The qualitative assertions in the first paragraph hold, though.
N <- 1000
set.seed(101)
r <- matrix(runif(2*N), ncol = 2)

## second circle of hell
hell2fun <- function() {
    df <- as.data.frame(rbind(r[1,]))  ## initialize
    for (i in 2:N) {
        df <- rbind(df, r[i,])
    }
}

insertfun <- function() {
    df <- data.frame(x = rep(NA, N), y = rep(NA, N))
    for (i in 1:N) {
        df[i,] <- r[i,]
    }
}

rsplit <- as.list(as.data.frame(t(r)))
rbindfun <- function() {
    do.call(rbind, rsplit)
}

library(rbenchmark)
benchmark(hell2fun(), insertfun(), rbindfun())
## test replications elapsed relative user.self
## 1 hell2fun() 100 32.439 484.164 31.778
## 2 insertfun() 100 45.486 678.896 42.978
## 3 rbindfun() 100 0.067 1.000 0.076
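As a rough follow-up to the data frame vs. matrix question raised in the edit above, here is a sketch of the same pre-allocated insertion done on a plain matrix, which avoids data-frame method dispatch and per-assignment column bookkeeping (untimed here; the expected speedup is my assumption, not a measured result):

```r
N <- 1000
set.seed(101)
r <- matrix(runif(2 * N), ncol = 2)

## same pre-allocated row insertion as insertfun(), but on a matrix
matinsertfun <- function() {
  m <- matrix(NA_real_, nrow = N, ncol = 2)
  for (i in 1:N) {
    m[i, ] <- r[i, ]
  }
  m
}
```

If the expectation holds, `system.time(matinsertfun())` should come in well below `insertfun()`.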
Comments
-
Mark almost 2 years
I have a data frame with only one row. To this I start to add rows using rbind:
df  ## my data frame with only one row
for (i in 1:20000) { df <- rbind(df, newrow) }
This gets very slow as i grows. Why is that? And how can I make this type of code faster?
-
Mark over 11 years Great, thanks for the tip.
-
Mark over 11 years So I pre-allocated the data frame and started inserting one-row data frames into it (df[j,] <- newrow). It seems to also be getting slow as the number of insertions grows. Have you seen this before?
-
joran over 11 years @Mark Yeah, like I said, this sort of thing is rather un-R-like. Modifying objects will still require a certain amount of copying of the entire object. Do you really want to copy the entire data frame each time you add a row? Probably not. Generate a list of each row using lapply and then stitch them together using do.call(rbind, ...). But beyond that, the solution requires more refactoring than I can help with given the information you've provided.
-
Mark over 11 yearsKudos. Thanks a lot. I have thought a lot about using apply here but the problem has such a weird shape my brain is incapable of comprehending a proper functional form for it :) thanks for the help
-
Ben Bolker over 11 years I'm a little surprised that the df[j,] <- newrow approach is also slow, and particularly that it would get slower later in the run. I can appreciate that it would require some data frame copying, but it should be orders of magnitude faster than the second-circle-of-hell approach ...
-
joran over 11 years@BenBolker Me too, but it's basically impossible to know what might be going on without more complete code.
-
mnel over 11 years I'd suggest using rsplit <- split(data.frame(r), seq_len(nrow(r))) for a fairer comparison, and assign within the function. Then use data.table::rbindlist: stackoverflow.com/a/12718498/1385941
-
Ben Bolker over 11 yearswill do when I get a chance (I don't understand the first sentence yet ... if I understand correctly, I don't think the matrix-splitting should be charged to the function -- I assumed that the rows would become available one by one within some sort of iterative procedure)
-
mnel over 11 years Assign the result (df <- do.call(rbind, rsplit)) within the function (on reading, my comment was unintelligible).
-
user1892410 almost 8 years @BenBolker I changed my code to use df[j,] <- newrow and the code is fast enough even after hundreds of thousands of rows in the data frame. The end result will be a data frame populated with about 2 million rows.
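A minimal sketch of the lapply/do.call pattern joran describes in the comments above (the make_row function is a hypothetical stand-in for whatever computation produces each row):

```r
## hypothetical per-row generator standing in for the real computation
make_row <- function(i) data.frame(x = i, y = i^2)

## build all rows as a list, then bind them in a single call
rows <- lapply(1:100, make_row)
df <- do.call(rbind, rows)
```

This copies the accumulated data frame once at the end, rather than once per added row.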
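And a sketch of mnel's data.table::rbindlist suggestion, which binds a list of rows in one call and is generally faster than do.call(rbind, ...) for long lists (assumes the data.table package is installed; the example rows are illustrative):

```r
library(data.table)

## bind a list of one-row data frames in a single efficient call
rows <- lapply(1:100, function(i) data.frame(x = i, y = i^2))
dt <- rbindlist(rows)
```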