How to implement coalesce efficiently in R
Solution 1
On my machine, using Reduce gets a 5x performance improvement:
coalesce2 <- function(...) {
  Reduce(function(x, y) {
    i <- which(is.na(x))
    x[i] <- y[i]
    x
  },
  list(...))
}
> microbenchmark(coalesce(a,b,c),coalesce2(a,b,c))
Unit: microseconds
expr min lq median uq max neval
coalesce(a, b, c) 97.669 100.7950 102.0120 103.0505 243.438 100
coalesce2(a, b, c) 19.601 21.4055 22.8835 23.8315 45.419 100
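Applied to the example vectors from the question, coalesce2 gives the same result as the naive version:
a <- c(1, 2, NA, 4, NA)
b <- c(NA, NA, NA, 5, 6)
c <- c(7, 8, NA, 9, 10)
coalesce2(a, b, c)
# [1] 1 2 NA 4 6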
Solution 2
Looks like coalesce1 is still available
coalesce1 <- function(...) {
  ans <- ..1
  for (elt in list(...)[-1]) {
    i <- is.na(ans)
    ans[i] <- elt[i]
  }
  ans
}
which is faster still (but more or less a hand re-write of Reduce, so less general).
> identical(coalesce(a, b, c), coalesce1(a, b, c))
[1] TRUE
> microbenchmark(coalesce(a,b,c), coalesce1(a, b, c), coalesce2(a,b,c))
Unit: microseconds
expr min lq median uq max neval
coalesce(a, b, c) 336.266 341.6385 344.7320 355.4935 538.348 100
coalesce1(a, b, c) 8.287 9.4110 10.9515 12.1295 20.940 100
coalesce2(a, b, c) 37.711 40.1615 42.0885 45.1705 67.258 100
Or, for larger data, compare
coalesce1a <- function(...) {
  ans <- ..1
  for (elt in list(...)[-1]) {
    i <- which(is.na(ans))
    ans[i] <- elt[i]
  }
  ans
}
showing that which() can sometimes be effective, even though it implies a second pass through the index.
> aa <- sample(a, 100000, TRUE)
> bb <- sample(b, 100000, TRUE)
> cc <- sample(c, 100000, TRUE)
> microbenchmark(coalesce1(aa, bb, cc),
+ coalesce1a(aa, bb, cc),
+ coalesce2(aa,bb,cc), times=10)
Unit: milliseconds
expr min lq median uq max neval
coalesce1(aa, bb, cc) 11.110024 11.137963 11.145723 11.212907 11.270533 10
coalesce1a(aa, bb, cc) 2.906067 2.953266 2.962729 2.971761 3.452251 10
coalesce2(aa, bb, cc) 3.080842 3.115607 3.139484 3.166642 3.198977 10
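The gap between coalesce1 and coalesce1a comes down to how the NA positions are indexed. A minimal illustration of the two indexing forms (my own sketch, not part of the original answer):
ans <- c(NA, 1, NA, 3)
elt <- c(9, 9, 9, 9)

i_logical <- is.na(ans)          # length-n logical mask, as in coalesce1
i_integer <- which(is.na(ans))   # short integer index, as in coalesce1a

ans[i_integer] <- elt[i_integer] # same replacement either way; the integer
                                 # index can be cheaper when NAs are sparse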
Solution 3
From data.table >= 1.12.3, you can use fcoalesce.
library(data.table)
fcoalesce(a, b, c)
# [1] 1 2 NA 4 6
fcoalesce can also take "a single plain list, data.table or data.frame". Thus, if the vectors above were columns in a data.frame (or a data.table), we could simply supply the name of the data set:
d = data.frame(a, b, c)
# or d = data.table(a, b, c)
fcoalesce(d)
# [1] 1 2 NA 4 6
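Per the same documentation, passing a single plain list of the vectors should work the same way (an extra illustration, not part of the original answer):
fcoalesce(list(a, b, c))
# [1] 1 2 NA 4 6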
For more info, including a benchmark, see NEWS item #18 for development version 1.12.3.
Solution 4
Using the dplyr package:
library(dplyr)
coalesce(a, b, c)
# [1] 1 2 NA 4 6
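dplyr's coalesce can also recycle a length-one default supplied as the last argument, which is handy as a catch-all (an extra illustration, not part of the original answer):
coalesce(a, b, c, 0)
# [1] 1 2 0 4 6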
Benchmark; it is not as fast as the accepted solution:
coalesce2 <- function(...) {
  Reduce(function(x, y) {
    i <- which(is.na(x))
    x[i] <- y[i]
    x
  },
  list(...))
}
microbenchmark::microbenchmark(
  coalesce(a, b, c),
  coalesce2(a, b, c)
)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# coalesce(a, b, c) 21.951 24.518 27.28264 25.515 26.9405 126.293 100 b
# coalesce2(a, b, c) 7.127 8.553 9.68731 9.123 9.6930 27.368 100 a
But on a larger dataset, it is comparable:
aa <- sample(a, 100000, TRUE)
bb <- sample(b, 100000, TRUE)
cc <- sample(c, 100000, TRUE)
microbenchmark::microbenchmark(
  coalesce(aa, bb, cc),
  coalesce2(aa, bb, cc))
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# coalesce(aa, bb, cc) 1.708511 1.837368 5.468123 3.268492 3.511241 96.99766 100 a
# coalesce2(aa, bb, cc) 1.474171 1.516506 3.312153 1.957104 3.253240 91.05223 100 a
Solution 5
I have a ready-to-use implementation called coalesce.na in my misc package. It seems to be competitive, but not the fastest. It will also work for vectors of different lengths, and has special treatment for vectors of length one:
Unit: milliseconds
expr min lq median uq max neval
coalesce(aa, bb, cc) 990.060402 1030.708466 1067.000698 1083.301986 1280.734389 10
coalesce1(aa, bb, cc) 11.356584 11.448455 11.804239 12.507659 14.922052 10
coalesce1a(aa, bb, cc) 2.739395 2.786594 2.852942 3.312728 5.529927 10
coalesce2(aa, bb, cc) 2.929364 3.041345 3.593424 3.868032 7.838552 10
coalesce.na(aa, bb, cc) 4.640552 4.691107 4.858385 4.973895 5.676463 10
Here's the code:
coalesce.na <- function(x, ...) {
  x.len <- length(x)
  ly <- list(...)
  for (y in ly) {
    y.len <- length(y)
    if (y.len == 1) {
      x[is.na(x)] <- y
    } else {
      if (x.len %% y.len != 0)
        warning('object length is not a multiple of first object length')
      pos <- which(is.na(x))
      x[pos] <- y[(pos - 1) %% y.len + 1]
    }
  }
  x
}
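A couple of calls illustrating the length-one and recycling behaviour described above (my own examples, not from the original answer):
coalesce.na(c(NA, 2, NA, 4), 0)
# [1] 0 2 0 4

coalesce.na(c(NA, NA, NA, NA), c(9, 8))
# [1] 9 8 9 8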
Of course, as Kevin pointed out, an Rcpp solution might be faster by orders of magnitude.
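For completeness, a minimal sketch of what such an Rcpp version might look like (my own illustration; coalesce_cpp is a hypothetical helper, not code from any of the answers):
library(Rcpp)

cppFunction('
NumericVector coalesce_cpp(NumericVector x, NumericVector y) {
  // pairwise coalesce of two numeric vectors
  int n = x.size();
  NumericVector out = clone(x);
  for (int i = 0; i < n; ++i) {
    if (NumericVector::is_na(out[i])) out[i] = y[i];
  }
  return out;
}')

Reduce(coalesce_cpp, list(a, b, c))
# [1] 1 2 NA 4 6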
Updated on February 25, 2021
Comments
-
while over 3 years
Background
Several SQL languages (I mostly use PostgreSQL) have a function called coalesce, which returns the first non-null column element for each row. This can be very efficient when tables have a lot of NULL elements in them. I encounter this in a lot of scenarios in R as well, when dealing with less structured data that has a lot of NAs in it.
I have made a naive implementation myself, but it is ridiculously slow.
coalesce <- function(...) {
  apply(cbind(...), 1, function(x) {
    x[which(!is.na(x))[1]]
  })
}
Example
a <- c(1, 2, NA, 4, NA)
b <- c(NA, NA, NA, 5, 6)
c <- c(7, 8, NA, 9, 10)
coalesce(a, b, c)
# [1] 1 2 NA 4 6
Question
Is there any efficient way to implement coalesce in R?
-
while over 10 years
Thanks! This is tremendously faster! Second time today Reduce saves my ass. Tried to do something like this myself but I couldn't figure it out. Awesome.
-
mrip over 10 years
Interesting that it's so much faster than Reduce. My guess would be that it's because the loop version avoids the overhead of a function call (or possibly some parameter checking), which might be a bottleneck when the inputs are small. That would explain why the times are much closer with large inputs.
-
agstudy over 10 years
@MartinMorgan +1! Maybe you should add a test on the number of arguments, something like `if (length(list(...)) > 0) { }`, to avoid an error when you call coalesce1() without arguments.
-
Kevin Jin almost 9 years
Works great if the vectors passed are the same length. I had problems with this because I was trying coalesce2(NULL, c(3, 2)), so is.na() gave me warnings, whereas coalesce() just repeats NA twice for the first column.
-
while almost 9 years
Sure. But this does not work for more than two vectors. What if you have an arbitrary number of vectors?
-
sdgfsdh almost 9 years
Yes, that is a limitation. A quick fix is to use Reduce.
-
r2evans over 5 years
This fails if either is NULL instead of just NA. Recommend use of is.null.
-
Ufos about 5 years
Interestingly, an explicit if-else does not convert dates to numbers, but ifelse() does. So your function on dates does not work: coalesce4(NULL, lubridate::ymd('2019-05-01')) returns 18017.
-
Gregor Thomas over 4 years
Turning this into a one-row-at-a-time operation is pretty inefficient, and the syntax seems weird to me.