How to implement coalesce efficiently in R


Solution 1

On my machine, using Reduce gives roughly a 5x performance improvement:

coalesce2 <- function(...) {
  Reduce(function(x, y) {
    i <- which(is.na(x))  # positions still missing in x
    x[i] <- y[i]          # fill them from the next vector
    x
  },
  list(...))
}
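
As a quick sanity check (using the example vectors a, b, and c from the question at the bottom of this page), coalesce2 matches the naive version:

coalesce2(a, b, c)
# [1]  1  2 NA  4  6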

> microbenchmark(coalesce(a,b,c),coalesce2(a,b,c))
Unit: microseconds
               expr    min       lq   median       uq     max neval
  coalesce(a, b, c) 97.669 100.7950 102.0120 103.0505 243.438   100
 coalesce2(a, b, c) 19.601  21.4055  22.8835  23.8315  45.419   100

Solution 2

It looks like coalesce1 is still available:

coalesce1 <- function(...) {
    ans <- ..1
    for (elt in list(...)[-1]) {
        i <- is.na(ans)
        ans[i] <- elt[i]
    }
    ans
}

which is faster still (though it is more or less a hand-written version of Reduce, so less general):

> identical(coalesce(a, b, c), coalesce1(a, b, c))
[1] TRUE
> microbenchmark(coalesce(a,b,c), coalesce1(a, b, c), coalesce2(a,b,c))
Unit: microseconds
               expr     min       lq   median       uq     max neval
  coalesce(a, b, c) 336.266 341.6385 344.7320 355.4935 538.348   100
 coalesce1(a, b, c)   8.287   9.4110  10.9515  12.1295  20.940   100
 coalesce2(a, b, c)  37.711  40.1615  42.0885  45.1705  67.258   100

Or, for larger data, compare:

coalesce1a <- function(...) {
    ans <- ..1
    for (elt in list(...)[-1]) {
        i <- which(is.na(ans))
        ans[i] <- elt[i]
    }
    ans
}

showing that which() can sometimes be effective, even though it implies a second pass through the index.

> aa <- sample(a, 100000, TRUE)
> bb <- sample(b, 100000, TRUE)
> cc <- sample(c, 100000, TRUE)
> microbenchmark(coalesce1(aa, bb, cc),
+                coalesce1a(aa, bb, cc),
+                coalesce2(aa,bb,cc), times=10)
Unit: milliseconds
                   expr       min        lq    median        uq       max neval
  coalesce1(aa, bb, cc) 11.110024 11.137963 11.145723 11.212907 11.270533    10
 coalesce1a(aa, bb, cc)  2.906067  2.953266  2.962729  2.971761  3.452251    10
  coalesce2(aa, bb, cc)  3.080842  3.115607  3.139484  3.166642  3.198977    10

Solution 3

From data.table >= 1.12.3 you can use fcoalesce.

library(data.table)
fcoalesce(a, b, c)
# [1]  1  2 NA  4  6

fcoalesce can also take "a single plain list, data.table or data.frame". Thus, if the vectors above were columns in a data.frame (or a data.table), we could simply supply the name of the data set:

d = data.frame(a, b, c)
# or d = data.table(a, b, c) 
fcoalesce(d)
# [1]  1  2 NA  4  6

For more info, including a benchmark, see NEWS item #18 for development version 1.12.3.

Solution 4

Using dplyr package:

library(dplyr)
coalesce(a, b, c)
# [1]  1  2 NA  4  6

A benchmark shows it is not as fast as the accepted solution:

coalesce2 <- function(...) {
  Reduce(function(x, y) {
    i <- which(is.na(x))
    x[i] <- y[i]
    x
  },
  list(...))
}

microbenchmark::microbenchmark(
  coalesce(a, b, c),
  coalesce2(a, b, c)
)

# Unit: microseconds
#                expr    min     lq     mean median      uq     max neval cld
#   coalesce(a, b, c) 21.951 24.518 27.28264 25.515 26.9405 126.293   100   b
#  coalesce2(a, b, c)  7.127  8.553  9.68731  9.123  9.6930  27.368   100  a 

But on a larger dataset, it is comparable:

aa <- sample(a, 100000, TRUE)
bb <- sample(b, 100000, TRUE)
cc <- sample(c, 100000, TRUE)

microbenchmark::microbenchmark(
  coalesce(aa, bb, cc),
  coalesce2(aa, bb, cc))

# Unit: milliseconds
#                   expr      min       lq     mean   median       uq      max neval cld
#   coalesce(aa, bb, cc) 1.708511 1.837368 5.468123 3.268492 3.511241 96.99766   100   a
#  coalesce2(aa, bb, cc) 1.474171 1.516506 3.312153 1.957104 3.253240 91.05223   100   a

Solution 5

I have a ready-to-use implementation called coalesce.na in my misc package. It seems to be competitive, but not the fastest. It also works for vectors of different lengths, and has special treatment for vectors of length one:

Unit: milliseconds
                    expr        min          lq      median          uq         max neval
    coalesce(aa, bb, cc) 990.060402 1030.708466 1067.000698 1083.301986 1280.734389    10
   coalesce1(aa, bb, cc)  11.356584   11.448455   11.804239   12.507659   14.922052    10
  coalesce1a(aa, bb, cc)   2.739395    2.786594    2.852942    3.312728    5.529927    10
   coalesce2(aa, bb, cc)   2.929364    3.041345    3.593424    3.868032    7.838552    10
 coalesce.na(aa, bb, cc)   4.640552    4.691107    4.858385    4.973895    5.676463    10

Here's the code:

coalesce.na <- function(x, ...) {
  x.len <- length(x)
  ly <- list(...)
  for (y in ly) {
    y.len <- length(y)
    if (y.len == 1) {
      # a length-one vector acts as a scalar default
      x[is.na(x)] <- y
    } else {
      if (x.len %% y.len != 0)
        warning('object length is not a multiple of first object length')
      pos <- which(is.na(x))
      # recycle y over the missing positions
      x[pos] <- y[(pos - 1) %% y.len + 1]
    }
  }
  x
}
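
For example (my own illustration, not from the original answer), the length-one and recycling behavior look like this:

coalesce.na(c(NA, 2, NA), 0)             # a length-one default fills every NA
# [1] 0 2 0
coalesce.na(c(1, NA, NA, 4), c(10, 20))  # the shorter vector is recycled over the NA positions
# [1]  1 20 10  4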

Of course, as Kevin pointed out, an Rcpp solution might be faster by orders of magnitude.
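
To give a flavor of what that could look like, here is a minimal Rcpp sketch of my own (not Kevin's actual code); it assumes the Rcpp package is installed and only handles equal-length numeric vectors:

library(Rcpp)
cppFunction('
NumericVector coalesce_cpp(List args) {
  // start from a copy of the first vector
  NumericVector out = clone(as<NumericVector>(args[0]));
  for (int k = 1; k < args.size(); ++k) {
    NumericVector y = args[k];
    // single pass: fill any remaining NAs from the next vector
    for (R_xlen_t i = 0; i < out.size(); ++i) {
      if (NumericVector::is_na(out[i])) out[i] = y[i];
    }
  }
  return out;
}')

coalesce_cpp(list(a, b, c))
# [1]  1  2 NA  4  6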


Comments

  • while, over 3 years ago

    Background

    Several SQL dialects (I mostly use PostgreSQL) have a function called coalesce, which returns the first non-NULL column element for each row. This can be very efficient when tables have a lot of NULL elements in them.

    I encounter this in a lot of scenarios in R as well, when dealing with less structured data that has a lot of NAs in it.

    I have made a naive implementation myself, but it is ridiculously slow.

    coalesce <- function(...) {
      # bind the vectors into a matrix and take the first non-NA value in each row
      apply(cbind(...), 1, function(x) {
        x[which(!is.na(x))[1]]
      })
    }
    

    Example

    a <- c(1,  2,  NA, 4, NA)
    b <- c(NA, NA, NA, 5, 6)
    c <- c(7,  8,  NA, 9, 10)
    coalesce(a,b,c)
    # [1]  1  2 NA  4  6
    

    Question

    Is there any efficient way to implement coalesce in R?

  • while, over 10 years ago
    Thanks! This is tremendously faster! Second time today Reduce saves my ass. Tried to do something like this myself but I couldn't figure it out. Awesome.
  • mrip, over 10 years ago
    Interesting that it's so much faster than Reduce. My guess would be that it's because the loop version avoids the overhead of a function call (or possibly some parameter checking), which might be a bottleneck when the inputs are small. That would explain why the times are much closer with large inputs.
  • agstudy, over 10 years ago
    @MartinMorgan +1! Maybe you should add a test on the number of arguments, something like `if (length(list(...)) > 0) {}`, to avoid an error when you call coalesce1() without arguments.
  • Kevin Jin, almost 9 years ago
    Works great if the vectors passed are the same length. I had problems because I was trying coalesce2(NULL, c(3, 2)), so is.na() gave me warnings, whereas coalesce() just repeats NA twice for the first column.
  • while, almost 9 years ago
    Sure. But this does not work for more than two vectors. What if you have an arbitrary number of vectors?
  • sdgfsdh, almost 9 years ago
    Yes, that is a limitation. A quick fix is to use Reduce.
  • r2evans, over 5 years ago
    This fails if either is NULL instead of just NA. Recommend use of is.null.
  • Ufos, about 5 years ago
    Interestingly, an explicit if-else does not convert dates to numbers, but ifelse() does. So your function does not work on dates: coalesce4(NULL, lubridate::ymd('2019-05-01')) returns 18017.
  • Gregor Thomas, over 4 years ago
    Turning this into a one-row-at-a-time operation is pretty inefficient, and the syntax seems weird to me.