Efficient apply or mapply for multiple matrix arguments by row

Solution 1

Splitting the matrices isn't the biggest contributor to evaluation time: doing the split ahead of time only cuts the run from about 1.58 to 1.14 seconds.

set.seed(21)
matrixA <- matrix(rnorm(5 * 9000), nrow = 9000)
matrixB <- matrix(rnorm(4 * 9000), nrow = 9000)

system.time( scores <- mapply(t.test.stat,
    split(matrixA, row(matrixA)), split(matrixB, row(matrixB))) )
#    user  system elapsed 
#    1.57    0.00    1.58 
smA <- split(matrixA, row(matrixA))
smB <- split(matrixB, row(matrixB))
system.time( scores <- mapply(t.test.stat, smA, smB) )
#    user  system elapsed 
#    1.14    0.00    1.14 

The output from Rprof shows that most of the time is, not surprisingly, spent evaluating t.test.stat (mean, var, etc.). In other words, calling the function once per row adds up to quite a bit of function-call overhead.

Rprof()
scores <- mapply(t.test.stat, smA, smB)
Rprof(NULL)
summaryRprof()
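
When reading the profile, the `by.self` table returned by `summaryRprof()` is usually the most telling part. A minimal sketch, assuming the `t.test.stat` function from the question; the `prof_file` name and the number of repeats are illustrative choices, not part of the original answer:

```r
# t.test.stat as defined in the question
t.test.stat <- function(x, y) {
  (mean(x) - mean(y)) / sqrt(var(x) / length(x) + var(y) / length(y))
}

set.seed(21)
matrixA <- matrix(rnorm(5 * 9000), nrow = 9000)
matrixB <- matrix(rnorm(4 * 9000), nrow = 9000)
smA <- split(matrixA, row(matrixA))
smB <- split(matrixB, row(matrixB))

prof_file <- tempfile(fileext = ".out")  # illustrative file name
Rprof(prof_file)
# repeat the call a few times so the sampler collects enough observations
for (k in 1:5) scores <- mapply(t.test.stat, smA, smB)
Rprof(NULL)

prof <- summaryRprof(prof_file)
head(prof$by.self)    # time spent inside each function itself
prof$sampling.time    # total profiled time in seconds
```

The exact numbers depend on the machine and R version, so none are shown here.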

You may be able to find faster generalized solutions, but none will approach the speed of the vectorized solution below.

Since your function is simple, you can take advantage of the vectorized rowMeans function to do this almost instantaneously (though it's a bit messy):

system.time({
ncA <- NCOL(matrixA)
ncB <- NCOL(matrixB)
ans <- (rowMeans(matrixA)-rowMeans(matrixB)) /
  sqrt( rowMeans((matrixA-rowMeans(matrixA))^2)*(ncA/(ncA-1))/ncA +
        rowMeans((matrixB-rowMeans(matrixB))^2)*(ncB/(ncB-1))/ncB )
})
#    user  system elapsed 
#      0       0       0 
head(ans)
# [1]  0.8272511 -1.0965269  0.9862844 -0.6026452 -0.2477661  1.1896181
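
The subtraction `matrixA - rowMeans(matrixA)` centers each row because R stores matrices column by column and recycles the vector of row means down every column. Multiplying the mean squared deviation by `n/(n-1)` then gives the usual sample variance. A small check on a made-up 3 x 4 matrix `m` (toy data, not from the original answer):

```r
set.seed(1)
m <- matrix(rnorm(12), nrow = 3)   # toy data: 3 rows, 4 columns
n <- NCOL(m)

centered <- m - rowMeans(m)        # row means are recycled down each column
row_var  <- rowMeans(centered^2) * (n / (n - 1))

# compare with the row-by-row computation
all.equal(rowSums(centered), rep(0, 3))   # every row now sums to (about) zero
all.equal(row_var, apply(m, 1, var))      # agrees with var() applied per row
```

In the t statistic each variance is further divided by `n`, so the combined factor `(n/(n-1))/n` is simply `1/(n-1)`.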

UPDATE
Here's a "cleaner" version using a rowVars function:

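# Sample variance of each row of a matrix.
# With na.rm = TRUE, each row uses its own count of non-missing values.
rowVars <- function(x, na.rm = FALSE) {
  n <- if (na.rm) rowSums(!is.na(x)) else NCOL(x)
  rowSums((x - rowMeans(x, na.rm = na.rm))^2, na.rm = na.rm) / (n - 1)
}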
ans <- (rowMeans(matrixA)-rowMeans(matrixB)) /
  sqrt( rowVars(matrixA)/NCOL(matrixA) + rowVars(matrixB)/NCOL(matrixB) )
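
As a sanity check, the vectorized result can be compared against the row-by-row `mapply` call and against `t.test()`, whose default (Welch, unequal variances) statistic uses the same formula. A sketch on smaller random matrices; `A`, `B`, and `row_var` are illustrative names chosen here:

```r
t.test.stat <- function(x, y) {
  (mean(x) - mean(y)) / sqrt(var(x) / length(x) + var(y) / length(y))
}
# sample variance of every row (no NA handling in this sketch)
row_var <- function(x) rowSums((x - rowMeans(x))^2) / (NCOL(x) - 1)

set.seed(21)
A <- matrix(rnorm(5 * 200), nrow = 200)
B <- matrix(rnorm(4 * 200), nrow = 200)

vectorized <- (rowMeans(A) - rowMeans(B)) /
  sqrt(row_var(A) / NCOL(A) + row_var(B) / NCOL(B))

by_row <- mapply(t.test.stat, split(A, row(A)), split(B, row(B)))
welch  <- sapply(seq_len(nrow(A)),
                 function(i) unname(t.test(A[i, ], B[i, ])$statistic))

all.equal(vectorized, unname(by_row))   # same as the mapply version
all.equal(vectorized, welch)            # same as t.test()'s statistic
```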

Solution 2

This solution avoids splitting the matrices into lists, so it may be faster than your version:

## original data:
tmp1 <- matrix(sample(1:100, 20), nrow = 5)
tmp2 <- matrix(sample(1:100, 20), nrow = 5)

## combine them together
tmp3 <- cbind(tmp1, tmp2)

## calculate t.stats:
t.stats <- apply(tmp3, 1, function(x) t.test(x[1:ncol(tmp1)], 
  x[(1 + ncol(tmp1)):ncol(tmp3)])$statistic)

Edit: Just tested it on two matrices of 9000 rows and 5 columns each, and it completed in less than 6 seconds:

tmp1 <- matrix(rnorm(5 * 9000), nrow = 9000)
tmp2 <- matrix(rnorm(5 * 9000), nrow = 9000)
tmp3 <- cbind(tmp1, tmp2)
system.time(t.st <- apply(tmp3, 1, function(x) t.test(x[1:5], x[6:10])$statistic))

#    user  system elapsed 
#   5.640   0.012   5.705 
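
If a general function has to be applied to matching rows, indexing by row number avoids both `split()` and `cbind()`. This is an alternative sketch rather than either answer's method: `row_apply2` is a hypothetical helper name, and it assumes the function returns a single number per pair of rows. It still calls the function once per row, so it should not be expected to approach the vectorized version:

```r
# Hypothetical helper: apply f to row i of a and row i of b, for every i
row_apply2 <- function(a, b, f) {
  stopifnot(nrow(a) == nrow(b))
  vapply(seq_len(nrow(a)), function(i) f(a[i, ], b[i, ]), numeric(1))
}

# t.test.stat as defined in the question
t.test.stat <- function(x, y) {
  (mean(x) - mean(y)) / sqrt(var(x) / length(x) + var(y) / length(y))
}

set.seed(21)
tmp1 <- matrix(rnorm(5 * 9000), nrow = 9000)
tmp2 <- matrix(rnorm(4 * 9000), nrow = 9000)
scores <- row_apply2(tmp1, tmp2, t.test.stat)
head(scores)
```

Using `vapply` with `numeric(1)` makes the expected return type explicit, so a function that returns something else fails early instead of silently producing a list.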

Author: Edd

I'm a software engineer on the Android Kernel Release team at Google

Updated on April 13, 2020

Comments

  • Edd, about 4 years ago (original question)

    I have two matrices that I want to apply a function to, by rows:

    matrixA
               GSM83009  GSM83037  GSM83002  GSM83029  GSM83041
    100001_at  5.873321  5.416164  3.512227  6.064150  3.713696
    100005_at  5.807870  6.810829  6.105804  6.644000  6.142413
    100006_at  2.757023  4.144046  1.622930  1.831877  3.694880
    
    matrixB
              GSM82939 GSM82940 GSM82974 GSM82975
    100001_at 3.673556 2.372952 3.228049 3.555816
    100005_at 6.916954 6.909533 6.928252 7.003377
    100006_at 4.277985 4.856986 3.670161 4.075533
    

    I've found several similar questions, but not many answers: "mapply for matrices" and "Multi matrix row-wise mapply?". The code I have now splits the matrices by row into lists, but having to split them makes it rather slow and not much faster than a for loop, considering I have almost 9000 rows in each matrix:

    scores <- mapply(t.test.stat, split(matrixA, row(matrixA)), split(matrixB, row(matrixB)))
    

    The function itself is very simple, just finding the t-value:

    t.test.stat <- function(x, y)
    {
        return( (mean(x) - mean(y)) / sqrt(var(x)/length(x) + var(y)/length(y)) )
    }
    
  • Joris Meys, about 13 years ago
    +1 for showing the function t.test (although it's far from the fastest)
  • Joris Meys, about 13 years ago
    that's a clean vectorization. How so, messy? ;)
  • Joshua Ulrich, about 13 years ago
    @Joris: messy in terms of many lines of code. I guess a rowVars function would clean it up.
  • Edd, about 13 years ago
    Ooh sneaky, I like it. Tyler's solution is more generalizable but unfortunately also slower :/. I wonder if there is an efficient general solution.
  • Joshua Ulrich, about 13 years ago
    This solution is ~1.5x slower than the OP's solution.
  • Joshua Ulrich, about 13 years ago
    @Edd: the most efficient general solution would require a very efficient t.test.stat function. Even then, you would have to evaluate it 9000 times, which would still be much slower than my less general solution. Opportunity cost strikes again! ;-)
