Grouping functions (tapply, by, aggregate) and the *apply family
Solution 1
R has many *apply functions which are ably described in the help files (e.g. ?apply
). There are enough of them, though, that beginning useRs may have difficulty deciding which one is appropriate for their situation or even remembering them all. They may have a general sense that "I should be using an *apply function here", but it can be tough to keep them all straight at first.
Despite the fact (noted in other answers) that much of the functionality of the *apply family is covered by the extremely popular plyr
package, the base functions remain useful and worth knowing.
This answer is intended to act as a sort of signpost for new useRs to help direct them to the correct *apply function for their particular problem. Note, this is not intended to simply regurgitate or replace the R documentation! The hope is that this answer helps you to decide which *apply function suits your situation and then it is up to you to research it further. With one exception, performance differences will not be addressed.
-
apply - When you want to apply a function to the rows or columns of a matrix (and higher-dimensional analogues); not generally advisable for data frames as it will coerce to a matrix first.
# Two dimensional matrix M <- matrix(seq(1,16), 4, 4) # apply min to rows apply(M, 1, min) [1] 1 2 3 4 # apply max to columns apply(M, 2, max) [1] 4 8 12 16 # 3 dimensional array M <- array( seq(32), dim = c(4,4,2)) # Apply sum across each M[*, , ] - i.e Sum across 2nd and 3rd dimension apply(M, 1, sum) # Result is one-dimensional [1] 120 128 136 144 # Apply sum across each M[*, *, ] - i.e Sum across 3rd dimension apply(M, c(1,2), sum) # Result is two-dimensional [,1] [,2] [,3] [,4] [1,] 18 26 34 42 [2,] 20 28 36 44 [3,] 22 30 38 46 [4,] 24 32 40 48
If you want row/column means or sums for a 2D matrix, be sure to investigate the highly optimized, lightning-quick
colMeans
,rowMeans
,colSums
,rowSums
. -
lapply - When you want to apply a function to each element of a list in turn and get a list back.
This is the workhorse of many of the other *apply functions. Peel back their code and you will often find
lapply
underneath.x <- list(a = 1, b = 1:3, c = 10:100) lapply(x, FUN = length) $a [1] 1 $b [1] 3 $c [1] 91 lapply(x, FUN = sum) $a [1] 1 $b [1] 6 $c [1] 5005
-
sapply - When you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list.
If you find yourself typing
unlist(lapply(...))
, stop and considersapply
.x <- list(a = 1, b = 1:3, c = 10:100) # Compare with above; a named vector, not a list sapply(x, FUN = length) a b c 1 3 91 sapply(x, FUN = sum) a b c 1 6 5005
In more advanced uses of
sapply
it will attempt to coerce the result to a multi-dimensional array, if appropriate. For example, if our function returns vectors of the same length,sapply
will use them as columns of a matrix:sapply(1:5,function(x) rnorm(3,x))
If our function returns a 2 dimensional matrix,
sapply
will do essentially the same thing, treating each returned matrix as a single long vector:sapply(1:5,function(x) matrix(x,2,2))
Unless we specify
simplify = "array"
, in which case it will use the individual matrices to build a multi-dimensional array:sapply(1:5,function(x) matrix(x,2,2), simplify = "array")
Each of these behaviors is of course contingent on our function returning vectors or matrices of the same length or dimension.
-
vapply - When you want to use
sapply
but perhaps need to squeeze some more speed out of your code or want more type safety.For
vapply
, you basically give R an example of what sort of thing your function will return, which can save some time coercing returned values to fit in a single atomic vector.x <- list(a = 1, b = 1:3, c = 10:100) #Note that since the advantage here is mainly speed, this # example is only for illustration. We're telling R that # everything returned by length() should be an integer of # length 1. vapply(x, FUN = length, FUN.VALUE = 0L) a b c 1 3 91
-
mapply - For when you have several data structures (e.g. vectors, lists) and you want to apply a function to the 1st elements of each, and then the 2nd elements of each, etc., coercing the result to a vector/array as in
sapply
.This is multivariate in the sense that your function must accept multiple arguments.
#Sums the 1st elements, the 2nd elements, etc. mapply(sum, 1:5, 1:5, 1:5) [1] 3 6 9 12 15 #To do rep(1,4), rep(2,3), etc. mapply(rep, 1:4, 4:1) [[1]] [1] 1 1 1 1 [[2]] [1] 2 2 2 [[3]] [1] 3 3 [[4]] [1] 4
-
Map - A wrapper to
mapply
withSIMPLIFY = FALSE
, so it is guaranteed to return a list.Map(sum, 1:5, 1:5, 1:5) [[1]] [1] 3 [[2]] [1] 6 [[3]] [1] 9 [[4]] [1] 12 [[5]] [1] 15
-
rapply - For when you want to apply a function to each element of a nested list structure, recursively.
To give you some idea of how uncommon
rapply
is, I forgot about it when first posting this answer! Obviously, I'm sure many people use it, but YMMV.rapply
is best illustrated with a user-defined function to apply:# Append ! to string, otherwise increment myFun <- function(x){ if(is.character(x)){ return(paste(x,"!",sep="")) } else{ return(x + 1) } } #A nested list structure l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"), b = 3, c = "Yikes", d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5))) # Result is named vector, coerced to character rapply(l, myFun) # Result is a nested list like l, with values altered rapply(l, myFun, how="replace")
-
tapply - For when you want to apply a function to subsets of a vector and the subsets are defined by some other vector, usually a factor.
The black sheep of the *apply family, of sorts. The help file's use of the phrase "ragged array" can be a bit confusing, but it is actually quite simple.
A vector:
x <- 1:20
A factor (of the same length!) defining groups:
y <- factor(rep(letters[1:5], each = 4))
Add up the values in
x
within each subgroup defined byy
:tapply(x, y, sum) a b c d e 10 26 42 58 74
More complex examples can be handled where the subgroups are defined by the unique combinations of a list of several factors.
tapply
is similar in spirit to the split-apply-combine functions that are common in R (aggregate
,by
,ave
,ddply
, etc.) Hence its black sheep status.
Solution 2
On the side note, here is how the various plyr
functions correspond to the base *apply
functions (from the intro to plyr document from the plyr webpage http://had.co.nz/plyr/)
Base function Input Output plyr function
---------------------------------------
aggregate d d ddply + colwise
apply a a/l aaply / alply
by d l dlply
lapply l l llply
mapply a a/l maply / mlply
replicate r a/l raply / rlply
sapply l a laply
One of the goals of plyr
is to provide consistent naming conventions for each of the functions, encoding the input and output data types in the function name. It also provides consistency in output, in that output from dlply()
is easily passable to ldply()
to produce useful output, etc.
Conceptually, learning plyr
is no more difficult than understanding the base *apply
functions.
plyr
and reshape
functions have replaced almost all of these functions in my every day use. But, also from the Intro to Plyr document:
Related functions
tapply
andsweep
have no corresponding function inplyr
, and remain useful.merge
is useful for combining summaries with the original data.
Solution 3
From slide 21 of http://www.slideshare.net/hadley/plyr-one-data-analytic-strategy:
(Hopefully it's clear that apply
corresponds to @Hadley's aaply
and aggregate
corresponds to @Hadley's ddply
etc. Slide 20 of the same slideshare will clarify if you don't get it from this image.)
(on the left is input, on the top is output)
Solution 4
First start with Joran's excellent answer -- doubtful anything can better that.
Then the following mnemonics may help to remember the distinctions between each. Whilst some are obvious, others may be less so --- for these you'll find justification in Joran's discussions.
Mnemonics
-
lapply
is a list apply which acts on a list or vector and returns a list. -
sapply
is a simplelapply
(function defaults to returning a vector or matrix when possible) -
vapply
is a verified apply (allows the return object type to be prespecified) -
rapply
is a recursive apply for nested lists, i.e. lists within lists -
tapply
is a tagged apply where the tags identify the subsets -
apply
is generic: applies a function to a matrix's rows or columns (or, more generally, to dimensions of an array)
Building the Right Background
If using the apply
family still feels a bit alien to you, then it might be that you're missing a key point of view.
These two articles can help. They provide the necessary background to motivate the functional programming techniques that are being provided by the apply
family of functions.
Users of Lisp will recognise the paradigm immediately. If you're not familiar with Lisp, once you get your head around FP, you'll have gained a powerful point of view for use in R -- and apply
will make a lot more sense.
- Advanced R: Functional Programming, by Hadley Wickham
- Simple Functional Programming in R, by Michael Barton
Solution 5
Since I realized that (the very excellent) answers of this post lack of by
and aggregate
explanations. Here is my contribution.
BY
The by
function, as stated in the documentation can be though, as a "wrapper" for tapply
. The power of by
arises when we want to compute a task that tapply
can't handle. One example is this code:
ct <- tapply(iris$Sepal.Width , iris$Species , summary )
cb <- by(iris$Sepal.Width , iris$Species , summary )
cb
iris$Species: setosa
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.300 3.200 3.400 3.428 3.675 4.400
--------------------------------------------------------------
iris$Species: versicolor
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.525 2.800 2.770 3.000 3.400
--------------------------------------------------------------
iris$Species: virginica
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.200 2.800 3.000 2.974 3.175 3.800
ct
$setosa
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.300 3.200 3.400 3.428 3.675 4.400
$versicolor
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.525 2.800 2.770 3.000 3.400
$virginica
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.200 2.800 3.000 2.974 3.175 3.800
If we print these two objects, ct
and cb
, we "essentially" have the same results and the only differences are in how they are shown and the different class
attributes, respectively by
for cb
and array
for ct
.
As I've said, the power of by
arises when we can't use tapply
; the following code is one example:
tapply(iris, iris$Species, summary )
Error in tapply(iris, iris$Species, summary) :
arguments must have same length
R says that arguments must have the same lengths, say "we want to calculate the summary
of all variable in iris
along the factor Species
": but R just can't do that because it does not know how to handle.
With the by
function R dispatch a specific method for data frame
class and then let the summary
function works even if the length of the first argument (and the type too) are different.
bywork <- by(iris, iris$Species, summary )
bywork
iris$Species: setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.300 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:4.800 1st Qu.:3.200 1st Qu.:1.400 1st Qu.:0.200 versicolor: 0
Median :5.000 Median :3.400 Median :1.500 Median :0.200 virginica : 0
Mean :5.006 Mean :3.428 Mean :1.462 Mean :0.246
3rd Qu.:5.200 3rd Qu.:3.675 3rd Qu.:1.575 3rd Qu.:0.300
Max. :5.800 Max. :4.400 Max. :1.900 Max. :0.600
--------------------------------------------------------------
iris$Species: versicolor
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.900 Min. :2.000 Min. :3.00 Min. :1.000 setosa : 0
1st Qu.:5.600 1st Qu.:2.525 1st Qu.:4.00 1st Qu.:1.200 versicolor:50
Median :5.900 Median :2.800 Median :4.35 Median :1.300 virginica : 0
Mean :5.936 Mean :2.770 Mean :4.26 Mean :1.326
3rd Qu.:6.300 3rd Qu.:3.000 3rd Qu.:4.60 3rd Qu.:1.500
Max. :7.000 Max. :3.400 Max. :5.10 Max. :1.800
--------------------------------------------------------------
iris$Species: virginica
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.900 Min. :2.200 Min. :4.500 Min. :1.400 setosa : 0
1st Qu.:6.225 1st Qu.:2.800 1st Qu.:5.100 1st Qu.:1.800 versicolor: 0
Median :6.500 Median :3.000 Median :5.550 Median :2.000 virginica :50
Mean :6.588 Mean :2.974 Mean :5.552 Mean :2.026
3rd Qu.:6.900 3rd Qu.:3.175 3rd Qu.:5.875 3rd Qu.:2.300
Max. :7.900 Max. :3.800 Max. :6.900 Max. :2.500
it works indeed and the result is very surprising. It is an object of class by
that along Species
(say, for each of them) computes the summary
of each variable.
Note that if the first argument is a data frame
, the dispatched function must have a method for that class of objects. For example is we use this code with the mean
function we will have this code that has no sense at all:
by(iris, iris$Species, mean)
iris$Species: setosa
[1] NA
-------------------------------------------
iris$Species: versicolor
[1] NA
-------------------------------------------
iris$Species: virginica
[1] NA
Warning messages:
1: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
2: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
3: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
AGGREGATE
aggregate
can be seen as another a different way of use tapply
if we use it in such a way.
at <- tapply(iris$Sepal.Length , iris$Species , mean)
ag <- aggregate(iris$Sepal.Length , list(iris$Species), mean)
at
setosa versicolor virginica
5.006 5.936 6.588
ag
Group.1 x
1 setosa 5.006
2 versicolor 5.936
3 virginica 6.588
The two immediate differences are that the second argument of aggregate
must be a list while tapply
can (not mandatory) be a list and that the output of aggregate
is a data frame while the one of tapply
is an array
.
The power of aggregate
is that it can handle easily subsets of the data with subset
argument and that it has methods for ts
objects and formula
as well.
These elements make aggregate
easier to work with that tapply
in some situations.
Here are some examples (available in documentation):
ag <- aggregate(len ~ ., data = ToothGrowth, mean)
ag
supp dose len
1 OJ 0.5 13.23
2 VC 0.5 7.98
3 OJ 1.0 22.70
4 VC 1.0 16.77
5 OJ 2.0 26.06
6 VC 2.0 26.14
We can achieve the same with tapply
but the syntax is slightly harder and the output (in some circumstances) less readable:
att <- tapply(ToothGrowth$len, list(ToothGrowth$dose, ToothGrowth$supp), mean)
att
OJ VC
0.5 13.23 7.98
1 22.70 16.77
2 26.06 26.14
There are other times when we can't use by
or tapply
and we have to use aggregate
.
ag1 <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)
ag1
Month Ozone Temp
1 5 23.61538 66.73077
2 6 29.44444 78.22222
3 7 59.11538 83.88462
4 8 59.96154 83.96154
5 9 31.44828 76.89655
We cannot obtain the previous result with tapply
in one call but we have to calculate the mean along Month
for each elements and then combine them (also note that we have to call the na.rm = TRUE
, because the formula
methods of the aggregate
function has by default the na.action = na.omit
):
ta1 <- tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)
ta2 <- tapply(airquality$Temp, airquality$Month, mean, na.rm = TRUE)
cbind(ta1, ta2)
ta1 ta2
5 23.61538 65.54839
6 29.44444 79.10000
7 59.11538 83.90323
8 59.96154 83.96774
9 31.44828 76.90000
while with by
we just can't achieve that in fact the following function call returns an error (but most likely it is related to the supplied function, mean
):
by(airquality[c("Ozone", "Temp")], airquality$Month, mean, na.rm = TRUE)
Other times the results are the same and the differences are just in the class (and then how it is shown/printed and not only -- example, how to subset it) object:
byagg <- by(airquality[c("Ozone", "Temp")], airquality$Month, summary)
aggagg <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, summary)
The previous code achieve the same goal and results, at some points what tool to use is just a matter of personal tastes and needs; the previous two objects have very different needs in terms of subsetting.
grautur
Updated on June 05, 2021Comments
-
grautur almost 3 years
Whenever I want to do something "map"py in R, I usually try to use a function in the
apply
family.However, I've never quite understood the differences between them -- how {
sapply
,lapply
, etc.} apply the function to the input/grouped input, what the output will look like, or even what the input can be -- so I often just go through them all until I get what I want.Can someone explain how to use which one when?
My current (probably incorrect/incomplete) understanding is...
sapply(vec, f)
: input is a vector. output is a vector/matrix, where elementi
isf(vec[i])
, giving you a matrix iff
has a multi-element outputlapply(vec, f)
: same assapply
, but output is a list?apply(matrix, 1/2, f)
: input is a matrix. output is a vector, where elementi
is f(row/col i of the matrix)tapply(vector, grouping, f)
: output is a matrix/array, where an element in the matrix/array is the value off
at a groupingg
of the vector, andg
gets pushed to the row/col namesby(dataframe, grouping, f)
: letg
be a grouping. applyf
to each column of the group/dataframe. pretty print the grouping and the value off
at each column.aggregate(matrix, grouping, f)
: similar toby
, but instead of pretty printing the output, aggregate sticks everything into a dataframe.
Side question: I still haven't learned plyr or reshape -- would
plyr
orreshape
replace all of these entirely?-
JD Long over 13 yearsto your side question: for many things plyr is a direct replacement for
*apply()
andby
. plyr (at least to me) seems much more consistent in that I always know exactly what data format it expects and exactly what it will spit out. That saves me a lot of hassle. -
Iterator over 12 yearsAlso, I'd recommend adding:
doBy
and the selection & apply capabilities ofdata.table
. -
IRTFM over 11 years
sapply
is justlapply
with the addition ofsimplify2array
on the output.apply
does coerce to atomic vector, but output can be vector or list.by
splits dataframes into sub-dataframes, but it doesn't usef
on columns separately. Only if there is a method for 'data.frame'-class mightf
get column-wise applied byby
.aggregate
is generic so different methods exist for different classes of the first argument. -
Lutz Prechelt over 9 yearsMnemonic: l is for 'list', s is for 'simplifying', t is for 'per type' (each level of the grouping is a type)
-
Stefanos over 5 yearsThere also exist some functions in the package Rfast, like: eachcol.apply, apply.condition, and more, which are faster than R's equivalents
-
JD Long over 13 yearsWhen I started learning R from scratch I found plyr MUCH easier to learn than the
*apply()
family of functions. For me,ddply()
was very intuitive as I was familiar with SQL aggregation functions.ddply()
became my hammer for solving many problems, some of which could have been better solved with other commands. -
IRTFM over 12 yearsBelieve you will find that
by
is pure split-lapply andaggregate
istapply
at their cores. I think black sheep make excellent fabric. -
grautur over 12 yearsFantastic response! This should be part of the official R documentation :). One tiny suggestion: perhaps add some bullets on using
aggregate
andby
as well? (I finally understand them after your description!, but they're pretty common, so it might be useful to separate out and have some specific examples for those two functions.) -
isomorphismes over 12 yearsYeah ... what is a ragged array, anyway?
-
Frank almost 7 yearsIs this related to grouping?
-
vonjd almost 7 years@Frank: Well, to be honest with you the title of this post is rather misleading: when you read the question itself it is about "the apply family".
sweep
is a higher-order function like all the others mentioned here, e.g.apply
,sapply
,lapply
So the same question could be asked about the accepted answer with over 1,000 upvotes and the examples given therein. Just have a look at the example given forapply
there. -
onyambu over 6 yearsAs I've said, the power of by arises when we can't use tapply; the following code is one example: THIS ARE THE WORDS YOU HAVE USED ABOVE. And you have given an example of computing the summary. Well lets say that the summary statistics can be computed only that it will need cleaning: eg
data.frame(tapply(unlist(iris[,-5]),list(rep(iris[,5],ncol(iris[-5])),col(iris[-5])),summary))
this is a use of tapply. With the right splitting there is nothing you cant do with
tapply. The only thing is it returns a matrix. Please be careful by saying we cant use
tapply` -
moodymudskipper about 6 yearssweep has a misleading name, misleading defaults, and misleading parameter name :). It's easier for me to understand it this way : 1) STATS is vector or single value that will be repeated to form a matrix of the same size as first input, 2) FUN will be applied on 1st input and this new matrix. Maybe better illustrated by :
sweep(matrix(1:6,nrow=2),2,7:9,list)
. It's usually more efficient thanapply
because whereapply
loops,sweep
is able to use vectorised functions.