Use of ddply + mutate with a custom function?
You're mostly right. ddply indeed breaks your data down into mini data frames based on the grouper, and applies a function to each piece.
With ddply, all the work is done with data frames, so the .fun argument must take a (mini) data frame as input and return a data frame as output.
mutate and summarize are functions that fit this bill (they take and return data frames). You can view their individual help pages, or run them on a data frame outside of ddply to see this, e.g.
mutate(mtcars, mean.mpg = mean(mpg))
summarize(mtcars, mean.mpg = mean(mpg))
If you don't use mutate or summarize, that is, you only use a custom function, then your function also needs to take a (mini) data frame as argument, and return a data frame.
If you do use mutate or summarize, any other functions you pass to ddply aren't used by ddply; they're just passed on to be used by mutate or summarize. And functions used by mutate and summarize act on the columns of the data, not on the entire data.frame. This is why you write
ddply(mtcars, "cyl", mutate, mean.mpg = mean(mpg))
Notice that we don't pass mutate a function. We don't say ddply(mtcars, "cyl", mutate, mean). We have to tell it what to take the mean of. In ?mutate, the description of ... is "named parameters giving definitions of new columns", not anything to do with functions. (Is mean() really different from any "custom function"? No.)
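To make the function-versus-expression distinction concrete, here is a minimal sketch. It wraps the failing call in tryCatch so the script keeps running; the question reports the error as "attempt to replicate an object of type 'closure'". The names bad and good are just illustrative.

```r
library(plyr)

# A function object given as the column definition fails: mutate tries to
# store the function itself as a column of each mini data frame.
bad <- tryCatch(
  ddply(mtcars, "cyl", mutate, mean.mpg = function(d) mean(d$mpg)),
  error = function(e) e
)
inherits(bad, "error")

# An expression works: it is evaluated using the columns of each piece.
good <- ddply(mtcars, "cyl", mutate, mean.mpg = mean(mpg))
head(good)
```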
Thus it doesn't work with anonymous functions--or functions at all. Pass it an expression! You can define a custom function beforehand.
custom_function <- function(x) {mean(x + runif(length(x)))}
ddply(mtcars, "cyl", mutate, jittered.mean.mpg = custom_function(mpg))
ddply(mtcars, "cyl", summarize, jittered.mean.mpg = custom_function(mpg))
This extends well: you can have functions that take multiple arguments, and you can give them different columns as arguments. But if you're using mutate or summarize, you have to give the other functions arguments; you're not just passing the functions.
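As a rough sketch of the multi-argument case, the helper ratio_of_means below is hypothetical (it is not part of plyr) and takes two columns. It is defined at the top level of the script, which I'm assuming is where you would define your own helpers so that summarize can find them.

```r
library(plyr)

# Hypothetical helper taking two vectors: ratio of their means.
ratio_of_means <- function(x, y) {
  mean(x) / mean(y)
}

# The expression names the columns to use as arguments; within each
# cyl group, mpg and wt refer to that group's columns.
out <- ddply(mtcars, "cyl", summarize,
             mpg.per.wt = ratio_of_means(mpg, wt))
out
```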
You seem to want to pass ddply a function that already "knows" which column to take the mean of. For that, I think you'd need to not use mutate or summarize, but you can hack your own version. For summarize-like behavior, return a data.frame with a single value; for mutate-like behavior, return the original data.frame with your extra value cbinded on:
mean.mpg.mutate = function(df) {
  cbind.data.frame(df, mean.mpg = mean(df$mpg))
}
mean.mpg.summarize = function(df) {
  data.frame(mean.mpg = mean(df$mpg))
}
ddply(mtcars, "cyl", mean.mpg.mutate)
ddply(mtcars, "cyl", mean.mpg.summarize)
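If you'd rather not hard-code the column, one possible variation is to pass the column name as an extra argument, since ddply forwards any additional arguments to the function. This is only a sketch; add_group_mean and its col argument are hypothetical names chosen for illustration.

```r
library(plyr)

# Hypothetical helper: takes a (mini) data frame and a column name,
# returns the data frame with the group mean appended (mutate-like).
add_group_mean <- function(df, col) {
  df$group.mean <- mean(df[[col]])
  df
}

# Arguments after the function are forwarded to it by ddply.
out <- ddply(mtcars, "cyl", add_group_mean, col = "mpg")
head(out)
```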
tl;dr
Why can't I use mutate with a custom function? Is it just that "built-in" functions return some sort of class that ddply can deal with vs. having to kick out a full data.frame and then call out only the column I care about?
Quite the opposite! mutate and summarize take data frames as inputs and kick out data frames as returns. But mutate and summarize are the functions you're passing to ddply, not mean or whatever else.
mutate and summarize are convenience functions that you'll use 99% of the time you use ddply.
If you don't use mutate/summarize, then your function needs to take and return a data frame.
If you do use mutate/summarize, then you don't pass them functions, you pass them expressions that can be evaluated with your (mini) data frame. If it's mutate, the return should be a vector to be appended to the data (recycled as necessary). If it's summarize, the return should be a single value. You don't pass a function, like mean; you pass an expression, like mean(mpg).
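A small sketch of the difference in shape between the two, and of the recycling mentioned above (the object names are just illustrative):

```r
library(plyr)

by_mutate    <- ddply(mtcars, "cyl", mutate,    mean.mpg = mean(mpg))
by_summarize <- ddply(mtcars, "cyl", summarize, mean.mpg = mean(mpg))

dim(by_mutate)     # one row per original row, plus the new column
dim(by_summarize)  # one row per group

# Repeating the value by hand is unnecessary: mutate recycles the single
# value returned by mean(mpg) to the length of each mini data frame.
by_rep <- ddply(mtcars, "cyl", mutate,
                mean.mpg = rep(mean(mpg), length(mpg)))
all.equal(by_mutate, by_rep)
```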
What about dplyr?
This was written before dplyr was a thing, or at least a big thing. dplyr removes a lot of the confusion from this process because it essentially replaces the nesting of ddply with mutate or summarize as arguments with sequential functions: group_by followed by mutate or summarize. The dplyr version of my answer would be
library(dplyr)
group_by(mtcars, cyl) %>%
mutate(mean.mpg = mean(mpg))
With the new column creation passed directly to mutate (or summarize), there isn't confusion about which function does what.
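For completeness, a sketch of the summarize counterpart in dplyr, with a custom function called inside the expression. mpg_spread is a hypothetical helper, not something dplyr provides.

```r
library(dplyr)

# Hypothetical helper: range width of a vector.
mpg_spread <- function(x) {
  max(x) - min(x)
}

res <- mtcars %>%
  group_by(cyl) %>%
  summarize(mean.mpg = mean(mpg), spread.mpg = mpg_spread(mpg))
res
```

As with ddply, the custom function is called with a column as its argument rather than being passed on its own.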
Hendy
Updated on July 01, 2022
Comments
-
Hendy almost 2 years
I use ddply quite frequently, but historically with summarize (occasionally mutate) and only basic functions like mean(), var1 - var2, etc. I have a dataset in which I'm trying to apply a custom, more involved function and started trying to dig into how to do this with ddply. I've got a successful solution, but I don't understand why it works like this vs. for more "normal" functions.
Related
- Custom Function not recognized by ddply {plyr}...
- How do I pass variables to a custom function in ddply?
- r-help: [R] Correct use of ddply with own function (I ended up basing my solution on this)
Here's an example data set:
library(plyr)
df <- data.frame(id = rep(letters[1:3], each = 3), value = 1:9)
Normally, I'd use ddply like so:
df_ply_1 <- ddply(df, .(id), mutate, mean = mean(value))
My visualization of this is that ddply splits df into "mini" data frames based on grouped combos of id, and then I add a new column by calling mean() on a column name that exists in df. So, my attempt to implement a function extended this idea:
# actually, my logical extension of the above was to use:
# ddply(..., mean = function(value) { mean(value) })
df_ply_2 <- ddply(df, .(id), mutate, mean = function(df) { mean(df$value) })
Error: attempt to replicate an object of type 'closure'
All the help on custom functions doesn't use mutate, but that seems inconsistent, or at least annoying to me, as the analog to my implemented solution is:
df_mean <- function(df) {
  temp <- data.frame(mean = rep(mean(df$value), nrow(df)))
  temp
}
df_ply_3 <- df
df_ply_3$mean <- ddply(df, .(id), df_mean)$mean
In-line, looks like I have to do this:
df_ply_4 <- df
df_ply_4$mean <- ddply(df, .(id), function(x) {
  temp <- data.frame(mean = rep(mean(x$value), length(x$value)))
  temp
})$mean
Why can't I use mutate with a custom function? Is it just that "built-in" functions return some sort of class that ddply can deal with vs. having to kick out a full data.frame and then call out only the column I care about?
Thanks for helping me "get it"!
Update after @Gregor's answer
Awesome answer, and I think I now get it. I was, indeed, confused about what mutate and summarize meant... thinking they were arguments to ddply regarding how to handle the result vs. actually being the functions themselves. So, thanks for that big insight.
Also, it really helped to understand that without mutate/summarize, I need to return a data.frame, which is the reason I have to cbind a column with the name of the column in the df that gets returned.
Lastly, if I do use mutate, it's helpful to now realize I can return a vector result and get the right result. Thus, I can do this, which I've now understood after reading your answer:
# I also caught that the code above doesn't do the right thing
# and recycles the single value returned by mean() vs. repeating it like
# I expected. Now that I know it's taking a vector, I know I need to return
# a vector the same length as my mini df
custom_mean <- function(x) {
  rep(mean(x), length(x))
}
df_ply_5 <- ddply(df, .(id), mutate, mean = custom_mean(value))
Thanks again for your in-depth answer!
Update per @Gregor's last comment
Hmmm. I used rep(mean(x), length(x)) due to this observation for df_ply_3's result (I admit to not actually looking at it closely when I ran it the first time making this post, I just saw that it didn't give me an error!):
df_mean <- function(x) {
  data.frame(mean = mean(x$value))
}
df_ply_3 <- df
df_ply_3$mean <- ddply(df, .(id), df_mean)$mean
df_ply_3
  id value mean
1  a     1    2
2  a     2    5
3  a     3    8
4  b     4    2
5  b     5    5
6  b     6    8
7  c     7    2
8  c     8    5
9  c     9    8
So, I'm thinking that my code was actually an accident based on the fact that I had 3 id variables repeated 3 times. Thus the actual return was the equivalent of summarize (one row per id value), and recycled. Testing that theory appears accurate if I update my data frame like so:
df <- data.frame(id = c(rep(letters[1:3], each = 3), "d"), value = 1:10)
I get an error when trying to use the df_ply_3 method with df_mean():
Error in `$<-.data.frame`(`*tmp*`, "mean", value = c(2, 5, 8, 10)) : replacement has 4 rows, data has 10
So, the mini df passed to df_mean returns a df where mean is the result of taking the mean of the value vector (returns one value). So, my output was just a data.frame of three values, one per id group. I'm thinking the mutate way sort of "remembers" that it was passed a mini data frame, and then repeats the single output to match its length?
In any case, thanks for commenting on df_ply_5; indeed, if I remove the rep() bit and just return mean(x), it works great!
-
baptiste over 9 years
maybe you want ddply(df, .(id), function(d) mutate(d, mean = mean(value)))
-
Gregor Thomas over 9 years
Looks like you've got it pretty well! But regarding your custom_mean function... thanks to recycling, if you want the same value multiple times you can just return one value, it's a nice feature! Notice that your df_ply_1, your df_ply_5 and @baptiste's comment code are all slightly different, but the returns are all the same.
-
Hendy over 9 years
@Gregor Updated with another section based on your comment. Yup, that works and now I think I get why.
-
Hendy over 9 years
@baptiste Awesome! Now that I understand that mutate actually is the function passed to ddply() (vs. my thought that it was some argument telling ddply() how to return the result), it makes sense that I could call it like that as the function vs. trying to specify an additional function to "mutate on."
-
IRTFM over 9 years
The basic problem with the questioner's 'custom function' was that it was attempting to work on an object from the global environment that was "too big" for the multiple smaller local environments. Seems like the mutate function should be throwing a more informative error message.
-
Gregor Thomas over 9 years
I agree the error message was pretty unhelpful. There are plenty of cases where passing objects from the global environment (usually as extra arguments) to the function is exactly what is needed, so I don't see an obvious solution.
-
IRTFM over 9 years
Maybe mutate should just produce the error message: "Don't send functions to me."
-
Hendy over 9 years
Awesome answer, and thanks so much for your assistance. I'm newer to functions, and think another issue was assuming that defining an "in-line" function would pick up the name of the mini data.frame (as in mean = function(value) { mean(value) } would pass mini.df$value), whereas it's just an anonymous thing called value that it's passing. Or at least I think that's what's going on. mutate doesn't find the thing called value and I'm assuming type closure means "doesn't exist." In any case, thanks again. This was perfect and will hopefully serve many others well!
-
Gregor Thomas over 9 years
Glad it helped! Your comment is incorrectly using "closure" and "anonymous function"; the best source I know for that is Hadley's Advanced R Book (that link is to the Functional Programming Chapter, which has sections on both those terms).
-
Celeste about 8 years
@Gregor have I completely misunderstood the above, or is it not possible to pass objects to mutate other than the df in question? I.e. if I have a user-defined function of the form MyFun(df, ob1, ob2), the following will not work: df%>%rowwise()%>%mutate(X=MyFun(ob1,ob2)). Why?
-
Gregor Thomas about 8 years
In that chain you could probably just do %>% mutate(., X = MyFun(., ob1, ob2)). But the real question is why MyFun needs a data frame? If it is made for adding columns to data frames, then it should return the data frame with the added column and you don't need mutate: %>% MyFun(ob1, ob2).
-
Celeste about 8 years
Thanks @Gregor, I changed it to the latter; however, I now get the error: (list) object cannot be coerced to type double. I suspect it has to do with the way MyFun is written, i.e., it uses df[,"Col"], which the dplyr solution is taking to mean all rows, not row by row like when one uses apply(df,2,fun)
-
Gregor Thomas about 8 years
Agreed, it's probably how the function is written. Ask a new question if you're having trouble. That way you can show the function (or a simplified version).