aggregate methods treat missing values (NA) differently

Solution 1

Good question, but in my opinion, this shouldn't have caused a major debugging headache because it is documented quite clearly in multiple places in the manual page for aggregate.

First, in the usage section:

## S3 method for class 'formula'
aggregate(formula, data, FUN, ...,
          subset, na.action = na.omit)

Later, in the description of the arguments:

na.action: a function which indicates what should happen when the data contain NA values. The default is to ignore missing values in the given variables.
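
To make that default concrete, here is a minimal sketch using the two-row data frame from the question (rebuilt here so the snippet is self-contained). It assumes, in line with the documented default, that the formula method effectively applies na.omit to whole rows before FUN is ever called; the names kept, res_default and res_manual are only illustrative.

```r
# Data frame from the question: a single NA, in Col1 of row 1
M <- data.frame(Name = c("name", "name"), Col1 = c(NA, 1), Col2 = c(1, 1))

# na.omit() on a data frame drops every row that contains at least one NA
kept <- na.omit(M)
nrow(kept)

# Formula method with its default na.action
res_default <- aggregate(. ~ Name, M, FUN = sum, na.rm = TRUE)

# Same aggregation done by hand on the rows that survived na.omit()
res_manual <- aggregate(kept[, 2:3], by = list(Name = kept$Name), FUN = sum)
```

Both results sum only the second row, which is why Col2 comes out as 1 rather than 2.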


I can't answer why the formula method was written differently (that's something the function authors would have to answer), but using the above information, you can probably use the following:

aggregate(.~Name, M, FUN=sum, na.rm=TRUE, na.action=NULL)
#   Name Col1 Col2
# 1 name    1    2
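
Note that na.rm and na.action do separate jobs in that call: na.action decides which rows reach FUN, while na.rm = TRUE is simply forwarded to sum. A small sketch, again on the question's M (rebuilt here; no_rm and with_rm are illustrative names), of what happens when the rows are kept but na.rm is left out:

```r
M <- data.frame(Name = c("name", "name"), Col1 = c(NA, 1), Col2 = c(1, 1))

# Rows containing NA are kept, but sum() is not told to skip NA,
# so the NA propagates into the Col1 total
no_rm <- aggregate(. ~ Name, M, FUN = sum, na.action = na.pass)

# Rows are kept and sum() skips the NA
with_rm <- aggregate(. ~ Name, M, FUN = sum, na.rm = TRUE, na.action = na.pass)
```

So changing na.action alone is not enough; FUN still has to handle the NA values it now receives.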

Solution 2

If you want the formula version to be equivalent, try this:

M = data.frame( Name = rep('name',5), Col1 = c(NA,rep(1,4)) , Col2 = rep(1,5))
aggregate(. ~ Name, M, function(x) sum(x, na.rm=TRUE), na.action = na.pass)
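
As a quick check (a sketch, not part of the original answer), the result can be compared against the non-formula interface on the same five-row M. Only the aggregated columns are compared, because the two interfaces name the grouping column differently (Name versus Group.1); by_formula and by_default are illustrative names.

```r
M <- data.frame(Name = rep("name", 5), Col1 = c(NA, rep(1, 4)), Col2 = rep(1, 5))

# Formula interface, keeping rows with NA and skipping NA inside sum()
by_formula <- aggregate(. ~ Name, M, function(x) sum(x, na.rm = TRUE),
                        na.action = na.pass)

# Non-formula interface from the question
by_default <- aggregate(M[, 2:3], by = list(M$Name), FUN = sum, na.rm = TRUE)

# Compare the aggregated values only
all.equal(by_formula[, c("Col1", "Col2")], by_default[, c("Col1", "Col2")])
```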

Author: Ryan Walker

Updated on July 09, 2022

Comments

  • Ryan Walker
    Ryan Walker almost 2 years

    Here's a simple data frame with a missing value:

    M = data.frame( Name = c('name', 'name'), Col1 = c(NA, 1) , Col2 = c(1, 1))
    #   Name Col1 Col2
    # 1 name   NA    1
    # 2 name    1    1
    

    When I use aggregate to sum variables by group ('Name') using the formula method:

    aggregate(. ~ Name, M, FUN = sum, na.rm = TRUE)

    the result is:

    #    Name Col1 Col2
    #    name    1    1
    

    So the entire first row, which has an NA, is ignored. But if I use the "non-formula" specification:

    aggregate(M[, 2:3], by = list(M$Name), FUN = sum, na.rm = TRUE)

    the result is:

    # Group.1 Col1 Col2
    #    name    1    2
    

    Here only the (1,1) entry is ignored.

    This caused a major debugging headache in one of my scripts, since I thought these two calls were equivalent. Is there a good reason why the formula method is treated differently?

    Thanks.

  • eddi
    eddi almost 11 years
    -1 for the first sentence (sure, it looks easy now that you know exactly what you're looking for, but this would be something quite non-trivial to find in practice)
  • A5C1D2H2I1M1N2O1R2T1
    A5C1D2H2I1M1N2O1R2T1 almost 11 years
    @eddi, no problem. I know from your chat and comment histories that you like functions to work like you want them to rather than how they are documented, and you are entirely open to that opinion.
  • Josh O'Brien
    Josh O'Brien almost 11 years
    @eddi -- Really, a downvote for that?? I think Ananda makes a worthwhile point there... Carefully reading the help docs, sooner rather than later, is a very good habit to learn, and will save many headaches down the road!
  • eddi
    eddi almost 11 years
    @AnandaMahto - haha, rather I like functions to be consistent across different use cases; but I elaborated more on the -1 above - it has more to do with you thinking that this is easy to find, just because there is mention of this (again, inconsistent) behavior in the manual
  • A5C1D2H2I1M1N2O1R2T1
    A5C1D2H2I1M1N2O1R2T1 almost 11 years
    +1, but anonymous function not required: aggregate(.~Name, M, FUN=sum, na.rm=TRUE, na.action="na.pass") works too.
  • Josh O'Brien
    Josh O'Brien almost 11 years
    @eddi -- Sounds like you'd actually like to downvote the author of aggregate.formula ;) But, given that methods sometimes do use inconsistent defaults, where else than the manual should they be documented? The positive value of Ananda's comment is that it reminds the OP (and others) that, in this inconsistent world of ours, reading the manual saves headaches!
  • eddi
    eddi almost 11 years
    @JoshO'Brien I really would :) And Anando got it for endorsing their bad behavior. The reason I downvoted this answer is because while it is true that "reading the manual can save headaches", I have a hard time imagining how it would here. The way you become aware of this particular issue is likely through pain and not through reading the manual. You can use the manual later to confirm the source of your pain of course, but then that manual should be regarded as a badly behaving child rather than some sort of a bible to be put on a pedestal. /end of nonsensical comparisons
  • eddi
    eddi almost 11 years
    *Ananda, sorry for misspelling
  • big_m
    big_m over 8 years
    FWIW, when I read the documentation quoted, I would interpret that to mean that just the NA values are removed, not entire rows where there are any NAs. Perhaps a more experienced R user would find it obvious, but I did not. All that would really be necessary to say is to use na.action=na.pass. That was the solution I was looking for (in a similar situation to the asker).
  • big_m
    big_m over 8 years
    Thanks for pointing out na.pass. That's a little clearer than NULL (though both seem to work).
  • Pladiona
    Pladiona over 2 years
    May I just add that the documentation is not so good? I am just arriving at this AFTER reading it. It is clear what the function is, but what are the options? na.action = na.omit returns to me "invalid 'type' (closure) of argument". Is there anywhere with proper documentation about aggregate or na.omit that explains its use well? Would be very grateful for any leads...
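
On the error quoted in the last comment: the original call is not shown, so this is only a guess, but one plausible cause is passing na.action to the non-formula interface. There it is not a named argument, so it travels through ... to FUN, and sum() is then asked to add up a function. A sketch on the question's M (msg and ok are illustrative names):

```r
M <- data.frame(Name = c("name", "name"), Col1 = c(NA, 1), Col2 = c(1, 1))

# Non-formula interface: na.action is forwarded to sum() through `...`,
# so the error is caught here and its message kept for inspection
msg <- tryCatch(
  aggregate(M[, 2:3], by = list(M$Name), FUN = sum, na.action = na.omit),
  error = function(e) conditionMessage(e)
)
msg

# Formula interface: na.action is a real argument, so the call succeeds
ok <- aggregate(. ~ Name, M, FUN = sum, na.action = na.omit)
```

If that is indeed the situation, the fix is either to switch to the formula interface or to drop na.action and rely on na.rm = TRUE alone.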