Blend of na.omit and na.pass using aggregate?

10,436

Pass both na.action=na.pass and na.rm=TRUE to aggregate. The former tells aggregate not to delete rows where NAs exist; and the latter tells mean to ignore them.

aggregate(cbind(var1, var2, var3) ~ name, test, mean,
          na.action=na.pass, na.rm=TRUE)
Share:
10,436
Hendy
Author by

Hendy

Updated on June 04, 2022

Comments

  • Hendy
    Hendy almost 2 years

    I have a data set containing product prototype test data. Not all tests were run on all lots, and not all tests were executed with the same sample sizes. To illustrate, consider this case:

    > test <- data.frame(name = rep(c("A", "B", "C"), each = 4),
      var1 = rep(c(1:3, NA), 3),
      var2 = 1:12,
      var3 = c(rep(NA, 4), 1:8))
    
    > test
       name var1 var2 var3
    1     A    1    1   NA
    2     A    2    2   NA
    3     A    3    3   NA
    4     A   NA    4   NA
    5     B    1    5    1
    6     B    2    6    2
    7     B    3    7    3
    8     B   NA    8    4
    9     C    1    9    5
    10    C    2   10    6
    11    C    3   11    7
    12    C   NA   12    8
    

    In the past, I've only had to deal with cases of mis-matched repetitions, which has been easy with aggregate(cbind(var1, var2) ~ name, test, FUN = mean, na.action = na.omit) (or the default setting). I'll get averages for each lot over three values for var1 and over four values for var2.

    Unfortunately, this will leave me with a dataset completely missing lot A in this case:

     aggregate(cbind(var1, var2, var3) ~ name, test, FUN = mean, na.action = na.omit)
      name var1 var2 var3
    1    B    2    6    2
    2    C    2   10    6
    

    If I use na.pass, however, I also don't get what I want:

     aggregate(cbind(var1, var2, var3) ~ name, test, FUN = mean, na.action = na.pass)
      name var1 var2 var3
    1    A   NA  2.5   NA
    2    B   NA  6.5  2.5
    3    C   NA 10.5  6.5
    

    Now I lose the good data I had in var1 since it contained instances of NA.

    What I'd like is:

    • NA as the output of mean() if all unique combinations of varN ~ name are NAs
    • Output of mean() if there are one or more actual values for varN ~ name

    I'm guessing this is pretty simple, but I just don't know how. Do I need to use ddply for something like this? If so... the reason I tend to avoid it is that I end up writing really long equivalents to aggregate() like so:

    ddply(test, .(name), summarise,
      var1 = mean(var1, na.rm = T),
      var2 = mean(var2, na.rm = T),
      var3 = mean(var3, na.rm = T))
    

    Yeah... so the result of that apparently does what I want. I'll leave the question anyway in case there's 1) a way to do this with aggregate() or 2) shorter syntax for ddply.

  • Hendy
    Hendy almost 11 years
    Awesome, and I had no idea that was possible.
  • Scott Worland
    Scott Worland about 8 years
    @HongOoi This worked great. Just something to note, this will replace the NAs with zeros depending on which function you choose. This is most likely not the final result you want so just be sure to follow up with something like df[df == 0] <- NA. If you have real zeros in your df that you don't want removed, then combine the code above with is.na(df)
  • colin
    colin over 6 years
    heads up this returns NaN, rather than NA for name A, var3