Aggregate a dataframe on a given column and display another column

119,177

Solution 1

First, you split the data using split:

split(z,z$Group)

Than, for each chunk, select the row with max Score:

lapply(split(z,z$Group),function(chunk) chunk[which.max(chunk$Score),])

Finally reduce back to a data.frame do.calling rbind:

do.call(rbind,lapply(split(z,z$Group),function(chunk) chunk[which.max(chunk$Score),]))

Result:

  Group Score Info
1     1     3    c
2     2     4    d

One line, no magic spells, fast, result has good names =)

Solution 2

A late answer, but and approach using data.table

library(data.table)
DT <- data.table(dat)

DT[, .SD[which.max(Score),], by = Group]

Or, if it is possible to have more than one equally highest score

DT[, .SD[which(Score == max(Score)),], by = Group]

Noting that (from ?data.table

.SD is a data.table containing the Subset of x's Data for each group, excluding the group column(s)

Solution 3

To add to Gavin's answer: prior to the merge, it is possible to get aggregate to use proper names when not using the formula interface:

aggregate(data[,"score", drop=F], list(group=data$group), mean) 

Solution 4

The plyr package can be used for this. With the ddply() function you can split a data frame on one or more columns and apply a function and return a data frame, then with the summarize() function you can use the columns of the splitted data frame as variables to make the new data frame/;

dat <- read.table(textConnection('Group Score Info
1     1     1    a
2     1     2    b
3     1     3    c
4     2     4    d
5     2     3    e
6     2     1    f'))

library("plyr")

ddply(dat,.(Group),summarize,
    Max = max(Score),
    Info = Info[which.max(Score)])
  Group Max Info
1     1   3    c
2     2   4    d

Solution 5

This is how I baseically think of the problem.

my.df <- data.frame(group = rep(c(1,2), each = 3), 
        score = runif(6), info = letters[1:6])
my.agg <- with(my.df, aggregate(score, list(group), max))
my.df.split <- with(my.df, split(x = my.df, f = group))
my.agg$info <- unlist(lapply(my.df.split, FUN = function(x) {
            x[which(x$score == max(x$score)), "info"]
        }))

> my.agg
  Group.1         x info
1       1 0.9344336    a
2       2 0.7699763    e
Share:
119,177
jul635
Author by

jul635

Updated on June 19, 2020

Comments

  • jul635
    jul635 almost 4 years

    I have a dataframe in R of the following form:

    > head(data)
      Group Score Info
    1     1     1    a
    2     1     2    b
    3     1     3    c
    4     2     4    d
    5     2     3    e
    6     2     1    f
    

    I would like to aggregate it following the Score column using the max function

    > aggregate(data$Score, list(data$Group), max)
    
      Group.1         x
    1       1         3
    2       2         4
    

    But I also would like to display the Info column associated to the maximum value of the Score column for each group. I have no idea how to do this. My desired output would be:

      Group.1         x        y
    1       1         3        c
    2       2         4        d
    

    Any hint?

  • Andrie
    Andrie almost 13 years
    Nice, but I think you'll agree my solution is slightly more general since it will return all columns in the original data.frame.
  • Sacha Epskamp
    Sacha Epskamp almost 13 years
    Yeah indeed. You could change x$Score==max in which,max if you assume that there is only one maximum per group.
  • Andrie
    Andrie almost 13 years
    Ah, that's nice. I didn't know about which.max
  • hhh
    hhh about 7 years
    How can you calculate standard deviation only of the last six values, not of the whole column? Do you firstly need to aggregate the data or does there exist something like SQL-style OVER (PARTITION ... BETWEEN 5 PRECEDING AND CURRENT ROW in R? I am trying to answer this here but stuck.
  • jangorecki
    jangorecki almost 5 years
    @hhh this looks like rolling sd, please upvote github.com/Rdatatable/data.table/issues/2778