Aggregate a dataframe on a given column and display another column

r aggregate plyr greatest-n-per-group

119,177

Solution 1

First, you split the data using split:

split(z,z$Group)

Than, for each chunk, select the row with max Score:

lapply(split(z,z$Group),function(chunk) chunk[which.max(chunk$Score),])

Finally reduce back to a data.frame do.calling rbind:

do.call(rbind,lapply(split(z,z$Group),function(chunk) chunk[which.max(chunk$Score),]))

Result:

  Group Score Info
1     1     3    c
2     2     4    d

One line, no magic spells, fast, result has good names =)

Solution 2

A late answer, but and approach using data.table

library(data.table)
DT <- data.table(dat)

DT[, .SD[which.max(Score),], by = Group]

Or, if it is possible to have more than one equally highest score

DT[, .SD[which(Score == max(Score)),], by = Group]

Noting that (from ?data.table

.SD is a data.table containing the Subset of x's Data for each group, excluding the group column(s)

Solution 3

To add to Gavin's answer: prior to the merge, it is possible to get aggregate to use proper names when not using the formula interface:

aggregate(data[,"score", drop=F], list(group=data$group), mean)

Solution 4

The plyr package can be used for this. With the ddply() function you can split a data frame on one or more columns and apply a function and return a data frame, then with the summarize() function you can use the columns of the splitted data frame as variables to make the new data frame/;

dat <- read.table(textConnection('Group Score Info
1     1     1    a
2     1     2    b
3     1     3    c
4     2     4    d
5     2     3    e
6     2     1    f'))

library("plyr")

ddply(dat,.(Group),summarize,
    Max = max(Score),
    Info = Info[which.max(Score)])
  Group Max Info
1     1   3    c
2     2   4    d

Solution 5

This is how I baseically think of the problem.

my.df <- data.frame(group = rep(c(1,2), each = 3), 
        score = runif(6), info = letters[1:6])
my.agg <- with(my.df, aggregate(score, list(group), max))
my.df.split <- with(my.df, split(x = my.df, f = group))
my.agg$info <- unlist(lapply(my.df.split, FUN = function(x) {
            x[which(x$score == max(x$score)), "info"]
        }))

> my.agg
  Group.1         x info
1       1 0.9344336    a
2       2 0.7699763    e

View more solutions

119,177

Author by

jul635

Updated on June 19, 2020

Comments

jul635 almost 4 years

I have a dataframe in R of the following form:

> head(data)
  Group Score Info
1     1     1    a
2     1     2    b
3     1     3    c
4     2     4    d
5     2     3    e
6     2     1    f

I would like to aggregate it following the Score column using the max function

> aggregate(data$Score, list(data$Group), max)

  Group.1         x
1       1         3
2       2         4

But I also would like to display the Info column associated to the maximum value of the Score column for each group. I have no idea how to do this. My desired output would be:

  Group.1         x        y
1       1         3        c
2       2         4        d

Any hint?

Andrie almost 13 years

Nice, but I think you'll agree my solution is slightly more general since it will return all columns in the original data.frame.
Sacha Epskamp almost 13 years

Yeah indeed. You could change x$Score==max in which,max if you assume that there is only one maximum per group.
Andrie almost 13 years

Ah, that's nice. I didn't know about which.max
hhh about 7 years

How can you calculate standard deviation only of the last six values, not of the whole column? Do you firstly need to aggregate the data or does there exist something like SQL-style OVER (PARTITION ... BETWEEN 5 PRECEDING AND CURRENT ROW in R? I am trying to answer this here but stuck.
jangorecki almost 5 years

@hhh this looks like rolling sd, please upvote github.com/Rdatatable/data.table/issues/2778