Count number of rows within each group

Solution 1

Current best practice (tidyverse) is:

library(dplyr)
# one row per Year/Month combination, with the row count in column n
df1 %>% count(Year, Month)

Solution 2

Following @Joshua's suggestion, here's one way to count the number of observations in your df data frame where Year is 2007 and Month is "Nov" (assuming both are columns):

# the logical condition selects rows, so it goes before the comma
nrow(df[df$Year == 2007 & df$Month == "Nov", ])

and with aggregate, following @GregSnow:

aggregate(x ~ Year + Month, data = df, FUN = length)
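As a minimal sketch of the aggregate approach, the block below runs it on the toy data posted with the question (df1) and cross-checks the totals against table(). The names counts and tab are illustrative only and do not come from the answer.

```r
# Toy data as posted with the question
set.seed(2)
df1 <- data.frame(x = 1:20,
                  Year = sample(2012:2014, 20, replace = TRUE),
                  Month = sample(month.abb[1:3], 20, replace = TRUE))

# One row per observed Year/Month combination; column x holds the row count
counts <- aggregate(x ~ Year + Month, data = df1, FUN = length)

# Cross-check with table(), which also lists combinations that never occur
tab <- as.data.frame(table(Year = df1$Year, Month = df1$Month))

# Both approaches should account for every row of df1
sum(counts$x) == nrow(df1)
sum(tab$Freq) == nrow(df1)
```

Note that aggregate only returns combinations that are present in the data, whereas table() reports absent combinations with a count of zero.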

Solution 3

The dplyr package does this with the count() and tally() functions, or with n():

First, some data:

df <- data.frame(x = rep(1:6, rep(c(1, 2, 3), 2)), year = 1993:2004, month = c(1, 1:11))

Now the count:

library(dplyr)
count(df, year, month)
# the same call, written with the pipe
df %>% count(year, month)

We can also use a slightly longer version with piping and the n() function:

df %>% 
  group_by(year, month) %>%
  summarise(number = n())

or the tally function:

df %>% 
  group_by(year, month) %>%
  tally()

Solution 4

An old question without a data.table solution. So here goes...

Using .N

library(data.table)
DT <- data.table(df)
DT[, .N, by = list(year, month)]

Solution 5

The simplest option with aggregate is the length function, which returns the length of the vector in each subset. A sometimes more robust alternative is function(x) sum(!is.na(x)), which counts only the non-missing values.
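The difference between the two only shows up when missing values reach the function. A small sketch, using a hypothetical data frame d that is not part of the original answer: with the formula interface, aggregate defaults to na.action = na.omit, so rows containing NA are dropped before FUN is called. Passing na.action = na.pass keeps them.

```r
# Hypothetical toy data: group "a" has one missing value
d <- data.frame(g = c("a", "a", "a", "b", "b"),
                x = c(1, NA, 3, 4, 5))

# na.pass keeps the NA row, so length counts all rows: a = 3, b = 2
n_rows <- aggregate(x ~ g, data = d, FUN = length, na.action = na.pass)

# Counting only non-missing values: a = 2, b = 2
n_nonmiss <- aggregate(x ~ g, data = d,
                       FUN = function(x) sum(!is.na(x)),
                       na.action = na.pass)

# Default na.action = na.omit drops the NA row first: a = 2, b = 2
n_default <- aggregate(x ~ g, data = d, FUN = length)
```

Which count is wanted depends on whether rows with a missing x should be treated as observations.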

Author by

MikeTP

Energy commodity trader learning R to better analyze big high frequency time series data.

Updated on October 29, 2021

Comments

  • MikeTP
    MikeTP over 2 years

    I have a data frame and I would like to count the number of rows within each group. I regularly use the aggregate function to sum data as follows:

    df2 <- aggregate(x ~ Year + Month, data = df1, sum)
    

    Now, I would like to count observations but can't seem to find the proper argument for FUN. Intuitively, I thought it would be as follows:

    df2 <- aggregate(x ~ Year + Month, data = df1, count)
    

    But, no such luck.

    Any ideas?


    Some toy data:

    set.seed(2)
    df1 <- data.frame(x = 1:20,
                      Year = sample(2012:2014, 20, replace = TRUE),
                      Month = sample(month.abb[1:3], 20, replace = TRUE))
    
    • Joshua Ulrich
      Joshua Ulrich about 12 years
      nrow, NROW, length...
    • Hong Ooi
      Hong Ooi about 12 years
      I keep reading this question as asking for a fun way to count things (as opposed to the many unfun ways, I guess).
    • Prolix
      Prolix over 8 years
      @JoshuaUlrich: nrow did not work for me but NROW and length worked fine. +1
  • sop
    sop almost 9 years
    Is there a way to aggregate a variable and do counting too (like 2 functions in aggregation: mean + count)? I need to get the mean of a column and the number of rows for the same value in other column
  • geotheory
    geotheory almost 9 years
    I'd cbind the results of aggregate(Sepal.Length ~ Species, iris, mean) and aggregate(Sepal.Length ~ Species, iris, length)
  • sop
    sop almost 9 years
    I have done it, but it seems that I get 2 times each column except the one that is aggregated; so I have done a merge on them and it seems to be ok
  • Manoj Kumar
    Manoj Kumar over 7 years
    I don't know but this could be useful as well... df %>% group_by(group, variable) %>% mutate(count = n())
  • geotheory
    geotheory over 7 years
    Yes dplyr is best practice now.
  • thelatemail
    thelatemail almost 5 years
    Just to note that if you are using the default, non-formula method for aggregate, there is no need to rename each variable in by= like list(year=df1$year) etc. A data.frame is a list already so aggregate(df1[c("Count")], by=df1[c("Year", "Month")], FUN=sum, na.rm=TRUE) will work.
  • sindri_baldur
    sindri_baldur over 4 years
    It is standard nowadays to use .() instead of list(), and setDT() to convert a data.frame to a data.table. So in one step: setDT(df)[, .N, by = .(year, month)].
  • camille
    camille over 2 years
    I'm a daily dplyr user but still wouldn't call it necessarily best practice, more like common personal preference
  • geotheory
    geotheory over 2 years
    You are perfectly right - dplyr isn't best for all cases, e.g. data.table or poorman might be preferable. And what does 'best practice' mean anyway?
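The mean-plus-count question raised in the comments can be sketched in base R as follows. This is one possible approach, using the built-in iris data mentioned by geotheory. The object names (means, counts, both, one_call) are illustrative assumptions.

```r
# Mean and row count of Sepal.Length per Species (built-in iris data)
means  <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
counts <- aggregate(Sepal.Length ~ Species, data = iris, FUN = length)

# Rename the result columns before merging so they do not collide
names(means)[2]  <- "mean"
names(counts)[2] <- "n"
both <- merge(means, counts, by = "Species")

# Alternative: a single call whose FUN returns both values as a named vector.
# aggregate stores them in a matrix column, which do.call(data.frame, ...)
# flattens into Sepal.Length.mean and Sepal.Length.n
one_call <- do.call(data.frame,
                    aggregate(Sepal.Length ~ Species, data = iris,
                              FUN = function(v) c(mean = mean(v), n = length(v))))
```

Merging on the grouping column avoids the duplicated columns that a plain cbind of the two results produces.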