How to subset data.frame by weeks and then sum?

16,546

Solution 1

This solution is influenced by @Andrie and @Chase.

# load plyr 
library(plyr)

# format weeks as per requirement (replace "00" with "52" and adjust corresponding year)
tmp <- list()
tmp$y <- format(df$date, format="%Y")
tmp$w <- format(df$date, format="%U")
tmp$y[tmp$w=="00"] <- as.character(as.numeric(tmp$y[tmp$w=="00"]) - 1)
tmp$w[tmp$w=="00"] <- "52"
df$week <- paste(tmp$y, tmp$w, sep = "-")

# get summary
df2 <- ddply(df, .(week), summarize, income=sum(income))

# include week ending date
tmp$week.ending <- lapply(df2$week, function(x) rev(df[df$week==x, "date"])[[1]])
df2$week.ending <- sapply(tmp$week.ending, as.character)

#      week income week.ending
# 1 2010-52    487  2011-01-01
# 2 2011-01    387  2011-01-08
# 3 2011-02    443  2011-01-15

Solution 2

First use format to convert your dates to week numbers, then plyr::ddply() to calculate the summaries:

library(plyr)
df$week <- format(df$date, format="%Y-%U")
ddply(df, .(week), summarize, income=sum(income))
     week income
1 2011-52    413
2 2012-01    435
3 2012-02    379

For more information on format.date, see ?strptime, particular the bit that defines %U as the week number.


EDIT:

Given the modified data and requirement, one way is to divide the date by 7 to get a numeric number indicating the week. (Or more precisely, divide by the number of seconds in a week to get the number of weeks since the epoch, which is 1970-01-01 by default.

In code:

df$week <- as.Date("1970-01-01")+7*trunc(as.numeric(df$date)/(3600*24*7))
library(plyr)
ddply(df, .(week), summarize, income=sum(income))

        week income
1 2010-12-23    298
2 2010-12-30    392
3 2011-01-06    294
4 2011-01-13    152

I have not checked that the week boundaries are on Sunday. You will have to check this, and insert an appropriate offset into the formula.

Solution 3

This is now simple using dplyr. Also I would suggest using cut(breaks = "week") rather than format() to cut the dates into weeks.

library(dplyr)
df %>% group_by(week = cut(date, "week")) %>% mutate(weekly_income = sum(income))

Solution 4

I Googled "group week days into weeks R" and came across this SO question. You mention you have multiple years, so I think we need to keep up with both the week number and also the year, so I modified the answers there as so format(date, format = "%U%y")

In use it looks like this:

library(plyr) #for aggregating
df <- transform(df, weeknum = format(date, format = "%y%U"))
ddply(df, "weeknum", summarize, suminc = sum(income))
#----
  weeknum suminc
1    1152    413
2    1201    435
3    1202    379

See ?strptime for all the format abbreviations.

Solution 5

Try rollapply from the zoo package:

rollapply(df$income, width=7, FUN = sum, by = 7)
# [1] 487 387 443

Or, use period.sum from the xts package:

period.sum(xts(df$income, order.by=df$date), which(df$wday %in% 7))
#            [,1]
# 2011-01-01  487
# 2011-01-08  387
# 2011-01-15  443

Or, to get the output in the format you want:

data.frame(income = period.sum(xts(df$income, order.by=df$date), 
                               which(df$wday %in% 7)),
           week = df$week[which(df$wday %in% 7)])
#            income    week
# 2011-01-01    487 2011-00
# 2011-01-08    387 2011-01
# 2011-01-15    443 2011-02

Note that the first week shows as 2011-00 because that's how it is entered in your data. You could also use week = df$week[which(df$wday %in% 1)] which would match your output.

Share:
16,546
Tony Breyal
Author by

Tony Breyal

Updated on June 23, 2022

Comments

  • Tony Breyal
    Tony Breyal almost 2 years

    Let's say I have several years worth of data which look like the following

    # load date package and set random seed
    library(lubridate)
    set.seed(42)
    
    # create data.frame of dates and income
    date <- seq(dmy("26-12-2010"), dmy("15-01-2011"), by = "days")
    df <- data.frame(date = date, 
                     wday = wday(date),
                     wday.name = wday(date, label = TRUE, abbr = TRUE),
                     income = round(runif(21, 0, 100)),
                     week = format(date, format="%Y-%U"),
                     stringsAsFactors = FALSE)
    
    #          date wday wday.name income    week
    # 1  2010-12-26    1       Sun     91 2010-52
    # 2  2010-12-27    2       Mon     94 2010-52
    # 3  2010-12-28    3      Tues     29 2010-52
    # 4  2010-12-29    4       Wed     83 2010-52
    # 5  2010-12-30    5     Thurs     64 2010-52
    # 6  2010-12-31    6       Fri     52 2010-52
    # 7  2011-01-01    7       Sat     74 2011-00
    # 8  2011-01-02    1       Sun     13 2011-01
    # 9  2011-01-03    2       Mon     66 2011-01
    # 10 2011-01-04    3      Tues     71 2011-01
    # 11 2011-01-05    4       Wed     46 2011-01
    # 12 2011-01-06    5     Thurs     72 2011-01
    # 13 2011-01-07    6       Fri     93 2011-01
    # 14 2011-01-08    7       Sat     26 2011-01
    # 15 2011-01-09    1       Sun     46 2011-02
    # 16 2011-01-10    2       Mon     94 2011-02
    # 17 2011-01-11    3      Tues     98 2011-02
    # 18 2011-01-12    4       Wed     12 2011-02
    # 19 2011-01-13    5     Thurs     47 2011-02
    # 20 2011-01-14    6       Fri     56 2011-02
    # 21 2011-01-15    7       Sat     90 2011-02
    

    I would like to sum 'income' for each week (Sunday thru Saturday). Currently I do the following:

    Weekending 2011-01-01 = sum(df$income[1:7]) = 487
    Weekending 2011-01-08 = sum(df$income[8:14]) = 387
    Weekending 2011-01-15 = sum(df$income[15:21]) = 443
    

    However I would like a more robust approach which will automatically sum by week. I can't work out how to automatically subset the data into weeks. Any help would be much appreciated.

  • Tony Breyal
    Tony Breyal almost 12 years
    Almost there, please see edit in my question for an issue I spotted with this approach :)
  • Tony Breyal
    Tony Breyal almost 12 years
    Almost there, please see edit in my question for an issue I spotted with this approach :)
  • Chase
    Chase almost 12 years
    @TonyBreyal - looks like Andrie beat me to the punch and has a more robust answer. Also, Dwin's answer in the question I linked above is along these lines as well. Cheers.
  • Tony Breyal
    Tony Breyal almost 12 years
    thank you kindly. I came up with my own solution based around further reading of ?format and plyr, as you suggested. I changed the data.frame in my question to better reflect what I was after and I think that is why your results are not quite what I was looking for (after applying your code to the new data) but using your answer I was able to get the solution I needed (I have added it above). Thank you very much, I couldn't have done it without reading your approach :)
  • Tony Breyal
    Tony Breyal almost 12 years
    thank you kindly for the link. I saw it before but I couldn't work out how it would apply to my requirement of allowing a week to cross the year boundary. I have the solution now though and appreciate your help :)