How to subset data.frame by weeks and then sum?

r datetime dataframe

16,546

Solution 1

This solution is influenced by @Andrie and @Chase.

# load plyr 
library(plyr)

# format weeks as per requirement (replace "00" with "52" and adjust corresponding year)
tmp <- list()
tmp$y <- format(df$date, format="%Y")
tmp$w <- format(df$date, format="%U")
tmp$y[tmp$w=="00"] <- as.character(as.numeric(tmp$y[tmp$w=="00"]) - 1)
tmp$w[tmp$w=="00"] <- "52"
df$week <- paste(tmp$y, tmp$w, sep = "-")

# get summary
df2 <- ddply(df, .(week), summarize, income=sum(income))

# include week ending date
tmp$week.ending <- lapply(df2$week, function(x) rev(df[df$week==x, "date"])[[1]])
df2$week.ending <- sapply(tmp$week.ending, as.character)

#      week income week.ending
# 1 2010-52    487  2011-01-01
# 2 2011-01    387  2011-01-08
# 3 2011-02    443  2011-01-15

Solution 2

First use format to convert your dates to week numbers, then plyr::ddply() to calculate the summaries:

library(plyr)
df$week <- format(df$date, format="%Y-%U")
ddply(df, .(week), summarize, income=sum(income))
     week income
1 2011-52    413
2 2012-01    435
3 2012-02    379

For more information on format.date, see ?strptime, particular the bit that defines %U as the week number.

EDIT:

Given the modified data and requirement, one way is to divide the date by 7 to get a numeric number indicating the week. (Or more precisely, divide by the number of seconds in a week to get the number of weeks since the epoch, which is 1970-01-01 by default.

In code:

df$week <- as.Date("1970-01-01")+7*trunc(as.numeric(df$date)/(3600*24*7))
library(plyr)
ddply(df, .(week), summarize, income=sum(income))

        week income
1 2010-12-23    298
2 2010-12-30    392
3 2011-01-06    294
4 2011-01-13    152

I have not checked that the week boundaries are on Sunday. You will have to check this, and insert an appropriate offset into the formula.

Solution 3

This is now simple using dplyr. Also I would suggest using cut(breaks = "week") rather than format() to cut the dates into weeks.

library(dplyr)
df %>% group_by(week = cut(date, "week")) %>% mutate(weekly_income = sum(income))

Solution 4

I Googled "group week days into weeks R" and came across this SO question. You mention you have multiple years, so I think we need to keep up with both the week number and also the year, so I modified the answers there as so format(date, format = "%U%y")

In use it looks like this:

library(plyr) #for aggregating
df <- transform(df, weeknum = format(date, format = "%y%U"))
ddply(df, "weeknum", summarize, suminc = sum(income))
#----
  weeknum suminc
1    1152    413
2    1201    435
3    1202    379

See ?strptime for all the format abbreviations.

Solution 5

Try rollapply from the zoo package:

rollapply(df$income, width=7, FUN = sum, by = 7)
# [1] 487 387 443

Or, use period.sum from the xts package:

period.sum(xts(df$income, order.by=df$date), which(df$wday %in% 7))
#            [,1]
# 2011-01-01  487
# 2011-01-08  387
# 2011-01-15  443

Or, to get the output in the format you want:

data.frame(income = period.sum(xts(df$income, order.by=df$date), 
                               which(df$wday %in% 7)),
           week = df$week[which(df$wday %in% 7)])
#            income    week
# 2011-01-01    487 2011-00
# 2011-01-08    387 2011-01
# 2011-01-15    443 2011-02

Note that the first week shows as 2011-00 because that's how it is entered in your data. You could also use week = df$week[which(df$wday %in% 1)] which would match your output.

View more solutions

16,546

Author by

Tony Breyal

Updated on June 23, 2022

Comments

Tony Breyal almost 2 years

Let's say I have several years worth of data which look like the following

# load date package and set random seed
library(lubridate)
set.seed(42)

# create data.frame of dates and income
date <- seq(dmy("26-12-2010"), dmy("15-01-2011"), by = "days")
df <- data.frame(date = date, 
                 wday = wday(date),
                 wday.name = wday(date, label = TRUE, abbr = TRUE),
                 income = round(runif(21, 0, 100)),
                 week = format(date, format="%Y-%U"),
                 stringsAsFactors = FALSE)

#          date wday wday.name income    week
# 1  2010-12-26    1       Sun     91 2010-52
# 2  2010-12-27    2       Mon     94 2010-52
# 3  2010-12-28    3      Tues     29 2010-52
# 4  2010-12-29    4       Wed     83 2010-52
# 5  2010-12-30    5     Thurs     64 2010-52
# 6  2010-12-31    6       Fri     52 2010-52
# 7  2011-01-01    7       Sat     74 2011-00
# 8  2011-01-02    1       Sun     13 2011-01
# 9  2011-01-03    2       Mon     66 2011-01
# 10 2011-01-04    3      Tues     71 2011-01
# 11 2011-01-05    4       Wed     46 2011-01
# 12 2011-01-06    5     Thurs     72 2011-01
# 13 2011-01-07    6       Fri     93 2011-01
# 14 2011-01-08    7       Sat     26 2011-01
# 15 2011-01-09    1       Sun     46 2011-02
# 16 2011-01-10    2       Mon     94 2011-02
# 17 2011-01-11    3      Tues     98 2011-02
# 18 2011-01-12    4       Wed     12 2011-02
# 19 2011-01-13    5     Thurs     47 2011-02
# 20 2011-01-14    6       Fri     56 2011-02
# 21 2011-01-15    7       Sat     90 2011-02

I would like to sum 'income' for each week (Sunday thru Saturday). Currently I do the following:

Weekending 2011-01-01 = sum(df$income[1:7]) = 487
Weekending 2011-01-08 = sum(df$income[8:14]) = 387
Weekending 2011-01-15 = sum(df$income[15:21]) = 443

However I would like a more robust approach which will automatically sum by week. I can't work out how to automatically subset the data into weeks. Any help would be much appreciated.

Tony Breyal almost 12 years

Almost there, please see edit in my question for an issue I spotted with this approach :)
Tony Breyal almost 12 years

Almost there, please see edit in my question for an issue I spotted with this approach :)
Chase almost 12 years

@TonyBreyal - looks like Andrie beat me to the punch and has a more robust answer. Also, Dwin's answer in the question I linked above is along these lines as well. Cheers.
Tony Breyal almost 12 years

thank you kindly. I came up with my own solution based around further reading of ?format and plyr, as you suggested. I changed the data.frame in my question to better reflect what I was after and I think that is why your results are not quite what I was looking for (after applying your code to the new data) but using your answer I was able to get the solution I needed (I have added it above). Thank you very much, I couldn't have done it without reading your approach :)
Tony Breyal almost 12 years

thank you kindly for the link. I saw it before but I couldn't work out how it would apply to my requirement of allowing a week to cross the year boundary. I have the solution now though and appreciate your help :)