Create a panel data frame

13,741

Solution 1

1) reshape2 Create a grid g of all years and id values crossed and rbind it with frame.

Then using the reshape2 package cast frame from long to wide form and then melt it back to long form. Finally rearrange the rows and columns as desired.

The lines ending in one # are only to ensure that every year is present so if we knew that were the case those lines could be omitted. The line ending in ## is only to rearrange the rows and columns so if that did not matter that line could be omitted too.

library(reshape2)

g <- with(frame, expand.grid(year = seq(min(year), max(year)), id = unique(id), y = 0)) #
frame <- rbind(frame, g) #

wide <- dcast(frame, year ~ id, fill = 0, fun = sum, value.var = "y")
long <- melt(wide, id = "year", variable.name = "id", value.name = "y")

long <- long[order(long$id, long$year), c("id", "year", "y")] ##

giving:

> long
   id year y
1   1 2005 1
2   1 2006 0
3   1 2007 0
4   1 2008 0
5   2 2005 0
6   2 2006 0
7   2 2007 0
8   2 2008 0
9   3 2005 0
10  3 2006 0
11  3 2007 0
12  3 2008 0
13  4 2005 0
14  4 2006 0
15  4 2007 1
16  4 2008 0

2) aggregate A shorter solution would be to run just the two lines that end with # above and then follow those with an aggregate as shown. This solution uses no addon packages.

g <- with(frame, expand.grid(year = seq(min(year), max(year)), id = unique(id), y = 0)) #
frame <- rbind(frame, g) # 

aggregate(y ~ year + id, frame, sum)[c("id", "year", "y")]

This gives the same answer as solution (1) except as noted by a commenter solution (1) above makes id a factor whereas it is not in this solution.

Solution 2

Using data.table:

require(data.table)
DT <- data.table(frame, key=c("id", "year"))
comb <- CJ(1:4, 2005:2008) # like 'expand.grid', but faster + sets key
ans <- DT[comb][is.na(y), y:=0L] # perform a join (DT[comb]), then set NAs to 0
#     id year y
#  1:  1 2005 1
#  2:  1 2006 0
#  3:  1 2007 0
#  4:  1 2008 0
#  5:  2 2005 0
#  6:  2 2006 0
#  7:  2 2007 0
#  8:  2 2008 0
#  9:  3 2005 0
# 10:  3 2006 0
# 11:  3 2007 0
# 12:  3 2008 0
# 13:  4 2005 0
# 14:  4 2006 0
# 15:  4 2007 1
# 16:  4 2008 0
Share:
13,741

Related videos on Youtube

coding_heart
Author by

coding_heart

Updated on July 04, 2022

Comments

  • coding_heart
    coding_heart almost 2 years

    I would like to create a panel from a dataset that has one observation for every given time period such that every unit has a new observation for every time period. Using the following example:

    id <- seq(1:4)
    year <- c(2005, 2008, 2008, 2007)
    y <- c(1,0,0,1)
    frame <- data.frame(id, year, y)
    frame
    
     id year y
    1  1 2005 1
    2  2 2008 0
    3  3 2008 0
    4  4 2007 1
    

    For each unique ID, I would like there to be a unique observation for the year 2005, 2006, 2007, and 2008 (the lower and upper time periods on this frame), and set the outcome y to 0 for all the times in which there isn't an existing observation, such that the new frame looks like:

     id year y
    1  1 2005 1
    2  1 2006 0
    3  1 2007 0
    4  1 2008 0
    ....
    13  4 2005 0
    14  4 2006 0
    15  4 2007 1
    16  4 2008 0
    

    I haven't had much success with loops; Any and all thoughts would be greatly appreciated.

  • Arun
    Arun over 10 years
    Any particular reason for reshaping here? Wouldn't this suffice? merge(frame, g, by=c("id", "year"), all=TRUE) (where g doesn't have y=0)
  • G. Grothendieck
    G. Grothendieck over 10 years
    Not if you want to be able to omit the computation of g if it were known that all years were present.
  • Arun
    Arun over 10 years
    I don't think I get it, sorry. Where do you eliminate computation of g in your case when it's known all years are present?
  • G. Grothendieck
    G. Grothendieck over 10 years
    In the first solution just omit the lines that end in # as stated in the answser and you will see it still gives the same result provided all years are present in the input frame.
  • Arun
    Arun over 10 years
    got it. Note that id will be a factor here.