R remove rows from panel while keeping the panel balanced

16,495

Solution 1

One way to balance a panel is to remove individuals with incomplete data, another way is to fill in a value, such as NA or 0 for the missing observations. For the first approach, you can use complete.cases to find rows that have no NA in them. Then you can find all the PERSON with at least one missing case.

missing.at.least.one <- unique(unbal$PERSON[!complete.cases(unbal)])
unbal[!(unbal$PERSON %in% missing.at.least.one),]
#    PERSON YEAR  Y  X
# 1   Frank 2001 21  1
# 2   Frank 2002 22  2
# 3   Frank 2003 23  3
# 4   Frank 2004 24  4
# 5   Frank 2005 25  5
# 11 Edward 2001 31 11
# 12 Edward 2002 32 12
# 13 Edward 2003 33 13
# 14 Edward 2004 34 14
# 15 Edward 2005 35 15

Solution 2

So i'm not sure it satisfies the "elegant" requirement, but here's a general purpose function you can use to get balanced data.

balanced<-function(data, ID, TIME, VARS, required=c("all","shared")) {
    if(is.character(ID)) {
        ID <- match(ID, names(data))
    }
    if(is.character(TIME)) {
        TIME <- match(TIME, names(data))
    }
    if(missing(VARS)) { 
        VARS <- setdiff(1:ncol(data), c(ID,TIME))
    } else if (is.character(VARS)) {
        VARS <- match(VARS, names(data))
    }
    required <- match.arg(required)
    idf <- do.call(interaction, c(data[, ID, drop=FALSE], drop=TRUE))
    timef <- do.call(interaction, c(data[, TIME, drop=FALSE], drop=TRUE))
    complete <- complete.cases(data[, VARS])
    tbl <- table(idf[complete], timef[complete])
    if (required=="all") {
        keep <- which(rowSums(tbl==1)==ncol(tbl))
        idx <- as.numeric(idf) %in% keep
    } else if (required=="shared") {
        keep <- which(colSums(tbl==1)==nrow(tbl))
        idx <- as.numeric(timef) %in% keep
    }
    data[idx, ]
}

You can get your desired result with

balanced(unbal, "PERSON","YEAR")

#    PERSON YEAR  Y  X
# 1   Frank 2001 21  1
# 2   Frank 2002 22  2
# 3   Frank 2003 23  3
# 4   Frank 2004 24  4
# 5   Frank 2005 25  5
# 11 Edward 2001 31 11
# 12 Edward 2002 32 12
# 13 Edward 2003 33 13
# 14 Edward 2004 34 14
# 15 Edward 2005 35 15

The first parameter is the data.frame you wish to subset. The second parameter (ID=) is a character vector of column names that identify each "person" in the data set. Then the TIME= parameter is also a character vector specifying the different observation times for each ID. Finally, you can optionally specify a VARS= argument to specify which fields must be NA (defaults to all other than ID or TIME values). Finally, there is one last parameter named required which states whether each ID must have an observation for every TIME (default) or if you set it to "shared", it will only return the TIMES that all IDs have non-missing values for.

So for example

balanced(unbal, "PERSON","YEAR", "X")

#    PERSON YEAR  Y  X
# 1   Frank 2001 21  1
# 2   Frank 2002 22  2
# 3   Frank 2003 23  3
# 4   Frank 2004 24  4
# 5   Frank 2005 25  5
# 6    Tony 2001  5  6
# 7    Tony 2002  6  7
# 8    Tony 2003 NA  8
# 9    Tony 2004  7  9
# 10   Tony 2005  8 10
# 11 Edward 2001 31 11
# 12 Edward 2002 32 12
# 13 Edward 2003 33 13
# 14 Edward 2004 34 14
# 15 Edward 2005 35 15

only requires that "X" be NA for all PERSON/YEARS and since this is true for all records, no sub setting takes place.

If you do

balanced(unbal, "PERSON","YEAR", required="shared")

#    PERSON YEAR  Y  X
# 1   Frank 2001 21  1
# 2   Frank 2002 22  2
# 4   Frank 2004 24  4
# 5   Frank 2005 25  5
# 6    Tony 2001  5  6
# 7    Tony 2002  6  7
# 9    Tony 2004  7  9
# 10   Tony 2005  8 10
# 11 Edward 2001 31 11
# 12 Edward 2002 32 12
# 14 Edward 2004 34 14
# 15 Edward 2005 35 15

then you get the data for years 2001, 2002, 2004, 2005 for ALL persons since they all have data for those years.

Now let use create a slightly different sample data set

unbal2 <- unbal 
unbal2[15, 2] <- 2006
tail(unbal2)

#    PERSON YEAR  Y  X
# 10   Tony 2005  8 10
# 11 Edward 2001 31 11
# 12 Edward 2002 32 12
# 13 Edward 2003 33 13
# 14 Edward 2004 34 14
# 15 Edward 2006 35 15

Notice now that Edward is the only person that has a value for 2006. This means that

balanced(unbal2, "PERSON","YEAR")
# [1] PERSON YEAR   Y      X     
# <0 rows> (or 0-length row.names)

now returns nothing but

balanced(unbal2, "PERSON","YEAR", required="shared")

#    PERSON YEAR  Y  X
# 1   Frank 2001 21  1
# 2   Frank 2002 22  2
# 4   Frank 2004 24  4
# 6    Tony 2001  5  6
# 7    Tony 2002  6  7
# 9    Tony 2004  7  9
# 11 Edward 2001 31 11
# 12 Edward 2002 32 12
# 14 Edward 2004 34 14

will return the data for 2001,2002, 2004 since all persons have data for those years.

Solution 3

A solution I've used is to temporarily reshape the data frame into wide format with years as columns and units as rows, and then check for complete cases by row. This is easiest to do if you have a single variable of interest that--if missing--means the entire observation is missing.

I use the following libraries:

library(data.table)
library(reshape2)

First, take a subset of your main data frame (unbal) that is just, the ID variable ("NAME"), the time variable ("YEAR"), and a variable of interest ("X" or "Y").

df<- unbal[c("NAME", "YEAR", "X" )]

Second, reshape the new data frame to make it wide format. This makes a data frame in which each "NAME" is a single row, and "X" for each year is a column.

df <- dcast(df, NAME ~ YEAR, value.var = "X")

Third, run complete.cases for each row. Any NAME with missing data will be entirely removed.

df <- df[complete.cases(df),]

Fourth, reshape the data frame back into long format (by default, this gives your variables generic names, so you may want to change the names back to what they were before).

df <- melt(df, id.vars = "ID")
setnames(df, "variable", "YEAR")

NOTE: YEAR becomes a factor variable by default using the approach. If your YEAR variable is numeric, you'll want to change the variable accordingly. For example:

test4$year <- as.character(test4$year)
test4$year <- as.numeric(test4$year)

Fifth and sixth, take only the "NAME" and "YEAR" variables in the data frame you created, and then merge it with your original data frame (and be sure to drop cases in the original data frame that aren't found in the d data frame you created)

df <- df[c("NAME", "YEAR")]
balanced <- merge.data.frame(df, unbal, by = c("NAME", "YEAR"), all.x = TRUE)

Solution 4

This is the solution that I use - it makes use of the convenient features (including nice merging abilities) of the data.table package and assumes that your data is already a data.table object. It is relatively simple and hopefully easy to follow. It returns a balanced panel with entries for every unique combination of the "individuals" and the "time periods" - i.e. a panel where there is an observation for every individual in every time period.

library(data.table)
Balance_Panel = function(Data, Indiv_ColName, Time_ColName){
    Individuals = unique(Data[, get(Indiv_ColName)])
    Times = unique(Data[, get(Time_ColName)])

    Full_Panel = data.table(expand.grid(Individuals, Times))
    setnames(Full_Panel, c(Indiv_ColName, Time_ColName))
    setkeyv(Full_Panel, c(Indiv_ColName, Time_ColName))
    setkeyv(Data, c(Indiv_ColName, Time_ColName))
    return(Data[Full_Panel])
}

Sample Usage:

Balanced_Data = Balance_Panel(Data, "SubjectID", "ObservationTime")
Share:
16,495
user1491868
Author by

user1491868

Updated on June 24, 2022

Comments

  • user1491868
    user1491868 almost 2 years

    Is there an elegant way to balance an unbalanced panel data set? I would like to start with an unbalanced panel (ie, some individuals are missing some data) and end up with a balanced panel (ie, all individuals are missing no data). Below is some sample code. The correct end result is for all observations on 'Frank' and 'Edward' to remain and for all observations on 'Tony' to be removed since he has some missing data. Thank you.

    unbal <- data.frame(PERSON=c(rep('Frank',5),rep('Tony',5),rep('Edward',5)), YEAR=c(2001,2002,2003,2004,2005,2001,2002,2003,2004,2005,2001,2002,2003,2004,2005), Y=c(21,22,23,24,25,5,6,NA,7,8,31,32,33,34,35), X=c(1:15))
    unbal
    
  • nograpes
    nograpes over 9 years
    Very nice. I admit to not having understood what was meant by "balanced" until I read about it. This is a much nicer general solution.
  • Helix123
    Helix123 almost 8 years
    If one is interested in the other way round (make an unbalanced panel balanced by filling in NAs), one could use the function make.pbalanced from package plm (latest development version required from r-forge.r-project.org/R/?group_id=406)
  • Helix123
    Helix123 over 7 years
    The official CRAN release (1.6-4) of plm now has make.pbalanced incorporated (and by argument balance.type = c("fill", "shared") one can choose whether to extend the data or to reduce them.