Interpolate NA values

22,541

Solution 1

Using the zoo package:

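library(zoo)
Cz <- zoo(C)                # convert the merged data to a zoo object (default index 1:12)
index(Cz) <- Cz[,1]         # replace the default index with the time column
Cz_approx <- na.approx(Cz)  # linearly interpolate the NAs in each column along the index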
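
As a cross-check that does not depend on zoo, the same linear interpolation can be sketched in base R with approx(). This is only an illustration, not the answerer's method: the helper name interp_col is hypothetical, and the data is rebuilt from the question's example so the snippet runs on its own.

```r
# Rebuild the example data from the question.
A <- cbind(time = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
           Avalue = c(1, 2, 3, 2, 1, 2, 3, 2, 1, 2))
B <- cbind(time = c(15, 30, 45, 60), Bvalue = c(100, 200, 300, 400))
C <- merge(A, B, all = TRUE)

# Hypothetical helper (not part of zoo): linearly interpolate y at every x,
# using only the rows where y was actually observed. With the default
# rule = 1, approx() returns NA outside the observed range, so nothing is
# extrapolated before the first or after the last sample.
interp_col <- function(x, y) {
  ok <- !is.na(y)
  approx(x[ok], y[ok], xout = x)$y
}

C_approx <- C
C_approx$Avalue <- interp_col(C$time, C$Avalue)
C_approx$Bvalue <- interp_col(C$time, C$Bvalue)
print(C_approx)
```

With this sketch, Bvalue at time 20 agrees with the hand calculation 100 + (20 - 15) * (200 - 100) / (30 - 15), while Bvalue at time 10 and from time 70 onward stays NA because those times lie outside the range of the B samples.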

Solution 2

The statistically proper way to do this and still get valid confidence intervals is to use Multiple Imputation. See Rubin's classic book; there is also an excellent R package for this (mi).

Author by

hlovdal

Linux user since 1994. Main programming language: C. #SOreadytohelp (http://stackoverflow.com/10m)

Updated on February 10, 2020

Comments

  • hlovdal
    hlovdal over 4 years

    I have two sets of samples that are time independent. I would like to merge them and calculate the missing values for the times where I do not have values for both. Simplified example:

    A <- cbind(time=c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
               Avalue=c(1, 2, 3, 2, 1, 2, 3, 2, 1, 2))
    B <- cbind(time=c(15, 30, 45, 60), Bvalue=c(100, 200, 300, 400))
    C <- merge(A,B, all=TRUE)
    
       time Avalue Bvalue
    1    10      1     NA
    2    15     NA    100
    3    20      2     NA
    4    30      3    200
    5    40      2     NA
    6    45     NA    300
    7    50      1     NA
    8    60      2    400
    9    70      3     NA
    10   80      2     NA
    11   90      1     NA
    12  100      2     NA
    

    By assuming linear change between each sample, it is possible to calculate the missing NA values. Intuitively it is easy to see that the A value at times 15 and 45 should be 1.5. A proper calculation for B, for instance at time 20, would be

    100 + (20 - 15) * (200 - 100) / (30 - 15)

    which equals 133.33333. The first parenthesis is the time between the last available sample and the time being estimated, the second is the difference between the values of the two nearest samples, and the third is the time between those two samples.

    How can I use R to calculate the NA values?

  • hlovdal
    hlovdal over 12 years
    Fantastic. I do not quite understand what the index(Cz) <- Cz[,1] statement is doing. Care to explain?
  • Anatoliy
    Anatoliy over 12 years
    By default, the na.approx() function uses index(obj) as the points between which to interpolate each column of the data frame. The default index is 1:12, so I replaced it with the actual time measurements using index(). However, if you would like to preserve the default index, you can invoke na.approx(Cz, x=Cz$time).
  • Carl Witthoft
    Carl Witthoft over 12 years
    library(zoo); ?index "Description: Generic functions for extracting the index of an object and replacing it." You're manipulating parts of a zoo object. Always a good idea to RTFM before asking questions.
  • G. Grothendieck
    G. Grothendieck over 12 years
    Note that converting the data frame to zoo could also be written as Cz <- read.zoo(C) which automatically assumes the first column holds the times. Also zoo's na.approx has a default method that works on ordinary vectors so even without converting C to zoo we could do this: C$Bvalue <- na.approx(C$Bvalue, C$time, na.rm = FALSE).
  • Roman Luštrik
    Roman Luštrik over 12 years
    Care to provide a citation for the Rubin paper?
  • Ari B. Friedman
    Ari B. Friedman over 12 years
    Can't find the paper. His book is classic as well; if I find the paper I'm thinking of later I'll edit further.
  • puslet88
    puslet88 about 9 years
    Might consider adding a na.fill(na.approx(Cz), "extend") around that command too, so leading and trailing NAs wouldn't cause extra difficulties.
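
The two suggestions in the comments above (the vector method of na.approx with na.rm = FALSE, and na.fill with "extend") can be combined roughly as follows. This is a sketch under stated assumptions: the vectors times and Bvalue are typed in from the merged table in the question rather than taken from C, and the variable names are illustrative only.

```r
library(zoo)

# Time and Bvalue columns of the merged table, entered by hand (assumption:
# same values as the question's example).
times  <- c(10, 15, 20, 30, 40, 45, 50, 60, 70, 80, 90, 100)
Bvalue <- c(NA, 100, NA, 200, NA, 300, NA, 400, NA, NA, NA, NA)

# Vector method: interpolate against the time values. na.rm = FALSE keeps
# the result the same length, so leading and trailing NAs stay in place.
B_interp <- na.approx(Bvalue, times, na.rm = FALSE)

# Fill the NAs that remain at the ends by repeating the nearest observed value.
Bz <- zoo(B_interp, order.by = times)
B_filled <- na.fill(Bz, "extend")
print(B_filled)
```

Whether repeating the first and last observed values is appropriate depends on the data; leaving those entries as NA, as na.approx does on its own, is the more conservative choice.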