Replacing NAs with latest non-NA value

119,961

Solution 1

You probably want to use the na.locf() function from the zoo package to carry the last observation forward to replace your NA values.

Here is the beginning of its usage example from the help page:

library(zoo)

az <- zoo(1:6)

bz <- zoo(c(2,NA,1,4,5,2))

na.locf(bz)
1 2 3 4 5 6 
2 2 1 4 5 2 

na.locf(bz, fromLast = TRUE)
1 2 3 4 5 6 
2 1 1 4 5 2 

cz <- zoo(c(NA,9,3,2,3,2))

na.locf(cz)
2 3 4 5 6 
9 3 2 3 2 

Solution 2

Sorry for digging up an old question. I couldn't look up the function to do this job on the train, so I wrote one myself.

I was proud to find out that it's a tiny bit faster.
It's less flexible though.

But it plays nice with ave, which is what I needed.

repeat.before = function(x) {   # repeats the last non NA value. Keeps leading NA
    ind = which(!is.na(x))      # get positions of nonmissing values
    if(is.na(x[1]))             # if it begins with a missing, add the 
          ind = c(1,ind)        # first position to the indices
    rep(x[ind], times = diff(   # repeat the values at these indices
       c(ind, length(x) + 1) )) # diffing the indices + length yields how often 
}                               # they need to be repeated

x = c(NA,NA,'a',NA,NA,NA,NA,NA,NA,NA,NA,'b','c','d',NA,NA,NA,NA,NA,'e')  
xx = rep(x, 1000000)  
system.time({ yzoo = na.locf(xx,na.rm=F)})  
## user  system elapsed   
## 2.754   0.667   3.406   
system.time({ yrep = repeat.before(xx)})  
## user  system elapsed   
## 0.597   0.199   0.793   

Edit

As this became my most upvoted answer, I was reminded often that I don't use my own function, because I often need zoo's maxgap argument. Because zoo has some weird problems in edge cases when I use dplyr + dates that I couldn't debug, I came back to this today to improve my old function.

I benchmarked my improved function and all the other entries here. For the basic set of features, tidyr::fill is fastest while also not failing the edge cases. The Rcpp entry by @BrandonBertelsen is faster still, but it's inflexible regarding the input's type (he tested edge cases incorrectly due to a misunderstanding of all.equal).

If you need maxgap, my function below is faster than zoo (and doesn't have the weird problems with dates).

I put up the documentation of my tests.

new function

repeat_last = function(x, forward = TRUE, maxgap = Inf, na.rm = FALSE) {
    if (!forward) x = rev(x)           # reverse x twice if carrying backward
    ind = which(!is.na(x))             # get positions of nonmissing values
    if (is.na(x[1]) && !na.rm)         # if it begins with NA
        ind = c(1,ind)                 # add first pos
    rep_times = diff(                  # diffing the indices + length yields how often
        c(ind, length(x) + 1) )          # they need to be repeated
    if (maxgap < Inf) {
        exceed = rep_times - 1 > maxgap  # exceeding maxgap
        if (any(exceed)) {               # any exceed?
            ind = sort(c(ind[exceed] + 1, ind))      # add NA in gaps
            rep_times = diff(c(ind, length(x) + 1) ) # diff again
        }
    }
    x = rep(x[ind], times = rep_times) # repeat the values at these indices
    if (!forward) x = rev(x)           # second reversion
    x
}

I've also put the function in my formr package (Github only).

Solution 3

a data.table solution:

dt <- data.table(y = c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA))
dt[, y_forward_fill := y[1], .(cumsum(!is.na(y)))]
dt
     y y_forward_fill
 1: NA             NA
 2:  2              2
 3:  2              2
 4: NA              2
 5: NA              2
 6:  3              3
 7: NA              3
 8:  4              4
 9: NA              4
10: NA              4

this approach could work with forward filling zeros as well:

dt <- data.table(y = c(0, 2, -2, 0, 0, 3, 0, -4, 0, 0))
dt[, y_forward_fill := y[1], .(cumsum(y != 0))]
dt
     y y_forward_fill
 1:  0              0
 2:  2              2
 3: -2             -2
 4:  0             -2
 5:  0             -2
 6:  3              3
 7:  0              3
 8: -4             -4
 9:  0             -4
10:  0             -4

this method becomes very useful on data at scale and where you would want to perform a forward fill by group(s), which is trivial with data.table. just add the group(s) to the by clause prior to the cumsum logic.

dt <- data.table(group = sample(c('a', 'b'), 20, replace = TRUE), y = sample(c(1:4, rep(NA, 4)), 20 , replace = TRUE))
dt <- dt[order(group)]
dt[, y_forward_fill := y[1], .(group, cumsum(!is.na(y)))]
dt
    group  y y_forward_fill
 1:     a NA             NA
 2:     a NA             NA
 3:     a NA             NA
 4:     a  2              2
 5:     a NA              2
 6:     a  1              1
 7:     a NA              1
 8:     a  3              3
 9:     a NA              3
10:     a NA              3
11:     a  4              4
12:     a NA              4
13:     a  1              1
14:     a  4              4
15:     a NA              4
16:     a  3              3
17:     b  4              4
18:     b NA              4
19:     b NA              4
20:     b  2              2

Solution 4

You can use the data.table function nafill, available from data.table >= 1.12.3.

library(data.table)
nafill(y, type = "locf")
# [1] NA  2  2  2  2  3  3  4  4  4

If your vector is a column in a data.table, you can also update it by reference with setnafill:

d <- data.table(x = 1:10, y)
setnafill(d, type = "locf", cols = "y")
d
#      x  y
#  1:  1 NA
#  2:  2  2
#  3:  3  2
#  4:  4  2
#  5:  5  2
#  6:  6  3
#  7:  7  3
#  8:  8  4
#  9:  9  4
# 10: 10  4

If you have NA in several columns...

d <- data.table(x = c(1, NA, 2), y = c(2, 3, NA), z = c(4, NA, 5))
#     x  y  z
# 1:  1  2  4
# 2: NA  3 NA
# 3:  2 NA  5

...you can fill them by reference in one go:

setnafill(d, type = "locf")
d
#    x y z
# 1: 1 2 4
# 2: 1 3 4
# 3: 2 3 5

Note that:

Only double and integer data types are currently [data.table 1.12.6] supported.

The functionality will most likely soon be extended; see the open issue nafill, setnafill for character, factor and other types, where you also find a temporary workaround.

Solution 5

The tidyr package (part of the tidyverse suite of packages) has a simple way to do that:

y = c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)

# first, transform it into a data.frame

df = as.data.frame(y)
   y
1  NA
2   2
3   2
4  NA
5  NA
6   3
7  NA
8   4
9  NA
10 NA

library(tidyr)
fill(df, y, .direction = 'down')
    y
1  NA
2   2
3   2
4   2
5   2
6   3
7   3
8   4
9   4
10  4
Share:
119,961

Related videos on Youtube

Ryogi
Author by

Ryogi

Updated on April 09, 2022

Comments

  • Ryogi
    Ryogi about 2 years

    In a data.frame (or data.table), I would like to "fill forward" NAs with the closest previous non-NA value. A simple example, using vectors (instead of a data.frame) is the following:

    > y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)
    

    I would like a function fill.NAs() that allows me to construct yy such that:

    > yy
    [1] NA NA NA  2  2  2  2  3  3  3  4  4
    

    I need to repeat this operation for many (total ~1 Tb) small sized data.frames (~30-50 Mb), where a row is NA is all its entries are. What is a good way to approach the problem?

    The ugly solution I cooked up uses this function:

    last <- function (x){
        x[length(x)]
    }    
    
    fill.NAs <- function(isNA){
    if (isNA[1] == 1) {
        isNA[1:max({which(isNA==0)[1]-1},1)] <- 0 # first is NAs 
                                                  # can't be forward filled
    }
    isNA.neg <- isNA.pos <- isNA.diff <- diff(isNA)
    isNA.pos[isNA.diff < 0] <- 0
    isNA.neg[isNA.diff > 0] <- 0
    which.isNA.neg <- which(as.logical(isNA.neg))
    if (length(which.isNA.neg)==0) return(NULL) # generates warnings later, but works
    which.isNA.pos <- which(as.logical(isNA.pos))
    which.isNA <- which(as.logical(isNA))
    if (length(which.isNA.neg)==length(which.isNA.pos)){
        replacement <- rep(which.isNA.pos[2:length(which.isNA.neg)], 
                                    which.isNA.neg[2:max(length(which.isNA.neg)-1,2)] - 
                                    which.isNA.pos[1:max(length(which.isNA.neg)-1,1)])      
        replacement <- c(replacement, rep(last(which.isNA.pos), last(which.isNA) - last(which.isNA.pos)))
    } else {
        replacement <- rep(which.isNA.pos[1:length(which.isNA.neg)], which.isNA.neg - which.isNA.pos[1:length(which.isNA.neg)])     
        replacement <- c(replacement, rep(last(which.isNA.pos), last(which.isNA) - last(which.isNA.pos)))
    }
    replacement
    }
    

    The function fill.NAs is used as follows:

    y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)
    isNA <- as.numeric(is.na(y))
    replacement <- fill.NAs(isNA)
    if (length(replacement)){
    which.isNA <- which(as.logical(isNA))
    to.replace <- which.isNA[which(isNA==0)[1]:length(which.isNA)]
    y[to.replace] <- y[replacement]
    } 
    

    Output

    > y
    [1] NA  2  2  2  2  3  3  3  4  4  4
    

    ... which seems to work. But, man, is it ugly! Any suggestions?

    • Matt Dowle
      Matt Dowle over 12 years
      From other questions since this one, I think you've now found roll=TRUE in data.table.
    • Saksham
      Saksham over 8 years
      A new method is being introduced as fill in R
    • zx8754
      zx8754 about 8 years
      Also, look into tidyr::fill().
    • Michael Ohlrogge
      Michael Ohlrogge over 5 years
  • Zhubarb
    Zhubarb over 9 years
    +1, but I am guessing this needs to be looped per column if you want to apply this to a df with multiple columns?
  • G. Grothendieck
    G. Grothendieck over 7 years
    Also note that na.locf in zoo works with ordinary vectors as well as zoo objects. Its na.rm argument can be useful in some applications.
  • xclotet
    xclotet over 7 years
    An lapply can be added so it can directly apply it to multiple NA columns: replaceNaWithLatest <- function( dfIn, nameColsNa = names(dfIn)[1] ){ dtTest <- data.table(dfIn) invisible(lapply(nameColsNa, function(nameColNa){ setnames(dtTest, nameColNa, "colNa") dtTest[, segment := cumsum(!is.na(colNa))] dtTest[, colNa := colNa[1], by = "segment"] dtTest[, segment := NULL] setnames(dtTest, "colNa", nameColNa) })) return(dtTest) }
  • Ruben
    Ruben over 7 years
    This function doesn't do what you expect when there are leading NAs. replace_na_with_last(c(NA,1:4,NA)) (i.e. they're filled with the following value). This is also the default behaviour of imputeTS::na.locf(x, na.remaining = "rev").
  • Achim Zeileis
    Achim Zeileis about 7 years
    @Ruben Thanks again for your report. By now the bug is fixed on R-Forge. Also I have tweaked and exported the workhorse function na.locf0 which is now similar in scope and performance to your repeat_last function. The clue was to use diff rather than cumsum and avoid ifelse. The main na.locf.default function is still somewhat slower because it does some more checks and handles multiple columns etc.
  • Steffen Moritz
    Steffen Moritz about 7 years
    int n = x.size() and for(int i = 0; i<n; i++) should be replaced by double. In R an vector can be larger than c++ int size.
  • Evan Cortens
    Evan Cortens about 7 years
    It looks like this function returns "R_xlen_t". If R is compiled with long vector support, this is defined as ptrdiff_t; if it isn't, it's an int. Thanks for the correction!
  • Gregor Thomas
    Gregor Thomas about 7 years
    This doesn't work if there are multiple consecutive missing values - 1 NA NA turns into 1 1 NA. Also, I think the as.array() is unnecessary.
  • ABT
    ABT about 6 years
    this is rather elegant. Not sure if it works in all cases but it sure worked for me!
  • BallpointBen
    BallpointBen almost 6 years
    Use na.locf(cz, na.rm=FALSE) to keep leading NA.
  • Artem Klevtsov
    Artem Klevtsov almost 6 years
    To improve it you can add this: if (!anyNA(x)) return(x).
  • Hack-R
    Hack-R almost 6 years
    At first I was excited by this solution, but it's actually not doing the same thing at all. The question is about filling in 1 data set with another. This answer is just imputation.
  • Nick Nassuphis
    Nick Nassuphis over 4 years
    better to add a default for this case, slightly different approach: replace_na_with_last<-function(x,p=is.na,d=0)c(d,x)[cummax(s‌​eq_along(x)*(!p(x)))‌​+1]
  • Ben
    Ben about 4 years
    @BallpointBen 's comment is important and should be included in the answer. Thanks!
  • JCWong
    JCWong about 4 years
    The ability to do this by groups is awesome!
  • Kim
    Kim almost 4 years
    @NickNassuphis 's answer is short, sweet, not package-dependent, and works well with dplyr pipes!
  • Desmond
    Desmond over 3 years
    I'm familiar with tidyverse but new to data.table - can I ask you what this does? dt[, y_forward_fill := y[1], .(cumsum(!is.na(y)))] Specifically, y[1] and why .(cumsum(!is.na(y))) forward fills the NAs?
  • Tony DiFranco
    Tony DiFranco over 3 years
    data.table syntax is predicated on the form of dt[i, j, by]. The intro vignette is very good. It does takes some getting used to if you are coming from the pipe world.
  • AnilGoyal
    AnilGoyal about 3 years
    The downside of this function is that, first the atomic vector has to be created as.data.frame() and that the output is also a data.frame instead of atomic vector
  • Jantje Houten
    Jantje Houten almost 3 years
    HI @TonyDiFranco, how would you suggest someone implement this if the intention is to fill backwards?
  • Tony DiFranco
    Tony DiFranco almost 3 years
    @JantjeHouten the simplest, though not most efficient, approach would be to reverse the sort order of the data.table, perform a forward fill as indicated, and then reverse once more back to the original order
  • Julien
    Julien almost 2 years
    @AnilGoyal It's an upside for my case