Replacing NAs with latest non-NA value
Solution 1
You probably want to use the na.locf()
function from the zoo package to carry the last observation forward to replace your NA values.
Here is the beginning of its usage example from the help page:
library(zoo)
az <- zoo(1:6)
bz <- zoo(c(2,NA,1,4,5,2))
na.locf(bz)
1 2 3 4 5 6
2 2 1 4 5 2
na.locf(bz, fromLast = TRUE)
1 2 3 4 5 6
2 1 1 4 5 2
cz <- zoo(c(NA,9,3,2,3,2))
na.locf(cz)
2 3 4 5 6
9 3 2 3 2
Solution 2
Sorry for digging up an old question. I couldn't look up the function to do this job on the train, so I wrote one myself.
I was proud to find out that it's a tiny bit faster.
It's less flexible though.
But it plays nice with ave
, which is what I needed.
repeat.before = function(x) { # repeats the last non NA value. Keeps leading NA
ind = which(!is.na(x)) # get positions of nonmissing values
if(is.na(x[1])) # if it begins with a missing, add the
ind = c(1,ind) # first position to the indices
rep(x[ind], times = diff( # repeat the values at these indices
c(ind, length(x) + 1) )) # diffing the indices + length yields how often
} # they need to be repeated
x = c(NA,NA,'a',NA,NA,NA,NA,NA,NA,NA,NA,'b','c','d',NA,NA,NA,NA,NA,'e')
xx = rep(x, 1000000)
system.time({ yzoo = na.locf(xx,na.rm=F)})
## user system elapsed
## 2.754 0.667 3.406
system.time({ yrep = repeat.before(xx)})
## user system elapsed
## 0.597 0.199 0.793
Edit
As this became my most upvoted answer, I was reminded often that I don't use my own function, because I often need zoo's maxgap
argument. Because zoo has some weird problems in edge cases when I use dplyr + dates that I couldn't debug, I came back to this today to improve my old function.
I benchmarked my improved function and all the other entries here. For the basic set of features, tidyr::fill
is fastest while also not failing the edge cases. The Rcpp entry by @BrandonBertelsen is faster still, but it's inflexible regarding the input's type (he tested edge cases incorrectly due to a misunderstanding of all.equal
).
If you need maxgap
, my function below is faster than zoo (and doesn't have the weird problems with dates).
I put up the documentation of my tests.
new function
repeat_last = function(x, forward = TRUE, maxgap = Inf, na.rm = FALSE) {
if (!forward) x = rev(x) # reverse x twice if carrying backward
ind = which(!is.na(x)) # get positions of nonmissing values
if (is.na(x[1]) && !na.rm) # if it begins with NA
ind = c(1,ind) # add first pos
rep_times = diff( # diffing the indices + length yields how often
c(ind, length(x) + 1) ) # they need to be repeated
if (maxgap < Inf) {
exceed = rep_times - 1 > maxgap # exceeding maxgap
if (any(exceed)) { # any exceed?
ind = sort(c(ind[exceed] + 1, ind)) # add NA in gaps
rep_times = diff(c(ind, length(x) + 1) ) # diff again
}
}
x = rep(x[ind], times = rep_times) # repeat the values at these indices
if (!forward) x = rev(x) # second reversion
x
}
I've also put the function in my formr package (Github only).
Solution 3
a data.table
solution:
dt <- data.table(y = c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA))
dt[, y_forward_fill := y[1], .(cumsum(!is.na(y)))]
dt
y y_forward_fill
1: NA NA
2: 2 2
3: 2 2
4: NA 2
5: NA 2
6: 3 3
7: NA 3
8: 4 4
9: NA 4
10: NA 4
this approach could work with forward filling zeros as well:
dt <- data.table(y = c(0, 2, -2, 0, 0, 3, 0, -4, 0, 0))
dt[, y_forward_fill := y[1], .(cumsum(y != 0))]
dt
y y_forward_fill
1: 0 0
2: 2 2
3: -2 -2
4: 0 -2
5: 0 -2
6: 3 3
7: 0 3
8: -4 -4
9: 0 -4
10: 0 -4
this method becomes very useful on data at scale and where you would want to perform a forward fill by group(s), which is trivial with data.table
. just add the group(s) to the by
clause prior to the cumsum
logic.
dt <- data.table(group = sample(c('a', 'b'), 20, replace = TRUE), y = sample(c(1:4, rep(NA, 4)), 20 , replace = TRUE))
dt <- dt[order(group)]
dt[, y_forward_fill := y[1], .(group, cumsum(!is.na(y)))]
dt
group y y_forward_fill
1: a NA NA
2: a NA NA
3: a NA NA
4: a 2 2
5: a NA 2
6: a 1 1
7: a NA 1
8: a 3 3
9: a NA 3
10: a NA 3
11: a 4 4
12: a NA 4
13: a 1 1
14: a 4 4
15: a NA 4
16: a 3 3
17: b 4 4
18: b NA 4
19: b NA 4
20: b 2 2
Solution 4
You can use the data.table
function nafill
, available from data.table >= 1.12.3
.
library(data.table)
nafill(y, type = "locf")
# [1] NA 2 2 2 2 3 3 4 4 4
If your vector is a column in a data.table
, you can also update it by reference with setnafill
:
d <- data.table(x = 1:10, y)
setnafill(d, type = "locf", cols = "y")
d
# x y
# 1: 1 NA
# 2: 2 2
# 3: 3 2
# 4: 4 2
# 5: 5 2
# 6: 6 3
# 7: 7 3
# 8: 8 4
# 9: 9 4
# 10: 10 4
If you have NA
in several columns...
d <- data.table(x = c(1, NA, 2), y = c(2, 3, NA), z = c(4, NA, 5))
# x y z
# 1: 1 2 4
# 2: NA 3 NA
# 3: 2 NA 5
...you can fill them by reference in one go:
setnafill(d, type = "locf")
d
# x y z
# 1: 1 2 4
# 2: 1 3 4
# 3: 2 3 5
Note that:
Only double and integer data types are currently [
data.table 1.12.6
] supported.
The functionality will most likely soon be extended; see the open issue nafill, setnafill for character, factor and other types, where you also find a temporary workaround.
Solution 5
The tidyr
package (part of the tidyverse
suite of packages) has a simple way to do that:
y = c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)
# first, transform it into a data.frame
df = as.data.frame(y)
y
1 NA
2 2
3 2
4 NA
5 NA
6 3
7 NA
8 4
9 NA
10 NA
library(tidyr)
fill(df, y, .direction = 'down')
y
1 NA
2 2
3 2
4 2
5 2
6 3
7 3
8 4
9 4
10 4
Related videos on Youtube
Ryogi
Updated on April 09, 2022Comments
-
Ryogi about 2 years
In a
data.frame
(ordata.table
), I would like to "fill forward" NAs with the closest previous non-NA value. A simple example, using vectors (instead of adata.frame
) is the following:> y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)
I would like a function
fill.NAs()
that allows me to constructyy
such that:> yy [1] NA NA NA 2 2 2 2 3 3 3 4 4
I need to repeat this operation for many (total ~1 Tb) small sized
data.frame
s (~30-50 Mb), where a row is NA is all its entries are. What is a good way to approach the problem?The ugly solution I cooked up uses this function:
last <- function (x){ x[length(x)] } fill.NAs <- function(isNA){ if (isNA[1] == 1) { isNA[1:max({which(isNA==0)[1]-1},1)] <- 0 # first is NAs # can't be forward filled } isNA.neg <- isNA.pos <- isNA.diff <- diff(isNA) isNA.pos[isNA.diff < 0] <- 0 isNA.neg[isNA.diff > 0] <- 0 which.isNA.neg <- which(as.logical(isNA.neg)) if (length(which.isNA.neg)==0) return(NULL) # generates warnings later, but works which.isNA.pos <- which(as.logical(isNA.pos)) which.isNA <- which(as.logical(isNA)) if (length(which.isNA.neg)==length(which.isNA.pos)){ replacement <- rep(which.isNA.pos[2:length(which.isNA.neg)], which.isNA.neg[2:max(length(which.isNA.neg)-1,2)] - which.isNA.pos[1:max(length(which.isNA.neg)-1,1)]) replacement <- c(replacement, rep(last(which.isNA.pos), last(which.isNA) - last(which.isNA.pos))) } else { replacement <- rep(which.isNA.pos[1:length(which.isNA.neg)], which.isNA.neg - which.isNA.pos[1:length(which.isNA.neg)]) replacement <- c(replacement, rep(last(which.isNA.pos), last(which.isNA) - last(which.isNA.pos))) } replacement }
The function
fill.NAs
is used as follows:y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA) isNA <- as.numeric(is.na(y)) replacement <- fill.NAs(isNA) if (length(replacement)){ which.isNA <- which(as.logical(isNA)) to.replace <- which.isNA[which(isNA==0)[1]:length(which.isNA)] y[to.replace] <- y[replacement] }
Output
> y [1] NA 2 2 2 2 3 3 3 4 4 4
... which seems to work. But, man, is it ugly! Any suggestions?
-
Matt Dowle over 12 yearsFrom other questions since this one, I think you've now found
roll=TRUE
indata.table
. -
Saksham over 8 yearsA new method is being introduced as
fill
inR
-
zx8754 about 8 yearsAlso, look into
tidyr::fill()
. -
Michael Ohlrogge over 5 yearsSee also: stackoverflow.com/questions/12607465/…
-
-
Zhubarb over 9 years+1, but I am guessing this needs to be looped per column if you want to apply this to a
df
with multiple columns? -
G. Grothendieck over 7 yearsAlso note that
na.locf
in zoo works with ordinary vectors as well as zoo objects. Itsna.rm
argument can be useful in some applications. -
xclotet over 7 yearsAn lapply can be added so it can directly apply it to multiple NA columns:
replaceNaWithLatest <- function( dfIn, nameColsNa = names(dfIn)[1] ){ dtTest <- data.table(dfIn) invisible(lapply(nameColsNa, function(nameColNa){ setnames(dtTest, nameColNa, "colNa") dtTest[, segment := cumsum(!is.na(colNa))] dtTest[, colNa := colNa[1], by = "segment"] dtTest[, segment := NULL] setnames(dtTest, "colNa", nameColNa) })) return(dtTest) }
-
Ruben over 7 yearsThis function doesn't do what you expect when there are leading NAs.
replace_na_with_last(c(NA,1:4,NA))
(i.e. they're filled with the following value). This is also the default behaviour ofimputeTS::na.locf(x, na.remaining = "rev")
. -
Achim Zeileis about 7 years@Ruben Thanks again for your report. By now the bug is fixed on R-Forge. Also I have tweaked and exported the workhorse function
na.locf0
which is now similar in scope and performance to yourrepeat_last
function. The clue was to usediff
rather thancumsum
and avoidifelse
. The mainna.locf.default
function is still somewhat slower because it does some more checks and handles multiple columns etc. -
Steffen Moritz about 7 yearsint n = x.size() and for(int i = 0; i<n; i++) should be replaced by double. In R an vector can be larger than c++ int size.
-
Evan Cortens about 7 yearsIt looks like this function returns "R_xlen_t". If R is compiled with long vector support, this is defined as ptrdiff_t; if it isn't, it's an int. Thanks for the correction!
-
Gregor Thomas about 7 yearsThis doesn't work if there are multiple consecutive missing values -
1 NA NA
turns into1 1 NA
. Also, I think theas.array()
is unnecessary. -
ABT about 6 yearsthis is rather elegant. Not sure if it works in all cases but it sure worked for me!
-
BallpointBen almost 6 yearsUse
na.locf(cz, na.rm=FALSE)
to keep leadingNA
. -
Artem Klevtsov almost 6 yearsTo improve it you can add this:
if (!anyNA(x)) return(x)
. -
Hack-R almost 6 yearsAt first I was excited by this solution, but it's actually not doing the same thing at all. The question is about filling in 1 data set with another. This answer is just imputation.
-
Nick Nassuphis over 4 yearsbetter to add a default for this case, slightly different approach:
replace_na_with_last<-function(x,p=is.na,d=0)c(d,x)[cummax(seq_along(x)*(!p(x)))+1]
-
Ben about 4 years@BallpointBen 's comment is important and should be included in the answer. Thanks!
-
JCWong about 4 yearsThe ability to do this by groups is awesome!
-
Kim almost 4 years@NickNassuphis 's answer is short, sweet, not package-dependent, and works well with dplyr pipes!
-
Desmond over 3 yearsI'm familiar with tidyverse but new to data.table - can I ask you what this does? dt[, y_forward_fill := y[1], .(cumsum(!is.na(y)))] Specifically, y[1] and why .(cumsum(!is.na(y))) forward fills the NAs?
-
Tony DiFranco over 3 yearsdata.table syntax is predicated on the form of dt[i, j, by]. The intro vignette is very good. It does takes some getting used to if you are coming from the pipe world.
-
AnilGoyal about 3 yearsThe downside of this function is that, first the atomic vector has to be created
as.data.frame()
and that the output is also adata.frame
instead of atomic vector -
Jantje Houten almost 3 yearsHI @TonyDiFranco, how would you suggest someone implement this if the intention is to fill backwards?
-
Tony DiFranco almost 3 years@JantjeHouten the simplest, though not most efficient, approach would be to reverse the sort order of the data.table, perform a forward fill as indicated, and then reverse once more back to the original order
-
Julien almost 2 years@AnilGoyal It's an upside for my case