Replace NA with previous or next value, by group, using dplyr

41,348

Solution 1

library(tidyr) #fill is part of tidyr

ps1 %>% 
  group_by(userID) %>% 
  #fill(color, age, gender) %>% #default direction down
  fill(color, age, gender, .direction = "downup")

Which gives you:

Source: local data frame [9 x 4]
Groups: userID [3]

  userID  color    age gender
   <dbl> <fctr> <fctr> <fctr>
1     21   blue   3yrs      F
2     21   blue   2yrs      F
3     21    red   2yrs      M
4     22   blue   3yrs      F
5     22   blue   3yrs      F
6     22   blue   3yrs      F
7     23    red   4yrs      F
8     23    red   4yrs      F
9     23   gold   4yrs      F

Solution 2

I wrote this function and it is definitely faster than fill and probably faster than na.locf:

fill_NA <- function(x) {
  which.na <- c(which(!is.na(x)), length(x) + 1)
  values <- na.omit(x)

  if (which.na[1] != 1) {
    which.na <- c(1, which.na)
    values <- c(values[1], values)
  }

  diffs <- diff(which.na)
  return(rep(values, times = diffs))
}

Solution 3

Using zoo::na.locf directly on the whole data.frame would fill the NA regardless of the userID groups. Package dplyr's grouping has unfortunately no effect on na.locf function, that's why I went with a split:

library(dplyr); library(zoo)
ps1 %>% split(ps1$userID) %>% 
  lapply(function(x) {na.locf(na.locf(x), fromLast=T)}) %>% 
  do.call(rbind, .)
####      userID color  age gender
#### 21.1     21  blue 3yrs      F
#### 21.2     21  blue 2yrs      F
#### 21.3     21   red 2yrs      M
#### 22.4     22  blue 3yrs      F
#### 22.5     22  blue 3yrs      F
#### 22.6     22  blue 3yrs      F
#### 23.7     23   red 4yrs      F
#### 23.8     23   red 4yrs      F
#### 23.9     23  gold 4yrs      F

What it does is that it first splits the data into 3 data.frames, then I apply a first pass of imputation (downwards), then upwards with the anonymous function in lapply, and eventually use rbind to bring the data.frames back together. You have the expected output.

Solution 4

Using @agenis method with na.locf() combined with purrr, you could do:

library(purrr)
library(zoo)

ps1 %>% 
  slice_rows("userID") %>% 
  by_slice(function(x) { 
    na.locf(na.locf(x), fromLast=T) }, 
    .collate = "rows") 

Solution 5

A few years down the line, I found that things have changed. Using @Steven Beaupré's approach,

1) Adding na.rm=F ensures no rows are deleted/excluded. 2) The slide_rows() function can be found in the purrrlyr package.

library(purrrlyr)
library(zoo)

ps1 %>% 
  slice_rows("userID") %>% 
  by_slice(function(x) { 
    na.locf(na.locf(x, na.rm=F), fromLast=T, na.rm=F) }, 
    .collate = "rows") 
Share:
41,348

Related videos on Youtube

Tarak
Author by

Tarak

Updated on March 03, 2020

Comments

  • Tarak
    Tarak about 4 years

    I have a data frame which is arranged by descending order of date.

    ps1 = data.frame(userID = c(21,21,21,22,22,22,23,23,23), 
                 color = c(NA,'blue','red','blue',NA,NA,'red',NA,'gold'), 
                 age = c('3yrs','2yrs',NA,NA,'3yrs',NA,NA,'4yrs',NA), 
                 gender = c('F',NA,'M',NA,NA,'F','F',NA,'F') 
    )
    

    I wish to impute(replace) NA values with previous values and grouped by userID In case the first row of a userID has NA then replace with the next set of values for that userid group.

    I am trying to use dplyr and zoo packages something like this...but its not working

    cleanedFUG <- filteredUserGroup %>%
     group_by(UserID) %>%
     mutate(Age1 = na.locf(Age), 
         Color1 = na.locf(Color), 
         Gender1 = na.locf(Gender) ) 
    

    I need result df like this:

                          userID color  age gender
                    1     21  blue 3yrs      F
                    2     21  blue 2yrs      F
                    3     21   red 2yrs      M
                    4     22  blue 3yrs      F
                    5     22  blue 3yrs      F
                    6     22  blue 3yrs      F
                    7     23   red 4yrs      F
                    8     23   red 4yrs      F
                    9     23  gold 4yrs      F
    
  • Steven Beaupré
    Steven Beaupré over 7 years
    You could replace do.call() with the more idiomatic bind_rows() and split(ps1$userID) with split(.$userID)
  • Steven Beaupré
    Steven Beaupré over 7 years
    Another alternative using purrr could also be: library(purrr); ps1 %>% slice_rows("userID") %>% by_slice(function(x) { na.locf(na.locf(x), fromLast=T) }, .collate = "rows")
  • agenis
    agenis over 7 years
    @StevenBeaupré nice! that deserves a new answer on its own ;-)
  • Ritchie Sacramento
    Ritchie Sacramento about 4 years
    The fill() function has since been updated to allow filling in both directions at once instead of needing to use it twice. The .direction options now include downup and updown.
  • Johannes Ranke
    Johannes Ranke over 3 years
    Thank you for sharing your code which is independent of a grouping variable (which I do not have in my data) and does not require installing the zoo package.
  • Johannes Ranke
    Johannes Ranke over 3 years
    Except that I am using tidyr anyways, and I discovered that tidyr::fill does the job more elegantly using the pipe, without the need to use mutate().
  • Amit
    Amit about 3 years
    what if your data set is huge and you cant write each columns name
  • Pake
    Pake over 2 years
    @Amit You can use fill(everything()) to fill all columns