Remove columns from dataframe where ALL values are NA

Solution 1

Try this:

df <- df[,colSums(is.na(df))<nrow(df)]
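
A small worked example may make the one-liner easier to follow. This is only a sketch on made-up data: the toy data frame and the names keep and res are illustrative, and the drop = FALSE argument is an addition that is not part of the line above (it keeps the result a data frame when only one column survives).

```r
# Toy data (hypothetical): column `c` is entirely NA, column `b` only partly.
df <- data.frame(a = 1:3, b = c(NA, 2, NA), c = c(NA, NA, NA))

# A column is all-NA exactly when its NA count equals the number of rows,
# so keep the columns whose NA count is strictly smaller than nrow(df).
keep <- colSums(is.na(df)) < nrow(df)

# drop = FALSE (an addition to the original one-liner) prevents the result
# from collapsing to a plain vector if a single column remains.
res <- df[, keep, drop = FALSE]
names(res)
```

Note that column b is kept: it contains some NA values, but not only NA values.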

Solution 2

The two approaches offered thus far fail with large data sets as (amongst other memory issues) they create is.na(df), which will be an object the same size as df.

Here are two approaches that are more memory- and time-efficient:

An approach using Filter

Filter(function(x)!all(is.na(x)), df)

and an approach using data.table (for general time and memory efficiency)

library(data.table)
DT <- as.data.table(df)
# keep only the columns that are not entirely NA
DT[, which(unlist(lapply(DT, function(x) !all(is.na(x))))), with = FALSE]
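
As a quick small-scale check of the Filter approach, the sketch below runs it on a made-up data frame with mixed column types. The data and the names not_all_na and res are illustrative assumptions, not part of the original answer.

```r
# Toy data (hypothetical) mixing integer, character and all-NA columns.
df <- data.frame(
  id    = 1:3,
  label = c("a", NA, "c"),
  empty = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

# Predicate applied to one column at a time; no full-size is.na(df) is built.
not_all_na <- function(x) !all(is.na(x))

# Filter() treats the data frame as a list of columns and keeps those
# for which the predicate returns TRUE.
res <- Filter(not_all_na, df)
names(res)
```

Because the predicate only calls is.na() on each column, it does not depend on the columns being numeric.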

Examples using large data (30 columns, 1e6 rows):

big_data <- replicate(10, data.frame(rep(NA, 1e6), sample(c(1:8, NA), 1e6, TRUE), sample(250, 1e6, TRUE)), simplify = FALSE)
bd <- do.call(data.frame,big_data)
names(bd) <- paste0('X',seq_len(30))
DT <- as.data.table(bd)

system.time({df1 <- bd[, colSums(is.na(bd)) < nrow(bd)]})
# error -- can't allocate vector of size ...
system.time({df2 <- bd[, !apply(is.na(bd), 2, all)]})
# error -- can't allocate vector of size ...
system.time({df3 <- Filter(function(x)!all(is.na(x)), bd)})
## user  system elapsed 
## 0.26    0.03    0.29 
system.time({DT1 <- DT[, which(unlist(lapply(DT, function(x) !all(is.na(x))))), with = FALSE]})
## user  system elapsed 
## 0.14    0.03    0.18 
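
As noted in the comments, nothing in the column-wise test requires data.table; the saving comes from testing one column at a time instead of building is.na() for the whole object. A minimal sketch of the same idea on a plain data.frame follows. It uses vapply() in place of unlist(lapply()), which is a substitution made here rather than the author's code, and bd_small, keep and res are made-up names.

```r
# Small stand-in (hypothetical) for the large `bd` data frame.
bd_small <- data.frame(X1 = c(NA, NA), X2 = c(1, NA), X3 = c("u", "v"))

# One logical per column: TRUE when the column has at least one non-NA value.
# vapply() is used here instead of unlist(lapply()) so the result type is checked.
keep <- vapply(bd_small, function(x) !all(is.na(x)), logical(1))

# List-style subsetting (single bracket, no comma) always returns a data frame,
# even if only one column is kept.
res <- bd_small[keep]
names(res)
```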

Solution 3

Update

You can now use select with the where selection helper. select_if is superseded, but still functional as of dplyr 1.0.2 (thanks to @mcstrother for bringing this to attention).

library(dplyr)
temp <- data.frame(x = 1:5, y = c(1,2,NA,4, 5), z = rep(NA, 5))
not_all_na <- function(x) any(!is.na(x))
not_any_na <- function(x) all(!is.na(x))

> temp
  x  y  z
1 1  1 NA
2 2  2 NA
3 3 NA NA
4 4  4 NA
5 5  5 NA

> temp %>% select(where(not_all_na))
  x  y
1 1  1
2 2  2
3 3 NA
4 4  4
5 5  5

> temp %>% select(where(not_any_na))
  x
1 1
2 2
3 3
4 4
5 5
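
If defining a named helper feels like overkill, the predicate can also be written inline as a formula, as @mcstrother points out in the comments. The sketch below assumes dplyr 1.0.0 or later (where where() is available) and reuses the temp data frame from above; res is a made-up name.

```r
library(dplyr)

temp <- data.frame(x = 1:5, y = c(1, 2, NA, 4, 5), z = rep(NA, 5))

# The formula ~ !all(is.na(.x)) is shorthand for function(.x) !all(is.na(.x)),
# i.e. the same test as not_all_na, written inline.
res <- temp %>% select(where(~ !all(is.na(.x))))
names(res)
```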

Old Answer

dplyr now has a select_if verb that may be helpful here:

> temp
  x  y  z
1 1  1 NA
2 2  2 NA
3 3 NA NA
4 4  4 NA
5 5  5 NA

> temp %>% select_if(not_all_na)
  x  y
1 1  1
2 2  2
3 3 NA
4 4  4
5 5  5

> temp %>% select_if(not_any_na)
  x
1 1
2 2
3 3
4 4
5 5

Solution 4

Late to the game, but you can also use the janitor package. Its remove_empty() function removes columns that are all NA, and can be set to remove rows that are all NA as well.

df <- janitor::remove_empty(df, which = "cols")
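
To illustrate the row and column variants, here is a short sketch on made-up data. It assumes the janitor package is installed; the data frame and the result names are illustrative only.

```r
library(janitor)

# Toy data (hypothetical): row 2 and column `c` are entirely NA.
df <- data.frame(a = c(1, NA, 3), b = c("x", NA, "z"), c = c(NA, NA, NA))

# Drop columns that are entirely NA.
no_empty_cols <- remove_empty(df, which = "cols")

# Drop rows that are entirely NA.
no_empty_rows <- remove_empty(df, which = "rows")

# Drop both in one call.
no_empty_both <- remove_empty(df, which = c("rows", "cols"))
dim(no_empty_both)
```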

Solution 5

Another way would be to use the apply() function.

If you have the data.frame

df <- data.frame(var1 = c(1:7, NA),
                 var2 = c(1, 2, 1, 3, 4, NA, NA, 9),
                 var3 = c(NA))

then you can use apply() to see which columns fulfill your condition, and then do the same subsetting as in the answer by Musa, only with an apply approach.

> !apply (is.na(df), 2, all)
 var1  var2  var3 
 TRUE  TRUE FALSE 

> df[, !apply(is.na(df), 2, all)]
  var1 var2
1    1    1
2    2    2
3    3    1
4    4    3
5    5    4
6    6   NA
7    7   NA
8   NA    9

Author: Gnark

Updated on November 06, 2021

Comments

  • Gnark
    Gnark over 2 years

    I'm having trouble with a data frame and couldn't really resolve that issue myself:
    The dataframe has arbitrary properties as columns and each row represents one data set.

    The question is:
    How to get rid of columns where for ALL rows the value is NA?

  • Darren Cook
    Darren Cook about 12 years
    I expected this to be quicker, as the colSums() solution seemed to be doing more work. But on my test set (213 obs. of 1614 variables before, vs. 1377 variables afterwards) it takes exactly 3 times longer. (But +1 for an interesting approach.)
  • Matt Dowle
    Matt Dowle over 11 years
    Very nice. You could do the same with data.frame, though. There's nothing here that really needs data.table. The key is the lapply, which avoids the copy of the whole object done by is.na(df). +10 for pointing that out.
  • s_a
    s_a almost 10 years
    How would you do it with a data.frame? @matt-dowle
  • mnel
    mnel almost 10 years
    @s_a, bd1 <- bd[, unlist(lapply(bd, function(x), !all(is.na(x))))]
  • Thieme Hennis
    Thieme Hennis over 9 years
    @mnel I think you need to remove the , after function(x) - thanks for the example btw
  • mtelesha
    mtelesha over 8 years
    This creates an object the size of the old object, which is a problem for memory on large objects. Better to use a function to reduce the size. The answer below using Filter or using data.table will help your memory usage.
  • skan
    skan almost 8 years
    Can you do it faster with := or with a set() ?
  • verbamour
    verbamour over 7 years
    This does not appear to work with non-numeric columns.
  • Peter.k
    Peter.k almost 6 years
    It changes column name if they are duplicated
  • Andrew Brēza
    Andrew Brēza almost 5 years
    Came here looking for the dplyr solution. Was not disappointed. Thanks!
  • MBorg
    MBorg almost 4 years
    I found this had the issue that it would also delete variables with most but not all values as missing
  • jeromeResearch
    jeromeResearch over 3 years
    To do this with non-numeric columns, @mnel's solution using Filter() is a good one. A benchmark of multiple approaches can be found in this post
  • mcstrother
    mcstrother over 3 years
    select_if is now superseded in dplyr, so the last two lines would be temp %>% select(where(not_all_na)) in the most recent syntax -- although select_if still works for now as of dplyr 1.0.2. Also temp %>% select(where(~!all(is.na(.x)))) works if you don't feel like defining the function on a separate line.
  • zack
    zack over 3 years
    @mcstrother thank you - that is a very helpful update to my answer. If you'd like to answer it yourself I'll happily roll back the edits.
  • Thomas Moore
    Thomas Moore over 2 years
    janitor::remove_empty() would be more appropriate here. ?remove_empty = "Remove empty rows and/or columns from a data.frame or matrix"
  • Sky Scraper
    Sky Scraper over 2 years
    not_any_na is not found for me. where does this come from? I have dplyr loaded.....
  • zack
    zack over 2 years
    @SkyScraper it's a function defined in the code provided
  • gaspar
    gaspar almost 2 years
    Doesn't seem to work with single-row data frames.