Remove columns from dataframe where ALL values are NA

r apply dataframe

143,939

Solution 1

Try this:

df <- df[,colSums(is.na(df))<nrow(df)]
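As a small worked example of the one-liner above, here is a sketch on a toy data frame. The data and the names `keep` and `df_clean` are made up for illustration. It also adds `drop = FALSE`, which is not in the original answer, so that the result stays a data frame even if only one column survives.

```r
# Toy data (illustrative only): column b is entirely NA.
df <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, NA),
  c = c("x", NA, "z")
)

# Count the NAs per column; keep columns with fewer NAs than rows.
keep <- colSums(is.na(df)) < nrow(df)

# drop = FALSE (an addition to the original one-liner) prevents R from
# simplifying a single remaining column to a plain vector.
df_clean <- df[, keep, drop = FALSE]

print(names(df_clean))
```

Columns that are mostly but not entirely NA, like `a` and `c` here, are kept, because only a column whose NA count equals the row count fails the test.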

Solution 2

The two approaches offered thus far fail with large data sets because (amongst other memory issues) they create is.na(df), a logical matrix with the same dimensions as df.

Here are two approaches that are more memory- and time-efficient:

An approach using Filter

Filter(function(x)!all(is.na(x)), df)

and an approach using data.table (for general time and memory efficiency)

library(data.table)
DT <- as.data.table(df)
DT[, which(unlist(lapply(DT, function(x) !all(is.na(x))))), with = FALSE]

examples using large data (30 columns, 1e6 rows)

big_data <- replicate(10, data.frame(rep(NA, 1e6), sample(c(1:8, NA), 1e6, TRUE), sample(250, 1e6, TRUE)), simplify = FALSE)
bd <- do.call(data.frame,big_data)
names(bd) <- paste0('X',seq_len(30))
DT <- as.data.table(bd)

system.time({df1 <- bd[, colSums(is.na(bd)) < nrow(bd)]})
# error -- can't allocate vector of size ...
system.time({df2 <- bd[, !apply(is.na(bd), 2, all)]})
# error -- can't allocate vector of size ...
system.time({df3 <- Filter(function(x)!all(is.na(x)), bd)})
## user  system elapsed 
## 0.26    0.03    0.29 
system.time({DT1 <- DT[, which(unlist(lapply(DT, function(x) !all(is.na(x))))), with = FALSE]})
## user  system elapsed 
## 0.14    0.03    0.18
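The column-by-column idea also works on a plain data.frame, without data.table. The following is a minimal sketch on toy data; `has_data`, `toy`, `via_vapply` and `via_filter` are hypothetical names introduced here, and `vapply()` is used in place of `unlist(lapply(...))` as a stylistic choice, not something the answer prescribes.

```r
# Hypothetical helper (not from the original answer): TRUE when a column
# holds at least one non-NA value.
has_data <- function(x) !all(is.na(x))

# Toy data (illustrative only): column `empty` is entirely NA.
toy <- data.frame(
  empty = rep(NA, 4),
  some  = c(1, NA, 3, NA),
  text  = c("a", "b", NA, "d"),
  stringsAsFactors = FALSE
)

# Check one column at a time, so no full-size is.na(toy) matrix is built.
keep <- vapply(toy, has_data, logical(1))

# Two ways to apply the same test.
via_vapply <- toy[, keep, drop = FALSE]
via_filter <- Filter(has_data, toy)

print(names(via_vapply))
print(names(via_filter))
```

Because the test runs on each column separately, non-numeric columns such as `text` are handled the same way as numeric ones.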

Solution 3

Update

You can now use select with the where selection helper. select_if is superseded, but still functional as of dplyr 1.0.2. (Thanks to @mcstrother for bringing this to attention.)

library(dplyr)
temp <- data.frame(x = 1:5, y = c(1,2,NA,4, 5), z = rep(NA, 5))
not_all_na <- function(x) any(!is.na(x))
not_any_na <- function(x) all(!is.na(x))

> temp
  x  y  z
1 1  1 NA
2 2  2 NA
3 3 NA NA
4 4  4 NA
5 5  5 NA

> temp %>% select(where(not_all_na))
  x  y
1 1  1
2 2  2
3 3 NA
4 4  4
5 5  5

> temp %>% select(where(not_any_na))
  x
1 1
2 2
3 3
4 4
5 5
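If you would rather not define a helper on a separate line, the predicate can be written inline, as suggested in the comments. This is a minimal sketch assuming dplyr 1.0.0 or later is installed (for `where()`); `kept` and `kept_fn` are illustrative names.

```r
library(dplyr)

temp <- data.frame(x = 1:5, y = c(1, 2, NA, 4, 5), z = rep(NA, 5))

# Formula shorthand: keep columns with at least one non-NA value.
kept <- temp %>% select(where(~ !all(is.na(.x))))

# The same predicate written as an ordinary anonymous function.
kept_fn <- temp %>% select(where(function(col) !all(is.na(col))))

print(names(kept))
print(names(kept_fn))
```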

Old Answer

dplyr now has a select_if verb that may be helpful here:

> temp
  x  y  z
1 1  1 NA
2 2  2 NA
3 3 NA NA
4 4  4 NA
5 5  5 NA

> temp %>% select_if(not_all_na)
  x  y
1 1  1
2 2  2
3 3 NA
4 4  4
5 5  5

> temp %>% select_if(not_any_na)
  x
1 1
2 2
3 3
4 4
5 5

Solution 4

Late to the game, but you can also use the janitor package. Its remove_empty() function removes columns that are all NA, and with which = "rows" it removes rows that are all NA instead.

df <- janitor::remove_empty(df, which = "cols")
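For readers without janitor installed, the row-wise variant can be approximated in base R. The sketch below is a base R counterpart written for illustration, not janitor's implementation; the toy data and the names `no_empty_rows` and `no_empty_cols` are assumptions.

```r
# Toy data (illustrative only): row 2 and column b are entirely NA.
df <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, NA),
  c = c(2, NA, NA)
)

# Keep rows that have fewer NAs than there are columns.
no_empty_rows <- df[rowSums(is.na(df)) < ncol(df), , drop = FALSE]

# Keep columns that have fewer NAs than there are rows.
no_empty_cols <- df[, colSums(is.na(df)) < nrow(df), drop = FALSE]

print(no_empty_rows)
print(no_empty_cols)
```

Note that this builds the full is.na(df) matrix, so the memory caveat raised in Solution 2 applies to it as well.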

Solution 5

Another way would be to use the apply() function.

If you have the data.frame

df <- data.frame (var1 = c(1:7,NA),
                  var2 = c(1,2,1,3,4,NA,NA,9),
                  var3 = c(NA)
                  )

then you can use apply() to see which columns fulfill your condition, and subset the same way as in the answer by Musa, only with apply() doing the column check.

> !apply (is.na(df), 2, all)
 var1  var2  var3 
 TRUE  TRUE FALSE 

> df[, !apply(is.na(df), 2, all)]
  var1 var2
1    1    1
2    2    2
3    3    1
4    4    3
5    5    4
6    6   NA
7    7   NA
8   NA    9

Gnark

Updated on November 06, 2021

Comments

Gnark over 2 years

I'm having trouble with a data frame and couldn't really resolve that issue myself:
The dataframe has arbitrary properties as columns and each row represents one data set.
The question is:
How to get rid of columns where for ALL rows the value is NA?
Darren Cook about 12 years

I expected this to be quicker, as the colSums() solution seemed to be doing more work. But on my test set (213 obs. of 1614 variables before, vs. 1377 variables afterwards) it takes exactly 3 times longer. (But +1 for an interesting approach.)
Matt Dowle over 11 years

Very nice. You could do the same with data.frame, though. There's nothing here that really needs data.table. The key is the lapply, which avoids the copy of the whole object done by is.na(df). +10 for pointing that out.
s_a almost 10 years

How would you do it with a data.frame? @matt-dowle
mnel almost 10 years

@s_a, bd1 <- bd[, unlist(lapply(bd, function(x), !all(is.na(x))))]
Thieme Hennis over 9 years

@mnel I think you need to remove the , after function(x) - thanks for the example btw
mtelesha over 8 years

This creates an object the size of the old object, which is a memory problem with large objects. Better to use a function to reduce the size. The answer below using Filter or data.table will help your memory usage.
skan almost 8 years

Can you do it faster with := or with a set() ?
verbamour over 7 years

This does not appear to work with non-numeric columns.
Peter.k almost 6 years

It changes column name if they are duplicated
Andrew Brēza almost 5 years

Came here looking for the dplyr solution. Was not disappointed. Thanks!
MBorg almost 4 years

I found this had the issue that it would also delete variables with most but not all values as missing
jeromeResearch over 3 years

To do this with non-numeric columns, @mnel's solution using Filter() is a good one. A benchmark of multiple approaches can be found in this post
mcstrother over 3 years

select_if is now superseded in dplyr, so the last two lines would be temp %>% select(where(not_all_na)) in the most recent syntax -- although select_if still works for now as of dplyr 1.0.2. Also temp %>% select(where(~!all(is.na(.x)))) works if you don't feel like defining the function on a separate line.
zack over 3 years

@mcstrother thank you - that is a very helpful update to my answer. If you'd like to answer it yourself I'll happily roll back the edits.
Thomas Moore over 2 years

janitor::remove_empty() would be more appropriate here. ?remove_empty = "Remove empty rows and/or columns from a data.frame or matrix"
Sky Scraper over 2 years

not_any_na is not found for me. where does this come from? I have dplyr loaded.....
zack over 2 years

@SkyScraper it's a function defined in the code provided
gaspar almost 2 years

Doesn't seem to work with single-row data frames.