Remove columns from dataframe where ALL values are NA
Solution 1
Try this:
df <- df[,colSums(is.na(df))<nrow(df)]
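One caveat with this form: if only a single column survives, `df[, keep]` returns a plain vector rather than a data frame. A minimal sketch of a safer variant, assuming a toy data frame whose column names `a` and `b` are purely illustrative:

```r
# Toy data: `b` is entirely NA, `a` is only partly NA (illustrative names).
df <- data.frame(a = c(1, NA, 3), b = c(NA, NA, NA))

# Keep columns whose NA count is below the number of rows.
keep <- colSums(is.na(df)) < nrow(df)

# drop = FALSE keeps the result a data frame even when one column remains.
cleaned <- df[, keep, drop = FALSE]
str(cleaned)
```

Without `drop = FALSE`, the same subsetting would hand back the bare `a` vector here.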
Solution 2
The two approaches offered thus far fail with large data sets because (among other memory issues) they create is.na(df), which is an object the same size as df.
Here are two approaches that are more memory and time efficient:
An approach using Filter
Filter(function(x)!all(is.na(x)), df)
and an approach using data.table (for general time and memory efficiency)
library(data.table)
DT <- as.data.table(df)
DT[,which(unlist(lapply(DT, function(x)!all(is.na(x))))),with=F]
examples using large data (30 columns, 1e6 rows)
big_data <- replicate(10, data.frame(rep(NA, 1e6), sample(c(1:8,NA),1e6,T), sample(250,1e6,T)),simplify=F)
bd <- do.call(data.frame,big_data)
names(bd) <- paste0('X',seq_len(30))
DT <- as.data.table(bd)
system.time({df1 <- bd[,colSums(is.na(bd)) < nrow(bd)]})
# error -- can't allocate vector of size ...
system.time({df2 <- bd[, !apply(is.na(bd), 2, all)]})
# error -- can't allocate vector of size ...
system.time({df3 <- Filter(function(x)!all(is.na(x)), bd)})
## user system elapsed
## 0.26 0.03 0.29
system.time({DT1 <- DT[,which(unlist(lapply(DT, function(x)!all(is.na(x))))),with=F]})
## user system elapsed
## 0.14 0.03 0.18
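The same column-by-column test also works on a plain data.frame without data.table. A minimal sketch, assuming a hypothetical helper name `drop_all_na_cols` (it is not part of any package); it checks one column at a time, so the full-size is.na(df) matrix is never built:

```r
# Hypothetical helper, named here for illustration only.
drop_all_na_cols <- function(df) {
  # One logical per column: TRUE if the column has at least one non-NA value.
  keep <- vapply(df, function(x) !all(is.na(x)), logical(1))
  # drop = FALSE keeps a data frame even if a single column remains.
  df[, keep, drop = FALSE]
}

# Small example with a numeric, a character and an all-NA column.
small <- data.frame(x = 1:3, y = c(NA, "b", NA), z = NA)
res <- drop_all_na_cols(small)
names(res)
```

Because the test runs per column, it does not depend on the column type.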
Solution 3
Update
You can now use select with the where selection helper. select_if is superseded, but still functional as of dplyr 1.0.2 (thanks to @mcstrother for bringing this to attention).
library(dplyr)
temp <- data.frame(x = 1:5, y = c(1,2,NA,4, 5), z = rep(NA, 5))
not_all_na <- function(x) any(!is.na(x))
not_any_na <- function(x) all(!is.na(x))
> temp
  x  y  z
1 1  1 NA
2 2  2 NA
3 3 NA NA
4 4  4 NA
5 5  5 NA
> temp %>% select(where(not_all_na))
  x  y
1 1  1
2 2  2
3 3 NA
4 4  4
5 5  5
> temp %>% select(where(not_any_na))
  x
1 1
2 2
3 3
4 4
5 5
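If defining a named predicate feels heavy, the test can be written inline as a lambda, as suggested in the comments. A minimal sketch, assuming dplyr 1.0.0 or later is installed (the version that introduced `where`):

```r
library(dplyr)

temp <- data.frame(x = 1:5, y = c(1, 2, NA, 4, 5), z = rep(NA, 5))

# Same selection as not_all_na, written inline as a formula-style lambda.
res <- temp %>% select(where(~ !all(is.na(.x))))
names(res)
```

This keeps `x` and `y` and drops the all-NA column `z`, matching the `not_all_na` result.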
Old Answer
dplyr now has a select_if verb that may be helpful here:
> temp
  x  y  z
1 1  1 NA
2 2  2 NA
3 3 NA NA
4 4  4 NA
5 5  5 NA
> temp %>% select_if(not_all_na)
  x  y
1 1  1
2 2  2
3 3 NA
4 4  4
5 5  5
> temp %>% select_if(not_any_na)
  x
1 1
2 2
3 3
4 4
5 5
Solution 4
Late to the game, but you can also use the janitor package. This function removes columns that are all NA, and can be changed to remove rows that are all NA as well.
df <- janitor::remove_empty(df, which = "cols")
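To remove all-NA rows in the same call, `which` can take both values. A minimal sketch, assuming the janitor package is installed; the data frame and its column names are made up for illustration:

```r
library(janitor)

# Row 2 is entirely NA, and column `c` is entirely NA (illustrative data).
df <- data.frame(a = c(1, NA, 3), b = c(2, NA, 4), c = NA)

# Drop both the all-NA rows and the all-NA columns.
cleaned <- remove_empty(df, which = c("rows", "cols"))
dim(cleaned)
```

Here the result should keep two rows and the columns `a` and `b`.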
Solution 5
Another way would be to use the apply() function.
If you have the data.frame
df <- data.frame (var1 = c(1:7,NA),
var2 = c(1,2,1,3,4,NA,NA,9),
var3 = c(NA)
)
then you can use apply() to see which columns fulfill your condition, and so you can simply do the same subsetting as in the answer by Musa, only with an apply approach.
> !apply (is.na(df), 2, all)
 var1  var2  var3
 TRUE  TRUE FALSE
> df[, !apply(is.na(df), 2, all)]
  var1 var2
1    1    1
2    2    2
3    3    1
4    4    3
5    5    4
6    6   NA
7    7   NA
8   NA    9
Gnark
Updated on November 06, 2021
Comments
-
Gnark over 2 years
I'm having trouble with a data frame and couldn't resolve the issue myself:
The data frame has arbitrary properties as columns, and each row represents one data set. The question is:
How do I get rid of columns where the value is NA for ALL rows?
-
Darren Cook about 12 years
I expected this to be quicker, as the colSums() solution seemed to be doing more work. But on my test set (213 obs. of 1614 variables before, vs. 1377 variables afterwards) it takes exactly 3 times longer. (But +1 for an interesting approach.)
-
Matt Dowle over 11 years
Very nice. You could do the same with data.frame, though. There's nothing here that really needs data.table. The key is the lapply, which avoids the copy of the whole object done by is.na(df). +10 for pointing that out.
-
s_a almost 10 years
How would you do it with a data.frame? @matt-dowle
-
mnel almost 10 years
@s_a, bd1 <- bd[, unlist(lapply(bd, function(x), !all(is.na(x))))]
-
Thieme Hennis over 9 years
@mnel I think you need to remove the , after function(x) - thanks for the example btw
-
mtelesha over 8 years
This creates an object the size of the old object, which is a problem with memory on large objects. Better to use a function to reduce the size. The answer below using Filter or using data.table will help your memory usage.
-
skan almost 8 years
Can you do it faster with := or with a set() ?
-
verbamour over 7 years
This does not appear to work with non-numeric columns.
-
Peter.k almost 6 years
It changes column names if they are duplicated
-
Andrew Brēza almost 5 years
Came here looking for the dplyr solution. Was not disappointed. Thanks!
-
MBorg almost 4 years
I found this had the issue that it would also delete variables with most but not all values as missing
-
jeromeResearch over 3 years
To do this with non-numeric columns, @mnel's solution using Filter() is a good one. A benchmark of multiple approaches can be found in this post
-
mcstrother over 3 years
select_if is now superseded in dplyr, so the last two lines would be temp %>% select(where(not_all_na)) in the most recent syntax -- although select_if still works for now as of dplyr 1.0.2. Also temp %>% select(where(~!all(is.na(.x)))) works if you don't feel like defining the function on a separate line.
-
zack over 3 years
@mcstrother thank you - that is a very helpful update to my answer. If you'd like to answer it yourself I'll happily roll back the edits.
-
Thomas Moore over 2 years
janitor::remove_empty() would be more appropriate here. ?remove_empty = "Remove empty rows and/or columns from a data.frame or matrix"
-
Sky Scraper over 2 years
not_any_na is not found for me. Where does this come from? I have dplyr loaded.....
-
zack over 2 years
@SkyScraper it's a function defined in the code provided
-
gaspar almost 2 years
Doesn't seem to work with single-row data frames.