How to delete columns that contain ONLY NAs?
Solution 1
One way of doing it:
df[, colSums(is.na(df)) != nrow(df)]
If the count of NAs in a column is equal to the number of rows, it must be entirely NA.
Or similarly
df[colSums(!is.na(df)) > 0]
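As a minimal sketch of the two forms above, here is a self-contained run on toy data (the data frame and its column names are made up for illustration). It also shows one assumption worth checking in your own code: comma-style indexing can collapse to a vector when only one column survives, unless drop = FALSE is given.

```r
# Hypothetical toy data: `all_na` is entirely NA, `some_na` only partly
df <- data.frame(
  id      = 1:4,
  all_na  = rep(NA, 4),
  some_na = c(1, NA, 3, NA)
)

# TRUE for every column that has at least one non-NA value
keep <- colSums(is.na(df)) != nrow(df)

# Indexing without a comma always returns a data.frame
cleaned <- df[keep]
names(cleaned)  # "id" "some_na"

# With a comma, drop = FALSE stops a single surviving column
# from being simplified to a plain vector
one_col <- df[, c(FALSE, FALSE, TRUE), drop = FALSE]
is.data.frame(one_col)  # TRUE
```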
Solution 2
Here is a dplyr solution:
df %>% select_if(~sum(!is.na(.)) > 0)
Update: The select_if() function is superseded as of dplyr 1.0. Here are two other solutions that use the where() tidyselect function:
df %>%
select(
where(
~sum(!is.na(.x)) > 0
)
)
df %>%
select(
where(
~!all(is.na(.x))
)
)
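The same "keep a column if a predicate is TRUE" idea can also be sketched without dplyr. This is not part of the answer above, just a base R analogue using Filter(); the toy data and the helper name not_all_na are made up for illustration.

```r
# Hypothetical toy data: `all_na` is entirely NA, `some_na` only partly
df <- data.frame(
  id      = 1:4,
  all_na  = rep(NA, 4),
  some_na = c(1, NA, 3, NA)
)

# Predicate applied to each column: TRUE if at least one value is not NA
not_all_na <- function(x) !all(is.na(x))

# Filter() keeps the columns for which the predicate returns TRUE
cleaned <- Filter(not_all_na, df)
names(cleaned)  # "id" "some_na"
```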
Solution 3
Another option is the janitor package:
# remove_empty_cols() is deprecated in favour of remove_empty()
df <- janitor::remove_empty(df, which = "cols")
https://github.com/sfirke/janitor
Solution 4
It seems like you want to remove ONLY columns with ALL NAs, leaving columns with some rows that do have NAs. I would do this (but I am sure there is an efficient vectorised solution):
# Set the seed for reproducibility (the sampled values can still differ between R versions)
set.seed(103)
df <- data.frame( id = 1:10 , nas = rep( NA , 10 ) , vals = sample( c( 1:3 , NA ) , 10 , replace = TRUE ) )
df
# id nas vals
# 1 1 NA NA
# 2 2 NA 2
# 3 3 NA 1
# 4 4 NA 2
# 5 5 NA 2
# 6 6 NA 3
# 7 7 NA 2
# 8 8 NA 3
# 9 9 NA 3
# 10 10 NA 2
# Remove columns that are entirely NA; columns where only some values are NA are kept
df[ , ! apply( df , 2 , function(x) all(is.na(x)) ) ]
# id vals
# 1 1 NA
# 2 2 2
# 3 3 1
# 4 4 2
# 5 5 2
# 6 6 3
# 7 7 2
# 8 8 3
# 9 9 3
# 10 10 2
If you find yourself in the situation where you want to remove columns that have any NA values, you can simply change the all command above to any.
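A minimal sketch of the all/any difference on toy data (column names are made up). It uses vapply() over the columns instead of apply(), which is an assumption of this sketch rather than the code above; the predicate is the same.

```r
# Hypothetical toy data: `all_na` is entirely NA, `some_na` only partly
df <- data.frame(
  id      = 1:4,
  all_na  = rep(NA, 4),
  some_na = c(1, NA, 3, NA)
)

# One logical value per column
all_missing <- vapply(df, function(x) all(is.na(x)), logical(1))
any_missing <- vapply(df, function(x) any(is.na(x)), logical(1))

# Dropping all-NA columns keeps the partly missing column
names(df)[!all_missing]  # "id" "some_na"

# Dropping any-NA columns keeps only complete columns
names(df)[!any_missing]  # "id"
```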
Solution 5
An intuitive script: dplyr::select_if(~!all(is.na(.))). It literally keeps only the columns in which not all elements are missing (that is, it deletes the columns in which every element is missing).
> df <- data.frame( id = 1:10 , nas = rep( NA , 10 ) , vals = sample( c( 1:3 , NA ) , 10 , repl = TRUE ) )
> df %>% glimpse()
Observations: 10
Variables: 3
$ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
$ nas <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
$ vals <int> NA, 1, 1, NA, 1, 1, 1, 2, 3, NA
> df %>% select_if(~!all(is.na(.)))
id vals
1 1 NA
2 2 1
3 3 1
4 4 NA
5 5 1
6 6 1
7 7 1
8 8 2
9 9 3
10 10 NA
Lorenzo Rigamonti
Updated on January 06, 2022
Comments
-
Lorenzo Rigamonti over 2 years
I have a data.frame containing some columns with all NA values. How can I delete them from the data.frame?
Can I use the function na.omit(...), specifying some additional arguments?
-
Lorenzo Rigamonti about 11 years
The data.frame has two types of columns: one in which all values are numbers and the other in which all values are NA.
-
Simon O'Hanlon about 11 years
So this will work then. It only removes columns where ALL values are NA.
-
Ciarán Tobin about 11 years
Good solution. I would do apply(is.na(df), 1, all) though, just because it's slightly neater and is.na() is used on all of df rather than one row at a time (should be a bit faster).
-
Simon O'Hanlon about 11 years
@MadScone good tip - does look neater. You should apply across columns, not rows, though.
-
Simon O'Hanlon about 11 years
@MadScone Edits are locked after 5 minutes on comments. I shouldn't worry, it's no biggie!! :-)
-
discipulus about 9 years
How can I delete columns having more than a threshold of NAs, or a percentage (let's say above 50%)?
-
Ciarán Tobin about 9 years
@lovedynasty Probably best to submit a separate question, assuming you haven't already since posting your comment. But anyway, you can always do something like df[, colSums(is.na(df)) < nrow(df) * 0.5], i.e. only keep columns with more than 50% non-blanks.
-
Boern over 8 years
People working with a correlation matrix must use df[, colSums(is.na(df)) != nrow(df) - 1] since the diagonal is always 1.
-
rawr about 8 years
@SimonO'Hanlon three years later.. are you still setting seeds like this? :}
-
Stefan Avey over 7 years
Can use this with the dplyr (version 0.5.0) select_if function as well: df %>% select_if(colSums(!is.na(.)) > 0)
-
EngrStudent over 6 years
At ~15k rows and ~5k columns, this is truly taking forever.
-
EngrStudent over 6 years
I did this on a data table and it became a vector. Nearly gave me a heart attack. Had to convert to a frame. It ran a lot faster.
-
André.B about 5 years
janitor::remove_empty_cols() is deprecated - use df <- janitor::remove_empty(df, which = "cols")
-
Scorpy over 4 years
@MadScone it is giving me a syntax error at "," for df[, colSums(is.na(df)) != nrow(df)] and a syntax error at "!" in df[colSums(!is.na(df)) > 0]. Am I missing something?
-
johnny about 4 years
@EngrStudent Was it faster with the accepted answer's solution?
-
EngrStudent about 4 years
It's been a number of years. I don't remember. DJV has a nice timing post below.
-
EngrStudent about 4 years
Sometimes the first iteration is JIT compiled, so it has very poor, and not very characteristic, times. I think it’s interesting what the larger sample size does to the right tails of the distribution. This is good work.
-
DJV about 4 years
I ran it once again; I wasn't sure I had changed the plot. Regarding the distribution, indeed. I should probably compare different sample sizes when I have the time.
-
EngrStudent about 4 years
If you qqplot (ggplot2.tidyverse.org/reference/geom_qq.html) one of the trends, such as "akrun", then I bet there is one point that is very different from the distribution of the rest. The rest represent how long it takes if you run it repeatedly, but that one represents what happens if you run it once. There is an old saying: you can have 20 years of experience or you can have only one year's worth of experience 20 times.
-
EngrStudent about 4 years
Very nice! I’m surprised by several samples being in the extreme tail. I wonder why it is that those are so much more costly. JIT might be 1 or 2 but not 20. Condition? Interrupts? Other? Thanks again for the update.
-
DJV about 4 years
You're welcome, thank you for the thoughts. Don't know, I actually allowed it to run "freely".
-
Amit Kohli about 2 years
Even remove_empty() works.