How to delete columns that contain ONLY NAs?


Solution 1

One way of doing it:

df[, colSums(is.na(df)) != nrow(df)]

If the count of NAs in a column is equal to the number of rows, it must be entirely NA.

Or similarly

df[colSums(!is.na(df)) > 0]
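A quick check of both one-liners on a hypothetical toy data frame (the column names here are made up for illustration):

```r
# Hypothetical toy data: 'empty' is all NA, 'partial' has some NAs
df <- data.frame(id = 1:3, empty = NA, partial = c(1, NA, 3))

# Keep columns whose NA count is below the row count
df[, colSums(is.na(df)) != nrow(df)]

# Equivalent: keep columns with at least one non-NA value
df[colSums(!is.na(df)) > 0]
```

Both forms drop only `empty`; `partial` survives because it has at least one non-NA value.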

Solution 2

Here is a dplyr solution:

df %>% select_if(~sum(!is.na(.)) > 0)

Update: The select_if() function is superseded as of dplyr 1.0. Here are two other solutions that use the where() tidyselect helper:

df %>% 
  select(
    where(
      ~sum(!is.na(.x)) > 0
    )
  )
df %>% 
  select(
    where(
      ~!all(is.na(.x))
    )
  )
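On a small hypothetical data frame (column names invented for illustration), the where() variants behave the same as the select_if() version:

```r
library(dplyr)

# Toy data: 'empty' is entirely NA, the other columns are not
df <- data.frame(id = 1:3, empty = NA, partial = c(1, NA, 3))

# where() keeps the columns for which the predicate returns TRUE
df %>% select(where(~ !all(is.na(.x))))
```

Only `empty` is dropped; `partial` is kept even though it contains some NAs.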

Solution 3

Another option is the janitor package:

df <- remove_empty(df, which = "cols")

(In older janitor versions this was remove_empty_cols(df), which has since been deprecated in favour of remove_empty().)

https://github.com/sfirke/janitor

Solution 4

It seems like you want to remove ONLY columns with ALL NAs, leaving columns where only some rows are NA. I would do this (though I am sure there is an efficient vectorised solution):

#set seed for reproducibility
set.seed(103)
df <- data.frame(id = 1:10, nas = rep(NA, 10), vals = sample(c(1:3, NA), 10, replace = TRUE))
df
#      id nas vals
#   1   1  NA   NA
#   2   2  NA    2
#   3   3  NA    1
#   4   4  NA    2
#   5   5  NA    2
#   6   6  NA    3
#   7   7  NA    2
#   8   8  NA    3
#   9   9  NA    3
#   10 10  NA    2

#Use this command to remove columns that are entirely NA values;
#it will leave columns where only some values are NA
df[ , ! apply( df , 2 , function(x) all(is.na(x)) ) ]
#      id vals
#   1   1   NA
#   2   2    2
#   3   3    1
#   4   4    2
#   5   5    2
#   6   6    3
#   7   7    2
#   8   8    3
#   9   9    3
#   10 10    2

If you find yourself in the situation where you want to remove columns that have any NA values you can simply change the all command above to any.
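A sketch of that any() variant, on hypothetical data similar to the example above (note that drop = FALSE is added here so the result stays a data.frame even when only one column survives, which is not in the original one-liner):

```r
# Toy data: 'nas' is all NA, 'vals' has some NAs, 'id' is complete
df <- data.frame(id = 1:10, nas = rep(NA, 10),
                 vals = c(1, NA, 2, 2, 3, 1, 2, 3, 3, 2))

# any() instead of all(): drop every column containing at least one NA.
# drop = FALSE keeps the result a data.frame if a single column remains.
kept <- df[, !apply(df, 2, function(x) any(is.na(x))), drop = FALSE]
kept
```

Here both `nas` and `vals` are removed, leaving only `id`.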

Solution 5

An intuitive option: dplyr::select_if(~!all(is.na(.))). It literally keeps only the columns that are not entirely missing, i.e. it deletes the all-NA columns.

> df <- data.frame(id = 1:10, nas = rep(NA, 10), vals = sample(c(1:3, NA), 10, replace = TRUE))

> df %>% glimpse()
Observations: 10
Variables: 3
$ id   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
$ nas  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
$ vals <int> NA, 1, 1, NA, 1, 1, 1, 2, 3, NA

> df %>% select_if(~!all(is.na(.))) 
   id vals
1   1   NA
2   2    1
3   3    1
4   4   NA
5   5    1
6   6    1
7   7    1
8   8    2
9   9    3
10 10   NA
Author: Lorenzo Rigamonti

Updated on January 06, 2022

Comments

  • Lorenzo Rigamonti
    Lorenzo Rigamonti over 2 years

    I have a data.frame containing some columns with all NA values. How can I delete them from the data.frame?

    Can I use the function,

    na.omit(...) 
    

    specifying some additional arguments?

  • Lorenzo Rigamonti
    Lorenzo Rigamonti about 11 years
The data.frame has two types of columns: one in which all values are numbers and the other in which all values are NA
  • Simon O'Hanlon
    Simon O'Hanlon about 11 years
So this will work then. It only removes columns where ALL values are NA.
  • Ciarán Tobin
    Ciarán Tobin about 11 years
Good solution. I would do apply(is.na(df), 1, all) though just because it's slightly neater and is.na() is used on all of df rather than one row at a time (should be a bit faster).
  • Simon O'Hanlon
    Simon O'Hanlon about 11 years
    @MadScone good tip - does look neater. You should apply across columns not rows though.
  • Simon O'Hanlon
    Simon O'Hanlon about 11 years
    @MadScone Edits are locked after 5 minutes on comments. I shouldn't worry, it's no biggie!! :-)
  • discipulus
    discipulus about 9 years
    How can I delete columns having more than a threshold of NA? or in Percentage (lets say above 50%)?
  • Ciarán Tobin
    Ciarán Tobin about 9 years
    @lovedynasty Probably best to submit a separate question, assuming you haven't already since posting your comment. But anyway, you can always do something like df[, colSums(is.na(df)) < nrow(df) * 0.5] i.e. only keep columns with at least 50% non-blanks.
  • Boern
    Boern over 8 years
    People working with a correlation matrix must use df[, colSums(is.na(df)) != nrow(df) - 1] since the diagonal is always 1
  • rawr
    rawr about 8 years
    @SimonO'Hanlon three years later.. are you still setting seeds like this? :}
  • Stefan Avey
    Stefan Avey over 7 years
    Can use this with the dplyr (version 0.5.0) select_if function as well. df %>% select_if(colSums(!is.na(.)) > 0)
  • EngrStudent
    EngrStudent over 6 years
    At ~15k rows and ~5k columns, this is truly taking forever.
  • EngrStudent
    EngrStudent over 6 years
    I did this on a data table and it became a vector. Nearly gave me a heart attack. Had to convert to a frame. It ran a lot faster.
  • André.B
    André.B about 5 years
    janitor::remove_empty_cols() is deprecated - use df <- janitor::remove_empty(df, which = "cols")
  • Scorpy
    Scorpy over 4 years
    @MadScone it is giving me syntax error at "," for df[, colSums(is.na(df)) != nrow(df)] and syntax error at "!" in df[colSums(!is.na(df)) > 0]. Am i missing something
  • johnny
    johnny about 4 years
    @EngrStudent Was it faster with the accepted answer's solution?
  • EngrStudent
    EngrStudent about 4 years
    It's been a number of years. I don't remember. DJV has a nice timing post below.
  • EngrStudent
    EngrStudent about 4 years
    Sometimes the first iteration is a JIT compiled, so it has very poor, and not very characteristic, times. I think it’s interesting what the larger sample size does to the right tails of the distribution. This is good work.
  • DJV
    DJV about 4 years
    I run it once again, wasn't sure I changed the plot. Regarding the distribution, indeed. I should probably compare different sample sizes when I'll have the time.
  • EngrStudent
    EngrStudent about 4 years
    if you qqplot (ggplot2.tidyverse.org/reference/geom_qq.html) one of the trends, such as "akrun" then I bet there is one point that is very different from the distribution of the rest. The rest represent how long it takes if you run it repeatedly, but that represents what happens if you run it once. There is an old saying: you can have 20 years of experience or you can have only one years worth of experience 20 times.
  • EngrStudent
    EngrStudent about 4 years
    very nice! I’m surprised by several samples being in the extreme tail. I wonder why it is that those are so much more costly. JIT might be 1 or 2 but not 20. Condition? Interrupts? Other? Thanks again for the update.
  • DJV
    DJV about 4 years
    You're welcome, thank you for the thoughts. Don't know, I actually allowed it to run "freely".
  • Amit Kohli
    Amit Kohli about 2 years
    even remove_empty() works