How to delete columns that contain ONLY NAs?


Solution 1

One way of doing it:

df[, colSums(is.na(df)) != nrow(df)]

If the count of NAs in a column is equal to the number of rows, it must be entirely NA.

Or similarly

df[colSums(!is.na(df)) > 0]
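A quick check of both one-liners on a hypothetical toy data frame (the column names here are made up for illustration):

```r
# Hypothetical toy data: 'empty' is all NA, 'partial' has some NAs
df <- data.frame(id = 1:3, empty = NA, partial = c(1, NA, 3))

# Keep columns whose NA count is below the row count
df[, colSums(is.na(df)) != nrow(df)]

# Equivalent: keep columns with at least one non-NA value
df[colSums(!is.na(df)) > 0]
```

Both forms drop only `empty`; `partial` survives because it has at least one non-NA value.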

Solution 2

Here is a dplyr solution:

df %>% select_if(~sum(!is.na(.)) > 0)

Update: The select_if() function is superseded as of dplyr 1.0. Here are two other solutions that use the where() tidyselect helper:

df %>% 
  select(
    where(
      ~sum(!is.na(.x)) > 0
    )
  )
df %>% 
  select(
    where(
      ~!all(is.na(.x))
    )
  )
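On a small hypothetical data frame (column names invented for illustration), the where() variants behave the same as the select_if() version:

```r
library(dplyr)

# Toy data: 'empty' is entirely NA, the other columns are not
df <- data.frame(id = 1:3, empty = NA, partial = c(1, NA, 3))

# where() keeps the columns for which the predicate returns TRUE
df %>% select(where(~ !all(is.na(.x))))
```

Only `empty` is dropped; `partial` is kept even though it contains some NAs.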

Solution 3

Another option is the janitor package:

df <- remove_empty(df, which = "cols")

(In older janitor versions this was remove_empty_cols(df), which has since been deprecated in favour of remove_empty().)

https://github.com/sfirke/janitor

Solution 4

It seems like you want to remove ONLY columns with ALL NAs, leaving columns where only some rows are NA. I would do this (though I am sure there is an efficient vectorised solution):

#set seed for reproducibility
set.seed(103)
df <- data.frame(id = 1:10, nas = rep(NA, 10), vals = sample(c(1:3, NA), 10, replace = TRUE))
df
#      id nas vals
#   1   1  NA   NA
#   2   2  NA    2
#   3   3  NA    1
#   4   4  NA    2
#   5   5  NA    2
#   6   6  NA    3
#   7   7  NA    2
#   8   8  NA    3
#   9   9  NA    3
#   10 10  NA    2

#Use this command to remove columns that are entirely NA values;
#it will leave columns where only some values are NA
df[ , ! apply( df , 2 , function(x) all(is.na(x)) ) ]
#      id vals
#   1   1   NA
#   2   2    2
#   3   3    1
#   4   4    2
#   5   5    2
#   6   6    3
#   7   7    2
#   8   8    3
#   9   9    3
#   10 10    2

If you find yourself in the situation where you want to remove columns that have any NA values you can simply change the all command above to any.
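A sketch of that any() variant, on hypothetical data similar to the example above (note that drop = FALSE is added here so the result stays a data.frame even when only one column survives, which is not in the original one-liner):

```r
# Toy data: 'nas' is all NA, 'vals' has some NAs, 'id' is complete
df <- data.frame(id = 1:10, nas = rep(NA, 10),
                 vals = c(1, NA, 2, 2, 3, 1, 2, 3, 3, 2))

# any() instead of all(): drop every column containing at least one NA.
# drop = FALSE keeps the result a data.frame if a single column remains.
kept <- df[, !apply(df, 2, function(x) any(is.na(x))), drop = FALSE]
kept
```

Here both `nas` and `vals` are removed, leaving only `id`.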

Solution 5

An intuitive option: dplyr::select_if(~!all(is.na(.))). It literally keeps only the columns that are not entirely missing, i.e. it deletes the all-NA columns.

> df <- data.frame(id = 1:10, nas = rep(NA, 10), vals = sample(c(1:3, NA), 10, replace = TRUE))

> df %>% glimpse()
Observations: 10
Variables: 3
$ id   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
$ nas  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
$ vals <int> NA, 1, 1, NA, 1, 1, 1, 2, 3, NA

> df %>% select_if(~!all(is.na(.))) 
   id vals
1   1   NA
2   2    1
3   3    1
4   4   NA
5   5    1
6   6    1
7   7    1
8   8    2
9   9    3
10 10   NA
Author: Lorenzo Rigamonti

Updated on January 06, 2022

Comments

  • Lorenzo Rigamonti
    Lorenzo Rigamonti over 2 years

    I have a data.frame containing some columns with all NA values. How can I delete them from the data.frame?

    Can I use the function,

    na.omit(...) 
    

    specifying some additional arguments?

  • Lorenzo Rigamonti
    Lorenzo Rigamonti about 11 years
The data.frame has two types of columns: one in which all values are numbers and the other in which all values are NA
  • Simon O'Hanlon
    Simon O'Hanlon about 11 years
So this will work then. It only removes columns where ALL values are NA.
  • Ciarán Tobin
    Ciarán Tobin about 11 years
Good solution. I would do apply(is.na(df), 1, all) though just because it's slightly neater and is.na() is used on all of df rather than one row at a time (should be a bit faster).
  • Simon O'Hanlon
    Simon O'Hanlon about 11 years
    @MadScone good tip - does look neater. You should apply across columns not rows though.
  • Simon O'Hanlon
    Simon O'Hanlon about 11 years
    @MadScone Edits are locked after 5 minutes on comments. I shouldn't worry, it's no biggie!! :-)
  • discipulus
    discipulus about 9 years
    How can I delete columns having more than a threshold of NA? or in Percentage (lets say above 50%)?
  • Ciarán Tobin
    Ciarán Tobin about 9 years
    @lovedynasty Probably best to submit a separate question, assuming you haven't already since posting your comment. But anyway, you can always do something like df[, colSums(is.na(df)) < nrow(df) * 0.5] i.e. only keep columns with at least 50% non-blanks.
  • Boern
    Boern over 8 years
    People working with a correlation matrix must use df[, colSums(is.na(df)) != nrow(df) - 1] since the diagonal is always 1
  • rawr
    rawr about 8 years
    @SimonO'Hanlon three years later.. are you still setting seeds like this? :}
  • Stefan Avey
    Stefan Avey over 7 years
    Can use this with the dplyr (version 0.5.0) select_if function as well. df %>% select_if(colSums(!is.na(.)) > 0)
  • EngrStudent
    EngrStudent over 6 years
    At ~15k rows and ~5k columns, this is truly taking forever.
  • EngrStudent
    EngrStudent over 6 years
    I did this on a data table and it became a vector. Nearly gave me a heart attack. Had to convert to a frame. It ran a lot faster.
  • André.B
    André.B about 5 years
    janitor::remove_empty_cols() is deprecated - use df <- janitor::remove_empty(df, which = "cols")
  • Scorpy
    Scorpy over 4 years
    @MadScone it is giving me syntax error at "," for df[, colSums(is.na(df)) != nrow(df)] and syntax error at "!" in df[colSums(!is.na(df)) > 0]. Am i missing something
  • johnny
    johnny about 4 years
    @EngrStudent Was it faster with the accepted answer's solution?
  • EngrStudent
    EngrStudent about 4 years
    It's been a number of years. I don't remember. DJV has a nice timing post below.
  • EngrStudent
    EngrStudent about 4 years
    Sometimes the first iteration is a JIT compiled, so it has very poor, and not very characteristic, times. I think it’s interesting what the larger sample size does to the right tails of the distribution. This is good work.
  • DJV
    DJV about 4 years
    I run it once again, wasn't sure I changed the plot. Regarding the distribution, indeed. I should probably compare different sample sizes when I'll have the time.
  • EngrStudent
    EngrStudent about 4 years
    if you qqplot (ggplot2.tidyverse.org/reference/geom_qq.html) one of the trends, such as "akrun" then I bet there is one point that is very different from the distribution of the rest. The rest represent how long it takes if you run it repeatedly, but that represents what happens if you run it once. There is an old saying: you can have 20 years of experience or you can have only one years worth of experience 20 times.
  • EngrStudent
    EngrStudent about 4 years
    very nice! I’m surprised by several samples being in the extreme tail. I wonder why it is that those are so much more costly. JIT might be 1 or 2 but not 20. Condition? Interrupts? Other? Thanks again for the update.
  • DJV
    DJV about 4 years
    You're welcome, thank you for the thoughts. Don't know, I actually allowed it to run "freely".
  • Amit Kohli
    Amit Kohli about 2 years
    even remove_empty() works