How do I select columns that may or may not exist?

Solution 1

In the development version of dplyr (at the time of this answer), the call from the question

df %>%
   select(year, contains("boo"))
#     year
#1  2000
#2  2001
#3  2002
#4  2003
#5  2004
#6  2005
#7  2006
#8  2007
#9  2008
#10 2009
#11 2010

gives the expected output: only the year column, since no column name contains "boo".

Otherwise, one option is to use one_of():

df %>%
   select(one_of("year", "boo"))

It emits a warning if a column is not available, but still returns the columns that do exist.
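As a minimal sketch of that behaviour, using the data frame from the question and assuming a dplyr version that still provides one_of(), the warning can be silenced with base R's suppressWarnings() while the existing column is still returned:

```r
library(dplyr)

# Same data frame as in the question
df <- data.frame(year = 2000:2010, foo = 0:10, bar = 10:20)

# one_of() warns about the unknown column "boo";
# suppressWarnings() hides that warning and the result keeps "year"
res_one_of <- suppressWarnings(
  df %>% select(one_of("year", "boo"))
)

names(res_one_of)
# [1] "year"
```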

Another option is matches(), which selects columns whose names match a regular expression:

df %>%
  select(matches("year|boo"))
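
Because the pattern above is unanchored, it also matches longer names that merely contain "year" or "boo". A small sketch of the difference, using a hypothetical data frame df2 (not from the question) and an anchored pattern:

```r
library(dplyr)

# Hypothetical data frame with several column names containing "year"
df2 <- data.frame(year = 2000:2002, year1 = 1:3, fiscal_year = 4:6, bar = 7:9)

# Unanchored: every column whose name contains "year" or "boo"
loose <- names(df2 %>% select(matches("year|boo")))
loose
# [1] "year"        "year1"       "fiscal_year"

# Anchored: only columns named exactly "year" or "boo"
strict <- names(df2 %>% select(matches("^(year|boo)$")))
strict
# [1] "year"
```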

Solution 2

You can use any_of() (from the tidyselect package):

df %>% select(any_of(c("year", "boo")))
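
A short sketch of how this behaves on the data frame from the question, assuming dplyr 1.0.0 or later (which re-exports any_of() from tidyselect). The missing column is skipped silently, and the result follows the order of the requested names:

```r
library(dplyr)

# Same data frame as in the question
df <- data.frame(year = 2000:2010, foo = 0:10, bar = 10:20)

# "boo" does not exist, so any_of() skips it without a warning or error
res_any_of <- df %>% select(any_of(c("bar", "boo", "year")))

# Columns come back in the requested order, not the data frame's order
names(res_any_of)
# [1] "bar"  "year"
```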

Solution 3

Here's a slight twist using dplyr::select_if() that will not throw an "Unknown columns:" warning if you try to select a column name that does not exist, in this case 'bad_column':

df %>% 
  select_if(names(.) %in% c('year', 'bar', 'bad_column'))
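
The same name-matching idea can be sketched in base R, without select_if(). This is an illustration only, not part of the original answer; the vector name wanted is hypothetical:

```r
# Same data frame as in the question
df <- data.frame(year = 2000:2010, foo = 0:10, bar = 10:20)

# Names we would like to keep; 'bad_column' does not exist in df
wanted <- c("year", "bar", "bad_column")

# intersect() keeps only the names actually present in df;
# drop = FALSE guarantees a data frame even if one column remains
res_base <- df[, intersect(wanted, names(df)), drop = FALSE]

names(res_base)
# [1] "year" "bar"
```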
Author: Lyngbakr

Updated on June 06, 2022

Comments

  • Lyngbakr
    Lyngbakr almost 2 years

    I have a data frame that may or may not have some particular columns present. I want to select columns using dplyr if they do exist and, if not, just ignore that I tried to select them. Here's an example:

    # Load libraries
    library(dplyr)
    
    # Create data frame
    df <- data.frame(year = 2000:2010, foo = 0:10, bar = 10:20)
    
    # Pull out some columns
    df %>% select(year, contains("bar"))
    
    # Result
    #    year bar
    # 1  2000  10
    # 2  2001  11
    # 3  2002  12
    # 4  2003  13
    # 5  2004  14
    # 6  2005  15
    # 7  2006  16
    # 8  2007  17
    # 9  2008  18
    # 10 2009  19
    # 11 2010  20
    
    # Try again for non-existent column
    df %>% select(year, contains("boo"))
    
    # Result
    #data frame with 0 columns and 11 rows
    

    In the latter case, I just want to return a data frame with the column year since the column boo doesn't exist. My question is why do I get an empty data frame in the latter case and what is a good way of avoiding this and achieving the desired result?

    EDIT: Session info

    R version 3.3.3 (2017-03-06)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows 7 x64 (build 7601) Service Pack 1
    
    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base     
    
    other attached packages:
    [1] dplyr_0.5.0
    
    loaded via a namespace (and not attached):
    [1] lazyeval_0.2.0   magrittr_1.5     R6_2.2.0         assertthat_0.2.0 DBI_0.6-1        tools_3.3.3     
    [7] tibble_1.3.0     Rcpp_0.12.10    
    
  • David Robinson
    David Robinson almost 7 years
    I don't think this answers the question, which is about an apparent bug in dplyr. select(year, contains("boo")) should include year in the output.
  • akrun
    akrun almost 7 years
    @Patronus I showed a way to get the expected output. His question is: `I just want to return a data frame with the column year since the column boo doesn't exist`.
  • Lionel Trebuchon
    Lionel Trebuchon almost 5 years
    This also works with "-": %>% select(-one_of("not_wanted_variable")) will remove not_wanted_variable from your data.frame.
  • Sylvia Rodriguez
    Sylvia Rodriguez almost 3 years
    Note that 'matches' returns any column that contains "year" or "boo", so if you have column names "year1", "year2", etc, those will all be returned.
  • akrun
    akrun almost 3 years
    @SylviaRodriguez this is an old post. You can modify those with matches("^(year\\d+$)|boo")
  • Pake
    Pake over 2 years
    I prefer this method because select(any_of(c())) respects the order that column names are listed in, and organizes them accordingly in the output.
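
The negative selection mentioned in the comments can be sketched with any_of() in place of one_of(), assuming dplyr 1.0.0 or later. Columns that exist are dropped, and names that do not exist are ignored without a warning:

```r
library(dplyr)

# Same data frame as in the question
df <- data.frame(year = 2000:2010, foo = 0:10, bar = 10:20)

# Drop "foo" if present; "not_wanted_variable" is absent and is ignored
dropped <- df %>% select(-any_of(c("foo", "not_wanted_variable")))

names(dropped)
# [1] "year" "bar"
```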