Select columns based on string match - dplyr::select


Solution 1

Within the dplyr world, try:

select(iris, contains("Sepal"))

See the Selection section in ?select for numerous other helpers like starts_with, ends_with, etc.
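As a minimal sketch of two of those helpers, the following applies starts_with() and ends_with() to the built-in iris data. It assumes dplyr is installed; the variable names are illustrative only.

```r
library(dplyr)

# Columns whose names begin with "Petal"
petal_cols <- select(iris, starts_with("Petal"))

# Columns whose names end with "Width"
width_cols <- select(iris, ends_with("Width"))

names(petal_cols)  # "Petal.Length" "Petal.Width"
names(width_cols)  # "Sepal.Width" "Petal.Width"
```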

Solution 2

You can try:

select(data, matches("search_string"))

matches() is more general than contains(): it accepts a regular expression (e.g. "one_string|or_the_other").

For more examples, see: http://rpackages.ianhowson.com/cran/dplyr/man/select.html.
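To make the regex point concrete, here is a small sketch on the built-in iris data, assuming dplyr is installed. The pattern and the variable name are only illustrations.

```r
library(dplyr)

# Regex alternation: names starting with "Sepal" OR ending in "Width"
picked <- select(iris, matches("^Sepal|Width$"))

names(picked)  # "Sepal.Length" "Sepal.Width" "Petal.Width"
```

Note that matches() ignores case by default (ignore.case = TRUE), so pass ignore.case = FALSE if the distinction matters.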

Solution 3

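There is no need for select here; base R's [ works as well. Adding drop = FALSE keeps the result a data frame even when only one column matches:

data[, grepl("search_string", colnames(data)), drop = FALSE]

Let's try it with the iris dataset (first six rows shown):

> head(iris[, grepl("Sepal", colnames(iris)), drop = FALSE])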
  Sepal.Length Sepal.Width
1          5.1         3.5
2          4.9         3.0
3          4.7         3.2
4          4.6         3.1
5          5.0         3.6
6          5.4         3.9
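If numeric column positions are wanted rather than a logical vector, grep() returns them directly, and which() converts a grepl() result into the same thing. The sketch below assumes dplyr is installed for the last step; the variable names are made up for illustration.

```r
library(dplyr)

# grep() returns the integer positions of the matching column names
sepal_idx <- grep("Sepal", colnames(iris))
sepal_idx  # 1 2

# which() turns the logical vector from grepl() into the same positions
same_idx <- which(grepl("Sepal", colnames(iris)))

# The positions can index with [ or be passed to select() through all_of()
by_bracket <- iris[, sepal_idx, drop = FALSE]
by_select  <- select(iris, all_of(sepal_idx))
```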

Solution 4

Building on Piotr Migdal's answer, here is an alternative that works with a vector of strings:

myVectorOfStrings <- c("foo", "bar")
matchExpression <- paste(myVectorOfStrings, collapse = "|")
# [1] "foo|bar"
df %>% select(matches(matchExpression))

This makes use of the regex OR operator (|).

ATTENTION: If you really have a plain vector of column names (and do not need the power of regular expressions), please see the comment below this answer, since it is the cleaner solution.
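The comment mentioned above is not reproduced here. As one possibility for the plain-vector case (an assumption, not necessarily what that comment proposed), the tidyselect helpers all_of() and any_of() select by exact name. The sketch assumes dplyr is installed; the vector wanted and the name "not_a_column" are hypothetical.

```r
library(dplyr)

# Hypothetical vector of exact column names
wanted <- c("Sepal.Length", "Species")

# all_of() is strict: it raises an error if any name is missing
exact <- select(iris, all_of(wanted))

# any_of() is lenient: names that do not exist are silently skipped
lenient <- select(iris, any_of(c(wanted, "not_a_column")))

names(exact)    # "Sepal.Length" "Species"
names(lenient)  # "Sepal.Length" "Species"
```

Because no regex is involved, characters such as "." in the column names are matched literally.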

Author: Timm S.


Updated on February 08, 2022

Comments

  • Timm S.

    I have a data frame ("data") with lots and lots of columns. Some of the columns contain a certain string ("search_string").

    How can I use dplyr::select() to give me a subset including only the columns that contain the string?

    I tried:

    # columns as boolean vector
    select(data, grepl("search_string",colnames(data)))
    
    # columns as vector of column names
    select(data, colnames(data)[grepl("search_string",colnames(data))]) 
    

    Neither of them works.

    I know that select() accepts numeric vectors as substitute for columns e.g.:

    select(data,5,7,9:20)
    

    But I don't know how to get a numeric vector of column IDs from my grepl() expression.