Coerce multiple columns to factors at once


Solution 1

Choose some columns to coerce to factors:

cols <- c("A", "C", "D", "H")

Use lapply() to coerce and replace the chosen columns:

data[cols] <- lapply(data[cols], factor)  ## as.factor() could also be used

Check the result:

sapply(data, class)
#        A         B         C         D         E         F         G 
# "factor" "integer"  "factor"  "factor" "integer" "integer" "integer" 
#        H         I         J 
# "factor" "integer" "integer" 
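The list-style subset data[cols] used above always returns a data frame, which is what lapply() needs in order to loop over columns. The matrix-style data[, cols] behaves the same for several columns but drops to a plain vector when cols names a single column. A small sketch of the difference, assuming sample data like that in the question (the seed is arbitrary):

```r
set.seed(1)
data <- data.frame(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))

# Several columns: list-style and matrix-style subsets are both data frames
is.data.frame(data[c("A", "C")])    # TRUE
is.data.frame(data[, c("A", "C")])  # TRUE

# One column: the matrix-style subset drops to a plain vector by default
is.data.frame(data["A"])                  # TRUE
is.data.frame(data[, "A"])                # FALSE
is.data.frame(data[, "A", drop = FALSE])  # TRUE

# The list-style form therefore also works when only one column is chosen
cols <- "A"
data[cols] <- lapply(data[cols], factor)
class(data$A)  # "factor"
```

Sticking to data[cols] keeps the same line of code working whether cols holds one name or many.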

Solution 2

Here is an option using dplyr. The %<>% operator from magrittr updates the lhs object with the resulting value.

library(magrittr)
library(dplyr)
cols <- c("A", "C", "D", "H")

data %<>%
  mutate_each_(funs(factor(.)), cols)
str(data)
#'data.frame':  4 obs. of  10 variables:
# $ A: Factor w/ 4 levels "23","24","26",..: 1 2 3 4
# $ B: int  15 13 39 16
# $ C: Factor w/ 4 levels "3","5","18","37": 2 1 3 4
# $ D: Factor w/ 4 levels "2","6","28","38": 3 1 4 2
# $ E: int  14 4 22 20
# $ F: int  7 19 36 27
# $ G: int  35 40 21 10
# $ H: Factor w/ 4 levels "11","29","32",..: 1 4 3 2
# $ I: int  17 1 9 25
# $ J: int  12 30 8 33

Or, if we are using data.table, either use a for loop with set():

library(data.table)
setDT(data)
for(j in cols){
  set(data, i=NULL, j=j, value=factor(data[[j]]))
}

Or we can specify 'cols' in .SDcols and assign (:=) the rhs to 'cols':

setDT(data)[, (cols):= lapply(.SD, factor), .SDcols=cols]

Solution 3

The more recent tidyverse way is to use the mutate_at function:

library(tidyverse)
library(magrittr)
set.seed(88)

data <- data.frame(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))
cols <- c("A", "C", "D", "H")

data %<>% mutate_at(cols, factor)
str(data)
# $ A: Factor w/ 4 levels "5","17","18",..: 2 1 4 3
# $ B: int  36 35 2 26
# $ C: Factor w/ 4 levels "22","31","32",..: 1 2 4 3
# $ D: Factor w/ 4 levels "1","9","16","39": 3 4 1 2
# $ E: int  3 14 30 38
# $ F: int  27 15 28 37
# $ G: int  19 11 6 21
# $ H: Factor w/ 4 levels "7","12","20",..: 1 3 4 2
# $ I: int  23 24 13 8
# $ J: int  10 25 4 33
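The dplyr reference quoted in the comments below says that across() supersedes the scoped variants, the family that mutate_at() and mutate_if() belong to. A minimal sketch of the same conversion in that style, assuming dplyr 1.0.0 or later is installed. all_of() is used here to signal that cols is an external character vector, and the result is assigned back with <- rather than %<>%, so magrittr is not needed:

```r
library(dplyr)

set.seed(88)
data <- data.frame(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))
cols <- c("A", "C", "D", "H")

# Apply factor() to exactly the columns named in `cols`
data <- data %>% mutate(across(all_of(cols), factor))

sapply(data, class)
```

The remaining columns are left untouched, so B, E, F, G, I and J stay integer.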

Solution 4

You can use mutate_if (dplyr):

For example, to coerce integer columns to factors:

mydata <- structure(list(a = 1:10, b = 1:10, c = c("a", "a", "b", "b", 
"c", "c", "c", "c", "c", "c")), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame"))

# A tibble: 10 x 3
       a     b c    
   <int> <int> <chr>
 1     1     1 a    
 2     2     2 a    
 3     3     3 b    
 4     4     4 b    
 5     5     5 c    
 6     6     6 c    
 7     7     7 c    
 8     8     8 c    
 9     9     9 c    
10    10    10 c   

Use the function:

library(dplyr)

mydata %>%
    mutate_if(is.integer, as.factor)

# A tibble: 10 x 3
       a     b c    
   <fct> <fct> <chr>
 1     1     1 a    
 2     2     2 a    
 3     3     3 b    
 4     4     4 b    
 5     5     5 c    
 6     6     6 c    
 7     7     7 c    
 8     8     8 c    
 9     9     9 c    
10    10    10 c    

Solution 5

For completeness, and because this question also asks about changing only string columns, there's mutate_if:

library(dplyr)

data <- cbind(stringVar = sample(c("foo", "bar"), 10, replace = TRUE),
              data.frame(matrix(sample(1:40), 10, 10, dimnames = list(1:10, LETTERS[1:10]))),
              stringsAsFactors = FALSE)

# funs() is not needed for a single transformation; pass the function directly
factoredData <- data %>% mutate_if(is.character, factor)
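The same "string columns only" selection can be sketched in base R by building a logical index of character columns and reusing the lapply() idiom from Solution 1. The data frame and column names below are made up for illustration:

```r
# Hypothetical data: two character columns and one integer column
data <- data.frame(
  stringVar = c("foo", "bar", "foo", "bar"),
  other     = c("x", "y", "x", "z"),
  A         = 1:4,
  stringsAsFactors = FALSE
)

# TRUE for each column that is currently character
is_chr <- vapply(data, is.character, logical(1))

# Convert only those columns; everything else is left as-is
data[is_chr] <- lapply(data[is_chr], factor)

str(data)
```

vapply() is used instead of sapply() so the index is guaranteed to be a logical vector even for a data frame with no columns.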

Author by

wsda

Updated on December 03, 2021

Comments

  • wsda
    wsda over 2 years

    I have a sample data frame like below:

    data <- data.frame(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))
    

    I want to know how I can select multiple columns and convert them to factors together. I usually do it like this: data$A = as.factor(data$A). But when the data frame is very large and contains many columns, this is very time-consuming. Does anyone know of a better way to do it?

  • TayTay
    TayTay over 8 years
    Wouldn't it need to be data[,cols] <- lapply(data[,cols], factor) (with the leading comma for columns)?
  • Rich Scriven
    Rich Scriven over 8 years
    @Tgsmith61591- It could be either. With the comma is a matrix-type subset, without the comma is a list subset. Data frames can be subsetted by either one so either way would work.
  • cbrnr
    cbrnr about 6 years
    You don't even need to use funs if you only perform one transformation; mutate_at(cols, factor) is sufficient.
  • Ben
    Ben almost 6 years
    How can this solution be expanded to include factor levels and labels?
  • Rich Scriven
    Rich Scriven over 5 years
    @Ben - It's probably best to ask a new question
  • Microscone
    Microscone almost 5 years
    This is a great solution, and my go-to code now for changing column classes. However, I think using sapply to view the classes is ugly/hard to read. str(data) works better for me.
  • Tan Naidu
    Tan Naidu almost 5 years
    To add to Rich Scriven's answer, I had too many columns and didn't want to name all of them. I ended up using indices, as in the sample below: cols <- c(2, 5, 7, 14:16); data[cols] <- lapply(data[cols], factor)
  • Brian D
    Brian D almost 5 years
    @Ben you can specify labels and levels by extending the answer: data[cols] <- lapply(data[cols], factor, levels=c("val1", "val2", ...), labels=c("label1", "label2", ...)). Be careful with this though: all of the variables will use the same levels and labels you provide.
  • Casey Jayne
    Casey Jayne over 2 years
    can you add your citation for why we need/should use 'across'? I don't see it in R4DS or the ?dplyr page
  • GuedesBF
    GuedesBF over 2 years
    dplyr.tidyverse.org/reference/across.html "across() supersedes the family of "scoped variants" like summarise_at(), summarise_if(), and summarise_all()."
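
On the caveat above that a single lapply() call gives every column the same levels and labels: one possible workaround is to pair each column with its own specification and iterate over both with Map(). This is only a sketch; the data frame, column names, codes and labels are all hypothetical.

```r
# Hypothetical data: two coded columns with different meanings, one numeric column
df <- data.frame(
  sex   = c(1, 2, 2, 1),
  grade = c(3, 1, 2, 3),
  score = c(10, 20, 30, 40)
)

# One levels/labels specification per column; names must match columns of df
specs <- list(
  sex   = list(levels = c(1, 2),    labels = c("male", "female")),
  grade = list(levels = c(1, 2, 3), labels = c("low", "mid", "high"))
)

cols <- names(specs)

# Map() walks the columns and their specifications in parallel
df[cols] <- Map(
  function(x, s) factor(x, levels = s$levels, labels = s$labels),
  df[cols], specs
)

str(df)
```

Columns not named in specs, such as score here, are left unchanged.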