Split data.frame based on levels of a factor into new data.frames

r dataframe r-faq

101,956

Solution 1

I think that split does exactly what you want.

Notice that X is a list of data frames, as seen by str:

X <- split(df, df$g)
str(X)

If you want individual object with the group g names you could assign the elements of X from split to objects of those names, though this seems like extra work when you can just index the data frames from the list split creates.

#I used lapply just to drop the third column g which is no longer needed.
Y <- lapply(seq_along(X), function(x) as.data.frame(X[[x]])[, 1:2]) 

#Assign the dataframes in the list Y to individual objects
A <- Y[[1]]
B <- Y[[2]]
C <- Y[[3]]
D <- Y[[4]]
E <- Y[[5]]

#Or use lapply with assign to assign each piece to an object all at once
lapply(seq_along(Y), function(x) {
    assign(c("A", "B", "C", "D", "E")[x], Y[[x]], envir=.GlobalEnv)
    }
)

Edit Or even better than using lapply to assign to the global environment use list2env:

names(Y) <- c("A", "B", "C", "D", "E")
list2env(Y, envir = .GlobalEnv)
A

Solution 2

Since dplyr 0.8.0 , we can also use group_split which has similar behavior as base::split

library(dplyr)
df %>% group_split(g)

#[[1]]
# A tibble: 5 x 3
#       x      y g    
#   <dbl>  <dbl> <fct>
#1 -1.21  -1.45  A    
#2  0.506  1.10  A    
#3 -0.477 -1.17  A    
#4 -0.110  1.45  A    
#5  0.134 -0.969 A    

#[[2]]
# A tibble: 5 x 3
#       x      y g    
#   <dbl>  <dbl> <fct>
#1  0.277  0.575 B    
#2 -0.575 -0.476 B    
#3 -0.998 -2.18  B    
#4 -0.511 -1.07  B    
#5 -0.491 -1.11  B  
#....

It also comes with argument .keep (which is TRUE by default) to specify whether or not the grouped column should be kept.

df %>% group_split(g, .keep = FALSE)

#[[1]]
# A tibble: 5 x 2
#       x      y
#   <dbl>  <dbl>
#1 -1.21  -1.45 
#2  0.506  1.10 
#3 -0.477 -1.17 
#4 -0.110  1.45 
#5  0.134 -0.969

#[[2]]
# A tibble: 5 x 2
#       x      y
#   <dbl>  <dbl>
#1  0.277  0.575
#2 -0.575 -0.476
#3 -0.998 -2.18 
#4 -0.511 -1.07 
#5 -0.491 -1.11 
#....

The difference between base::split and dplyr::group_split is that group_split does not name the elements of the list based on grouping. So

df1 <- df %>% group_split(g)
names(df1) #gives 
NULL

whereas

df2 <- split(df, df$g)
names(df2) #gives
#[1] "A" "B" "C" "D" "E"

data

set.seed(1234)
df <- data.frame(
      x=rnorm(25),
      y=rnorm(25),
      g=rep(factor(LETTERS[1:5]), 5)
)

101,956

Author by

smillig

Updated on July 08, 2022

Comments

smillig almost 2 years
I'm trying to create separate data.frame objects based on levels of a factor. So if I have:
```
df <- data.frame(
  x=rnorm(25),
  y=rnorm(25),
  g=rep(factor(LETTERS[1:5]), 5)
)
```
How can I split df into separate data.frames for each level of g containing the corresponding x and y values? I can get most of the way there using split(df, df$g), but I'd like the each level of the factor to have its own data.frame.

What's the best way to do this?
smillig about 12 years

Thanks. It was the splitting out each data.frame created using split into individual, separate objects that I was having difficulty with. This is exactly what I was looking for.
PesKchan over 2 years

got it but group_split gives a different result that split?
Ronak Shah over 2 years

Except the names of the list, it should give the same result as split.