sample rows of subgroups from dataframe with dplyr

Solution 1

Yes, you can do this elegantly in dplyr with the do() function. Here is an example:

mtcars %>% 
    group_by(cyl) %>%
    do(sample_n(.,2))

The result looks like this:

Source: local data frame [6 x 11]
Groups: cyl

   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
3 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
4 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
5 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
6 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8

Update:

The do function is no longer needed for sample_n in newer versions of dplyr. Current code for taking a random sample of two rows per group:

mtcars %>% 
    group_by(cyl) %>% 
    sample_n(2)

Solution 2

This is easy to do with data.table, and useful for a big table.

NOTE: As Troy mentions in the comments, there is a more efficient way of doing this with data.table, but I wanted to keep the OP's sample function and format in this answer.

library(data.table)
# 10 million rows; y is the grouping column
DT <- data.table(x = rnorm(10e6, 100, 50), y = letters)

# Return `size` randomly chosen rows of df
sampleGroup <- function(df, size) {
  df[sample(nrow(df), size = size), ]
}

result <- DT[, sampleGroup(.SD, 10), by=y]
print(result)

# y         x y
# 1: a  30.11659 m
# 2: a  57.99974 h
# 3: a  58.13634 o
# 4: a  87.28466 x
# 5: a  85.54986 j
# ---              
# 256: z 149.85817 d
# 257: z 160.24293 e
# 258: z  26.63071 j
# 259: z  17.00083 t
# 260: z 130.27796 f

system.time(DT[, sampleGroup(.SD, 10), by=y])
# user  system elapsed 
# 0.66    0.02    0.69 

Using the iris dataset:
iris <- data.table(iris)
iris[,sampleGroup(.SD, 10), by=Species]
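
The more efficient variant mentioned in the note above comes from Troy's comment: sample row numbers with .I inside each group, then subset the table once. A minimal sketch on iris follows; the names iris_dt and idx and the seed are only illustrative choices, not part of the original answer.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible

iris_dt <- as.data.table(iris)

# .I holds the row numbers within the full table for the current group,
# so this collects 10 row numbers per Species
idx <- iris_dt[, .(idx = sample(.I, 10)), by = Species]$idx

# Subset the original table once with the collected row numbers
result <- iris_dt[idx]
```

This avoids building a sampled copy of .SD for every group, which is presumably where the speed-up reported in the comments comes from.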

Solution 3

That's a good question! I can't see any easy way to do it with the documented dplyr syntax, but how about this as a workaround?

sampleGroup <- function(df, x = 1) {
  df[
    unlist(lapply(attr(df, "indices"), function(r) sample(r, min(length(r), x)))),
  ]
}

sampleGroup(iris %.% group_by(Species),3)

#Source: local data frame [9 x 5]
#Groups: Species
#
#    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#39           4.4         3.0          1.3         0.2     setosa
#16           5.7         4.4          1.5         0.4     setosa
#25           4.8         3.4          1.9         0.2     setosa
#51           7.0         3.2          4.7         1.4 versicolor
#62           5.9         3.0          4.2         1.5 versicolor
#59           6.6         2.9          4.6         1.3 versicolor
#148          6.5         3.0          5.2         2.0  virginica
#103          7.1         3.0          5.9         2.1  virginica
#120          6.0         2.2          5.0         1.5  virginica

EDIT - PERFORMANCE COMPARISON

Here's a test against data.table (both native and with a function call as per the example) for 1 million rows and 26 groups.

Native data.table is about twice as fast as both the dplyr workaround and the data.table version that calls an external function. So dplyr and data.table probably perform about the same.

Hopefully the dplyr guys will give us some native syntax for sampling soon! (or even better, maybe it's already there)

sampleGroup.dt<-function(df,size) {
  df[sample(nrow(df),size=size),]
}

testdata <- data.frame(group = sample(letters, 10e5, TRUE), runif(10e5))

dti<-data.table(testdata)

# using the dplyr workaround with external function call
system.time(sampleGroup(testdata %.% group_by(group),10))
#user  system elapsed 
#0.07    0.00    0.06 

#using native data.table
system.time(dti[dti[,list(val=sample(.I,10)),by="group"]$val])
#user  system elapsed 
#0.04    0.00    0.03 

#using data.table with external function call
#(note: this passes the whole table dti rather than .SD, so each group samples from all rows)
system.time(dti[, sampleGroup.dt(dti, 10), by=group])
#user  system elapsed 
#0.06    0.02    0.08 

Solution 4

dplyr 1.0.2 can subset rows with several slice verbs (see https://dplyr.tidyverse.org/reference/slice.html), including slice_sample() for random sampling:

mtcars %>% 
  slice_sample(n = 10)

Add a group_by() to sample within each category:

mtcars %>% 
  group_by(cyl) %>% 
  slice_sample(n = 2)
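
One point worth checking against the comments below is what happens when a group has fewer rows than requested. As far as the dplyr documentation describes it, slice_sample() without replacement truncates to the group size rather than raising an error. A small sketch with mtcars, where cyl = 6 has only 7 rows (the seed and the name sampled are arbitrary):

```r
library(dplyr)

set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# mtcars has 11, 7 and 14 rows for cyl = 4, 6 and 8;
# asking for 10 rows per group cannot be met for cyl = 6
sampled <- mtcars %>%
  group_by(cyl) %>%
  slice_sample(n = 10)

# Rows actually returned per group
group_sizes <- count(sampled, cyl)
```

Here group_sizes should show 10, 7 and 10 rows, so the small group is returned whole instead of stopping the pipeline.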
Author by

Robert

Updated on July 30, 2022

Comments

  • Robert
    Robert over 1 year

    If I want to randomly select some samples from different groups I use the plyr package and the code below

    require(plyr)
    sampleGroup<-function(df,size) {
      df[sample(nrow(df),size=size),]
    }
    
    iris.sample<-ddply(iris,.(Species),function(df) sampleGroup(df,10))
    

    Here 10 samples are selected from each species.

    Some of my dataframes are very big and my question is can I use the same sampleGroup function with the dplyr package? Or is there another way to do the same in dplyr?

    EDIT

    Version 0.2 of the dplyr package introduced two new functions to select random rows from a table: sample_n and sample_frac.

  • Troy
    Troy over 10 years
    +1 for data.table. Using .I doubles the performance speed: iris[iris[,list(idx=sample(.I,10)),by="Species"]$idx]
  • marbel
    marbel over 10 years
    +1 for Troy's answer using data.table the right way. My answer is probably slower because it copies the table twice.
  • Arun
    Arun over 10 years
    +1 very nice comparisons. But I don't understand your reason for the last benchmark. You're sampling the whole data for 10 elements for every group, whereas you're doing something with attributes in the dplyr case. Why not benchmark the same for dplyr with a function similar to the 3rd case for DT as well?
  • Arun
    Arun over 10 years
    Also, an important aspect of benchmarking is to see how well it scales. With just 26 groups to aggregate by, there'll be no real difference one can detect. Change your line to testdata<-data.frame(group=sample(paste("id", 1:1e5, sep=""),10e5,T),runif(10e5)) and run your benchmarks again
  • eddi
    eddi over 10 years
    I think you want sampleGroup(.SD, 10) (note .SD instead of DT)
  • Romain Francois
    Romain Francois about 10 years
    Please note that the internals of dplyr (e.g. the indices attributes) are likely to evolve. Don't rely on their structure.
  • PhilChang
    PhilChang about 10 years
    @Arun, yes, but you should update dplyr to the newest version, 0.1.3.0.99.
  • PhilChang
    PhilChang about 10 years
    @Arun,sorry, you should use sample_n()
  • Brani
    Brani almost 10 years
    Is there a way to do this without using do?
  • gregmacfarlane
    gregmacfarlane over 9 years
    Can you clock your stuff against the data.table solutions above? I stay in dplyr as much as I can because the grammar is easier (or at least I haven't learned data.table yet). It kind of drives me crazy that every dplyr question on SO gets a data.table answer, so I would like to see if this new code gets close.
  • marbel
    marbel about 8 years
    @gregmacfarlane Just read the comments above and it will make sense. There wasn't an acceptable way to do this with dplyr at the time. After reading the docs as they were then, the OP answered: "Thanks, but I think the solution to this problem is not in the documentation yet. Nice solution with data.table though! – Robert". Also read the other answers from the time the question was asked; they don't look like amazing solutions...
  • user3614783
    user3614783 about 6 years
    @PhilChang I get this error message when I run the following code: clickers %>% group_by(ListName) %>% sample_n(200)
    Error: size must be less or equal than 29 (size of data), set replace = TRUE to use sampling with replacement
  • Bastien
    Bastien about 5 years
    @user3614783, use sample_n(min(n(), 200)). The problem is that some of your groups have fewer than 200 rows.