Sample rows of subgroups from a dataframe with dplyr
Solution 1
Yes, you can do this elegantly in dplyr with the do() function. Here is an example:
mtcars %>%
group_by(cyl) %>%
do(sample_n(.,2))
The result looks like this:
Source: local data frame [6 x 11]
Groups: cyl
   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
3 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
4 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
5 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
6 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Update:
The do function is no longer needed for sample_n in newer versions of dplyr. Current code for taking a random sample of two rows per group:
mtcars %>%
group_by(cyl) %>%
sample_n(2)
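Because the rows are drawn at random, the output changes on every run. A minimal sketch of making the sample reproducible and checking the per-group counts, assuming a dplyr version where sample_n() works on grouped data (the seed value and the name sampled are illustrative only):

```r
library(dplyr)

set.seed(42)  # fix the RNG state so the same rows are drawn each run

sampled <- mtcars %>%
  group_by(cyl) %>%
  sample_n(2)

# Each cyl group should contribute exactly two rows
count(sampled, cyl)
```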
Solution 2
This is easy to do with data.table, and useful for a big table.
NOTE: As mentioned in the comments by Troy, there is a more efficient way of doing this with data.table, but I wanted to respect the OP's sample function and format in the answer.
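library(data.table)
# rep() recycles the 26 letters explicitly so y matches the length of x
DT <- data.table(x = rnorm(10e6, 100, 50), y = rep(letters, length.out = 10e6))
sampleGroup <- function(df, size) {
  # draw `size` random rows from the data passed in
  df[sample(nrow(df), size = size), ]
}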
result <- DT[, sampleGroup(.SD, 10), by=y]
print(result)
#      y         x
#   1: a  30.11659
#   2: a  57.99974
#   3: a  58.13634
#   4: a  87.28466
#   5: a  85.54986
#  ---
# 256: z 149.85817
# 257: z 160.24293
# 258: z  26.63071
# 259: z  17.00083
# 260: z 130.27796
system.time(DT[, sampleGroup(.SD, 10), by=y])
# user system elapsed
# 0.66 0.02 0.69
Using the iris dataset:
iris <- data.table(iris)
iris[,sampleGroup(.SD, 10), by=Species]
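The more efficient approach mentioned in the note (from Troy's comment) samples row numbers with .I and subsets the table once, instead of building a sub-table per group. A sketch of that idea, where iris_dt, idx, and iris_sample are illustrative names and every group is assumed to have at least 10 rows:

```r
library(data.table)

set.seed(1)
iris_dt <- as.data.table(iris)

# .I holds the row positions in the full table, so sampling it per group
# yields 10 row numbers for each Species
idx <- iris_dt[, .(idx = sample(.I, 10)), by = Species]$idx

# Subset the original table once with the collected row numbers
iris_sample <- iris_dt[idx]
```

One caveat: sample() treats a single number n as 1:n, so a group with only one row would need special handling.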
Solution 3
That's a good question! I can't see any easy way to do it with the documented syntax for dplyr, but how about this as a workaround?
sampleGroup <- function(df, x = 1) {
  # for each group, sample up to x of its row indices, then subset once
  idx <- unlist(lapply(attr(df, "indices"), function(r) sample(r, min(length(r), x))))
  df[idx, ]
}
sampleGroup(iris %.% group_by(Species),3)
#Source: local data frame [9 x 5]
#Groups: Species
#
#    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#39           4.4         3.0          1.3         0.2     setosa
#16           5.7         4.4          1.5         0.4     setosa
#25           4.8         3.4          1.9         0.2     setosa
#51           7.0         3.2          4.7         1.4 versicolor
#62           5.9         3.0          4.2         1.5 versicolor
#59           6.6         2.9          4.6         1.3 versicolor
#148          6.5         3.0          5.2         2.0  virginica
#103          7.1         3.0          5.9         2.1  virginica
#120          6.0         2.2          5.0         1.5  virginica
EDIT - PERFORMANCE COMPARISON
Here's a test against data.table (both native and with a function call as per the example) for 1 million rows and 26 groups.
Native data.table is about twice as fast as both the dplyr workaround and the data.table call with an external function. So dplyr and data.table probably perform about the same here.
Hopefully the dplyr guys will give us some native syntax for sampling soon! (or even better, maybe it's already there)
sampleGroup.dt<-function(df,size) {
df[sample(nrow(df),size=size),]
}
testdata<-data.frame(group=sample(letters,10e5,T),runif(10e5))
dti<-data.table(testdata)
# using the dplyr workaround with external function call
system.time(sampleGroup(testdata %.% group_by(group),10))
#user system elapsed
#0.07 0.00 0.06
#using native data.table
system.time(dti[dti[,list(val=sample(.I,10)),by="group"]$val])
#user system elapsed
#0.04 0.00 0.03
#using data.table with external function call
#note: this passes the whole table (dti), not .SD, so each group samples from all rows
system.time(dti[, sampleGroup.dt(dti, 10), by=group])
#user system elapsed
#0.06 0.02 0.08
Solution 4
dplyr 1.0.2 can now subset rows with various slice verbs (https://dplyr.tidyverse.org/reference/slice.html), including random sampling with slice_sample:
mtcars %>%
slice_sample(n = 10)
Add a group_by() to sample within each category:
mtcars %>%
group_by(cyl) %>%
slice_sample(n = 2)
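A short sketch of two related behaviours, assuming dplyr 1.0.0 or later (the names capped and half are illustrative): when a group has fewer rows than n, slice_sample() without replacement returns the whole group rather than raising an error, and the prop argument samples a fraction of each group instead of a fixed count.

```r
library(dplyr)

set.seed(123)

# Ask for 10 rows per group; mtcars has only 7 cars with cyl == 6,
# so that group comes back in full
capped <- mtcars %>%
  group_by(cyl) %>%
  slice_sample(n = 10)

count(capped, cyl)

# Sample half of each group instead of a fixed number of rows
half <- mtcars %>%
  group_by(cyl) %>%
  slice_sample(prop = 0.5)
```

This avoids the "size must be less or equal than ..." error that sample_n() gives for small groups.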
Robert
Updated on July 30, 2022
Comments
-
Robert over 1 year
If I want to randomly select some samples from different groups I use the plyr package and the code below:
require(plyr)
sampleGroup <- function(df, size) {
  df[sample(nrow(df), size = size), ]
}
iris.sample <- ddply(iris, .(Species), function(df) sampleGroup(df, 10))
Here 10 samples are selected from each species.
Some of my dataframes are very big and my question is can I use the same sampleGroup function with the dplyr package? Or is there another way to do the same in dplyr?
EDIT
Version 0.2 of the dplyr package introduced two new functions to select random rows from a table: sample_n and sample_frac.
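As a rough illustration of the second of these functions, sample_frac() takes a fraction of each group rather than a fixed count. A sketch assuming a dplyr version that still provides sample_frac() (iris_tenth is an illustrative name):

```r
library(dplyr)

set.seed(7)

# Take 10% of each species; iris has 50 rows per species
iris_tenth <- iris %>%
  group_by(Species) %>%
  sample_frac(0.1)

count(iris_tenth, Species)
```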
-
Troy over 10 years
+1 for data.table. Using .I doubles the performance speed:
iris[iris[,list(idx=sample(.I,10)),by="Species"]$idx]
-
marbel over 10 years
+1 for Troy's answer using data.table the right way. My answer is probably slower because it copies the table twice.
-
Arun over 10 years
+1 very nice comparisons. But I don't understand your reason for the last benchmark? You're sampling the whole data for 10 elements for every group, whereas you're doing something with attributes for the dplyr case. Why not benchmark the same for dplyr with a function similar to the 3rd case for DT as well?
-
Arun over 10 years
Also, an important aspect of benchmarking is to see how well it scales. With just 26 groups to aggregate by, there'll be no real difference one can detect. Change your line to
testdata<-data.frame(group=sample(paste("id", 1:1e5, sep=""),10e5,T),runif(10e5))
and run your benchmarks again.
-
eddi over 10 years
I think you want sampleGroup(.SD, 10) (note .SD instead of DT).
-
Romain Francois about 10 years
Please note that the internals of dplyr (e.g. the indices attributes) are likely to evolve. Don't rely on their structure.
-
PhilChang about 10 years
@Arun, yes, but you should update dplyr to the newest version, 0.1.3.0.99.
-
PhilChang about 10 years
@Arun, sorry, you should use sample_n().
-
Brani almost 10 years
Is there a way to do this without using do?
-
gregmacfarlane over 9 years
Can you clock your stuff against the data.table solutions above? I stay in dplyr as much as I can because the grammar is easier (or at least I haven't learned data.table yet). It kind of drives me crazy that every dplyr question on SO gets a data.table answer, so I would like to see if this new code gets close.
-
marbel about 8 years
@gregmacfarlane Just read the comments above and it will make sense. There wasn't an acceptable way to do this with dplyr at the time. After reading the docs current at the time, the OP answered: "Thanks, but I think the solution to this problem is not in the documentation yet. Nice solution with data.table though! – Robert". Also read the other answers from the time the question was asked; they don't look like amazing solutions...
-
user3614783 about 6 years
@PhilChang I get this error message when I run the following code:
clickers %>% group_by(ListName) %>% sample_n(200)
Error: size must be less or equal than 29 (size of data), set replace = TRUE to use sampling with replacement
-
Bastien about 5 years
@user3614783, use sample_n(min(n(), 200)). The problem is that some of your groups are not 200 rows long.