Sample random rows in dataframe
Solution 1
First make some data:
> df = data.frame(matrix(rnorm(20), nrow=10))
> df
X1 X2
1 0.7091409 -1.4061361
2 -1.1334614 -0.1973846
3 2.3343391 -0.4385071
4 -0.9040278 -0.6593677
5 0.4180331 -1.2592415
6 0.7572246 -0.5463655
7 -0.8996483 0.4231117
8 -1.0356774 -0.1640883
9 -0.3983045 0.7157506
10 -0.9060305 2.3234110
Then select some rows at random:
> df[sample(nrow(df), 3), ]
X1 X2
9 -0.3983045 0.7157506
2 -1.1334614 -0.1973846
10 -0.9060305 2.3234110
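As the comments below point out, the draw changes on every run unless the seed is fixed, and sample() draws without replacement by default. A minimal sketch of both points, assuming the same kind of toy data as above (the names idx1, idx2 and boot are only illustrative):

```r
# Toy data, built the same way as in the answer above
df <- data.frame(matrix(rnorm(20), nrow = 10))

# Fixing the seed before sampling makes the draw reproducible
set.seed(42)
idx1 <- sample(nrow(df), 3)
set.seed(42)
idx2 <- sample(nrow(df), 3)
identical(idx1, idx2)  # TRUE

# sample() defaults to replace = FALSE, so no row index is picked twice
anyDuplicated(idx1)    # 0

# With replace = TRUE the sample may be larger than the data frame
boot <- df[sample(nrow(df), 15, replace = TRUE), ]
nrow(boot)             # 15
```

The trailing comma in df[idx, ] is what selects rows; df[idx] without it would index columns instead.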
Solution 2
The answer John Colby gives is the right answer. However, if you are a dplyr user, there is also sample_n:
sample_n(df, 10)
randomly samples 10 rows from the data frame. It calls sample.int, so it is really the same answer with less typing (and it is simpler to use with magrittr pipes, since the data frame is the first argument).
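According to the comments below, as of dplyr 1.0.0 sample_n() and sample_frac() are superseded by slice_sample(). A sketch of the equivalent call, assuming dplyr 1.0.0 or later is installed (the requireNamespace() guard is only there so the snippet is skipped cleanly when it is not):

```r
if (requireNamespace("dplyr", quietly = TRUE)) {
  # Toy data, assumed for illustration only
  df <- data.frame(matrix(rnorm(20), nrow = 10))

  # Counterpart of sample_n(df, 3): three rows drawn without replacement
  picked <- dplyr::slice_sample(df, n = 3)
  print(picked)
}
```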
Solution 3
The data.table package supports the idiom DT[sample(.N, M)], which samples M random rows from the data table DT.
library(data.table)
set.seed(10)
mtcars <- data.table(mtcars)
mtcars[sample(.N, 6)]
mpg cyl disp hp drat wt qsec vs am gear carb
1: 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
2: 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
3: 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
4: 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
5: 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
6: 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
Solution 4
Write one! Wrapping JC's answer gives me:
randomRows = function(df,n){
return(df[sample(nrow(df),n),])
}
Now make it better by checking first if n<=nrow(df) and stopping with an error.
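One possible way to add that check is sketched below. The name random_rows_checked and the wording of the error message are made up for illustration; drop = FALSE is an extra assumption so that a one-column data frame is not collapsed to a vector.

```r
# Hypothetical wrapper: like randomRows(), but refuses to over-sample
random_rows_checked <- function(df, n) {
  if (n > nrow(df)) {
    stop("n (", n, ") is larger than the number of rows (", nrow(df), ")")
  }
  # Sample n row indices without replacement and keep the data frame shape
  df[sample(nrow(df), n), , drop = FALSE]
}

# Usage on a built-in data set
five <- random_rows_checked(mtcars, 5)
nrow(five)  # 5

# Asking for more rows than exist now fails with an error
res <- try(random_rows_checked(mtcars, 100), silent = TRUE)
inherits(res, "try-error")  # TRUE
```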
Solution 5
Just for completeness' sake:
dplyr can also draw a proportion (fraction) of the rows with
df %>% sample_frac(0.33)
This is very convenient, e.g. in machine learning, when you need a particular split ratio such as 80%:20%.
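For an 80%:20% split you usually also need the rows that were not drawn. A minimal base R sketch of that idea, which uses sample() on row indices rather than sample_frac(); the toy data and the names train_idx, in_train, train and test are assumptions for illustration:

```r
set.seed(1)

# Toy data with an id column so the two parts can be compared
df <- data.frame(id = 1:10, x = rnorm(10))

# Draw 80% of the row indices without replacement
train_idx <- sample(nrow(df), size = floor(0.8 * nrow(df)))

# Logical mask: TRUE for sampled rows, FALSE for the rest
in_train <- seq_len(nrow(df)) %in% train_idx

train <- df[in_train, ]   # the sampled 80%
test  <- df[!in_train, ]  # the remaining 20%

nrow(train)  # 8
nrow(test)   # 2
```

Indexing with a logical mask keeps the two parts disjoint and also behaves sensibly if no rows happen to be sampled, where a negative index such as df[-train_idx, ] would return zero rows.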
Comments
-
nikhil over 2 years
I am struggling to find the appropriate function that would return a specified number of rows picked randomly, without replacement, from a data frame in R. Can anyone help me out?
-
joran over 12 years
-
a different ben over 10 years
What is unexpected about its treatment of data frames?
-
krlmlr over 10 years
@adifferentben: When I call sample.default(df, ...) for a data frame df, it samples from the columns of the data frame, as a data frame is implemented as a list of vectors of the same length.
-
terdon over 10 years
Is your package still available? I ran install_github('kimisc', 'krlmlr') and got Error: Does not appear to be an R package (no DESCRIPTION). Any way around that?
-
terdon over 10 years
Sorry to bug you again but since you wrote this (great) package, do you think you could comment on this?
-
Joris Meys over 10 years
@krlmlr I don't agree with you. Nice functionality in your package, but sample() works on a data frame as expected. You confuse a data frame with a matrix. It's not. It's a list. It's indeed not intuitive to see it that way, but that's because far too many people never realized a data frame is a list. Also note that installing your package may break other code dependent on the original behaviour of sample().
-
krlmlr over 10 years
@JorisMeys: Agreed, except for the "as expected" part. Just because a data frame is implemented as a list internally, it doesn't mean it should behave as one. The [ operator for data frames is a counterexample. Also, please tell me: Have you ever, just one single time, used sample to sample columns from a data frame?
-
Joris Meys over 10 years
@krlmlr The [ operator is not a counterexample: iris[2] works like a list, as does iris[[2]]. Or iris$Species, lapply(iris, mean), ... Data frames are lists. So I expect them to behave like them. And yes, I have actually used sample(myDataframe). On a dataset where every variable contains expression data of a single gene. Your specific method helps novice users, but it also effectively changes the way sample() behaves. Note I use "as expected" from a programmer's view. Which is different from the general intuition. There's a lot in R that's not compatible with general intuition... ;)
-
krlmlr over 10 years
@JorisMeys: Fair enough. I was wrong assuming that no one would ever use sample(dataframe)... I'll change the function name to sample.rows and not use it as S3 method. -- Concerning [, I was referring to the myList[i, j] syntax.
-
stackoverflowuser2010 over 10 years
Can someone explain why sample(df,3) does not work? Why do you need df[sample(nrow(df), 3), ]?
-
stackoverflowuser2010 over 10 years
I found this StackOverflow question because I'm new to R, and I just tried sample(dataframe), resulting in unexpected bizarreness. I agree with krlmlr here. Why does sample(dataframe, 3) not give me 3 random rows from dataframe?
-
krlmlr over 10 years
@stackoverflowuser2010: See the updated version of this answer for a solution.
-
David Braun over 10 years
@stackoverflowuser2010, you can type ?sample and see that the first argument in the sample function must be a vector or a positive integer. I don't think a data.frame works as a vector in this case.
-
CousinCocaine about 10 years
Remember to set your seed (e.g. set.seed(42)) every time you want to reproduce that specific sample.
-
Ari B. Friedman over 9 years
sample.int would be slightly faster I believe: library(microbenchmark); microbenchmark(sample(10000, 100), sample.int(10000, 100), times = 10000)
-
Roger Filmyer over 9 years
@stackoverflowuser2010 On a data frame, sample selects random columns (e.g. your variables) instead of random rows (your observations). So you have to sample row indexes instead of the data frame.
-
user2113499 over 8 years
Is there a way to have the random rows be consecutive?
-
Davide Piffer about 5 years
I want to apply this function n (say 1000) times to a dataframe to randomly extract a specified number of rows (with replacement) n times. That is, I want to repeat this function n times (with replacement) to get n random subsets. How do I do it?
-
John Colby about 5 years
@DavidePiffer replicate(1000, df[sample(nrow(df), 3), ], simplify=FALSE)
-
mLstudent33 about 4 years
Is this with replacement or without?
-
Matt_B over 3 years
As of dplyr 1.0.0, sample_n (and sample_frac) have been superseded by slice_sample, though they remain for now.
-
user11130854 over 3 years
This appears to sample without replacement, and hence also outputs a sample of size min(nrow(df), 10), so this might not be what is needed.
-
0Knowledge about 3 years
@JohnColby If I want to save 2 data frames (one for the randomly selected rows and a second for the rest of the rows of the data frame), how do I write that? That is, for the row numbers (1, 3, 4, 5, 6, 7, 8), how do I save them?
-
0Knowledge about 3 years
Suppose I have 1000 rows in my df. After applying your code, 100 rows will be selected randomly. How can I then store the remaining 900 rows (the ones that were not selected)?
-
Leopoldo Sanczyk about 3 years
@Akib62 try (rest_of_diamonds <- diamonds[which(!diamonds %in% sample_of_diamonds)])
-
0Knowledge about 3 years
Not working. When I use the code given in your comment, I get the same output as the diamonds or main dataset.
-
Leopoldo Sanczyk about 3 years
@Akib62 since that selects the elements not in sample_of_diamonds, can you confirm sample_of_diamonds is not empty? That could explain your problem.
-
0Knowledge about 3 years
Say I have 20 rows in my dataset. When I apply sample_of_diamonds <- diamonds[sample(nrow(diamonds),10),] I get 10 rows randomly, and with rest_of_diamonds <- diamonds[which(!diamonds %in% sample_of_diamonds)] I get 20 rows (main dataset).
-
Leopoldo Sanczyk about 3 years
@Akib62 I guess you checked it, and those 10 rows in the sample_of.. dataset are effectively inside the rest_of... dataset. That's weird, because the line says explicitly to ignore those in the main dataset. Could it be the format, some type casting? Did you try to check that, or compare the content of some row in both sets (coding)?
-
somehume about 3 years
Besides not working otherwise, why is it necessary to have the comma after "3)"?
-
quickshiftin about 2 years
A note from ?sample_frac: "[Superseded] ‘sample_n()’ and ‘sample_frac()’ have been superseded in favour of ‘slice_sample()’"