Sample random rows in dataframe

Solution 1

First make some data:

> df = data.frame(matrix(rnorm(20), nrow=10))
> df
           X1         X2
1   0.7091409 -1.4061361
2  -1.1334614 -0.1973846
3   2.3343391 -0.4385071
4  -0.9040278 -0.6593677
5   0.4180331 -1.2592415
6   0.7572246 -0.5463655
7  -0.8996483  0.4231117
8  -1.0356774 -0.1640883
9  -0.3983045  0.7157506
10 -0.9060305  2.3234110

Then select some rows at random:

> df[sample(nrow(df), 3), ]
           X1         X2
9  -0.3983045  0.7157506
2  -1.1334614 -0.1973846
10 -0.9060305  2.3234110
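The comments below ask why sample(df, 3) does not do this directly, and how to reproduce a particular draw. A minimal sketch of both points, assuming a made-up data frame called demo (not the df printed above):

```r
# Hypothetical demo data, not the df shown in the answer
demo <- data.frame(matrix(rnorm(20), nrow = 10))

# Fix the seed so the same row numbers are drawn each time
set.seed(42)
idx <- sample(nrow(demo), 3)   # 3 distinct row numbers from 1..nrow(demo)
picked <- demo[idx, ]          # trailing comma: these rows, all columns

# Resetting the seed reproduces the same draw
set.seed(42)
idx_again <- sample(nrow(demo), 3)

# A data frame is a list of columns, so sample() on the data frame
# itself permutes the columns and leaves the rows untouched
shuffled_cols <- sample(demo)
```

Sampling is without replacement by default; passing replace = TRUE to sample() allows the same row to be drawn more than once.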

Solution 2

John Colby's answer is the right one. However, if you are a dplyr user, there is also sample_n:

sample_n(df, 10)

randomly samples 10 rows from the data frame. It calls sample.int, so it really is the same answer with less typing, and it is easier to use in a magrittr pipeline because the data frame is the first argument.
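A comment below notes that, as of dplyr 1.0.0, sample_n has been superseded by slice_sample. A minimal sketch of the equivalent call, assuming dplyr 1.0.0 or later is installed; demo is a made-up data frame used only for illustration:

```r
library(dplyr)

# Hypothetical demo data
demo <- data.frame(matrix(rnorm(20), nrow = 10))

# Draw 3 rows without replacement (the default)
three_rows <- slice_sample(demo, n = 3)

# The same call in a pipeline, since the data frame is the first argument
three_piped <- demo %>% slice_sample(n = 3)
```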

Solution 3

With the data.table package, the idiom DT[sample(.N, M)] samples M random rows from the data table DT (.N is the number of rows in DT).

library(data.table)
set.seed(10)

mtcars <- data.table(mtcars)
mtcars[sample(.N, 6)]

    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1: 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
2: 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
3: 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
4: 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
5: 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
6: 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2

Solution 4

Write one! Wrapping JC's answer gives me:

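# Return n rows of df chosen at random, without replacement
randomRows <- function(df, n) {
   # drop = FALSE keeps a data frame even when df has a single column
   df[sample(nrow(df), n), , drop = FALSE]
}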

Now make it better by first checking that n <= nrow(df) and stopping with an error if it is not.
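One way that check might look is sketched below. The name randomRowsChecked and the wording of the error message are illustrative choices, not part of the original answer. Note that sample() already fails when asked for more rows than exist without replacement, so the explicit check mainly gives a clearer message.

```r
# Hypothetical stricter variant of randomRows(); the name is illustrative
randomRowsChecked <- function(df, n) {
  stopifnot(is.data.frame(df))
  if (n > nrow(df)) {
    stop("n (", n, ") exceeds the number of rows (", nrow(df), ")")
  }
  # sample.int draws n distinct row numbers from 1..nrow(df)
  df[sample.int(nrow(df), n), , drop = FALSE]
}

demo <- data.frame(x = 1:5, y = letters[1:5])

# A valid request returns a data frame with 3 rows
ok <- randomRowsChecked(demo, 3)

# Asking for too many rows stops with the custom error message
too_many <- tryCatch(
  randomRowsChecked(demo, 6),
  error = function(e) conditionMessage(e)
)
```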

Solution 5

Just for completeness' sake:

dplyr can also draw a given fraction of the rows with sample_frac:

df %>% sample_frac(0.33)

This is very convenient, for example in machine learning, when you need a particular split ratio such as 80%:20%.
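An 80%:20% split needs both the sampled rows and the ones left over, which is also what several comments below ask about. A minimal sketch in base R rather than dplyr, assuming a made-up data frame called demo; it keeps the sampled row numbers so that the complement can be taken with negative indexing:

```r
# Hypothetical demo data
set.seed(1)
demo <- data.frame(id = 1:10, value = rnorm(10))

# Row numbers for 80% of the rows, drawn without replacement
train_idx <- sample(nrow(demo), size = floor(0.8 * nrow(demo)))

train <- demo[train_idx, ]    # the sampled rows
test  <- demo[-train_idx, ]   # every row that was not sampled
```

Storing the indices instead of comparing the two data frames afterwards avoids having to match rows by content. Negative indexing assumes train_idx is not empty.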

Author: nikhil

Updated on November 16, 2021

Comments

  • nikhil
    nikhil over 2 years

    I am struggling to find the appropriate function that would return a specified number of rows picked up randomly without replacement from a data frame in R language? Can anyone help me out?

  • joran
    joran over 12 years
    @nikhil See here and here for starters. You can also type ?sample in the R console to read about that function.
  • a different ben
    a different ben over 10 years
    What is unexpected about its treatment of data frames?
  • krlmlr
    krlmlr over 10 years
    @adifferentben: When I call sample.default(df, ...) for a data frame df, it samples from the columns of the data frame, as a data frame is implemented as a list of vectors of the same length.
  • terdon
    terdon over 10 years
    Is your package still available? I ran install_github('kimisc', 'krlmlr') and got Error: Does not appear to be an R package (no DESCRIPTION). Any way around that?
  • terdon
    terdon over 10 years
    Sorry to bug you again but since you wrote this (great) package, do you think you could comment on this?
  • Joris Meys
    Joris Meys over 10 years
    @krlmlr I don't agree with you. Nice functionality in your package, but sample() works on a data frame as expected. You confuse a data frame with a matrix. It's not. It's a list. It's indeed not intuitive to see it that way, but that's because far too many people never realized a data frame is a list. Also note that installing your package may break other code dependent on the original behaviour of sample().
  • krlmlr
    krlmlr over 10 years
    @JorisMeys: Agreed, except for the "as expected" part. Just because a data frame is implemented as a list internally, it doesn't mean it should behave as one. The [ operator for data frames is a counterexample. Also, please tell me: Have you ever, just one single time, used sample to sample columns from a data frame?
  • Joris Meys
    Joris Meys over 10 years
    @krlmlr The [ operator is not a counterexample: iris[2] works like a list, as does iris[[2]]. Or iris$Species, lapply(iris, mean), ... Data frames are lists. So I expect them to behave like them. And yes, I have actually used sample(myDataframe). On a dataset where every variable contains expression data of a single gene. Your specific method helps novice users, but it also effectively changes the way sample() behaves. Note I use "as expected" from a programmer's view. Which is different from the general intuition. There's a lot in R that's not compatible with general intuition... ;)
  • krlmlr
    krlmlr over 10 years
    @JorisMeys: Fair enough. I was wrong assuming that no one would ever use sample(dataframe)... I'll change the function name to sample.rows and not use it as S3 method. -- Concerning [, I was referring to the myList[i, j] syntax.
  • stackoverflowuser2010
    stackoverflowuser2010 over 10 years
    Can someone explain why sample(df,3) does not work? Why do you need df[sample(nrow(df), 3), ]?
  • stackoverflowuser2010
    stackoverflowuser2010 over 10 years
    I found this StackOverflow question because I'm new to R, and I just tried sample(dataframe), resulting in unexpected bizarreness. I agree with krlmlr here. Why does sample(dataframe, 3) not give me 3 random rows from dataframe?
  • krlmlr
    krlmlr over 10 years
    @stackoverflowuser2010: See the updated version of this answer for a solution.
  • David Braun
    David Braun over 10 years
    @stackoverflowuser2010, you can type ?sample and see that the first argument in the sample function must be a vector or a positive integer. I don't think a data.frame works as a vector in this case.
  • CousinCocaine
    CousinCocaine about 10 years
    Remember to set your seed (e.g. set.seed(42) ) every time you want to reproduce that specific sample.
  • Ari B. Friedman
    Ari B. Friedman over 9 years
    sample.int would be slightly faster I believe: library(microbenchmark);microbenchmark( sample( 10000, 100 ), sample.int( 10000, 100 ), times = 10000 )
  • Roger Filmyer
    Roger Filmyer over 9 years
    @stackoverflowuser2010 On a data frame, sample selects random columns (e.g. your variables) instead of random rows (your observations). So you have to sample row indexes instead of the data frame.
  • user2113499
    user2113499 over 8 years
    Is there a way to have the random rows be consecutive?
  • Davide Piffer
    Davide Piffer about 5 years
    I want to apply this function n (say 1000) times to a dataframe to randomly extract a specified number of rows (with replacement) n times. That is, I want to repeat this function n times (with replacement) to get n random subsets. How do I do it?
  • John Colby
    John Colby about 5 years
    @DavidePiffer replicate(1000, df[sample(nrow(df), 3), ], simplify=FALSE)
  • mLstudent33
    mLstudent33 about 4 years
    Is this with replacement or without?
  • Matt_B
    Matt_B over 3 years
    As of dplyr 1.0.0, sample_n (and sample_frac) have been superseded by slice_sample, though they remain for now.
  • user11130854
    user11130854 over 3 years
    This appears to sample without replacement, and hence also outputs a sample of size min(nrow(df), 10), so this might not be what is needed.
  • 0Knowledge
    0Knowledge about 3 years
    @JohnColby If I want to save two data frames (one with the randomly selected rows and a second with the rest of the rows), how should I write that? That is, for the row numbers (1, 3, 4, 5, 6, 7, 8), how do I save them?
  • 0Knowledge
    0Knowledge about 3 years
    Suppose I have 1000 rows in my df. After applying your code, 100 rows will be selected randomly. How can I then store the remaining 900 rows (the ones that were not selected)?
  • Leopoldo Sanczyk
    Leopoldo Sanczyk about 3 years
    @Akib62 try (rest_of_diamonds <- diamonds[which(!diamonds %in% sample_of_diamonds)])
  • 0Knowledge
    0Knowledge about 3 years
    Not working. When I am using your code (given in the comment) getting the same output as the diamonds or main dataset.
  • Leopoldo Sanczyk
    Leopoldo Sanczyk about 3 years
    @Akib62 since that selects the elements not in sample_of_diamonds, can you confirm sample_of_diamonds is not empty? That could explain your problem.
  • 0Knowledge
    0Knowledge about 3 years
    Say, I have 20 rows in my dataset. So when I am applying sample_of_diamonds <- diamonds[sample(nrow(diamonds),10),] I am getting 10 rows randomly and rest_of_diamonds <- diamonds[which(!diamonds %in% sample_of_diamonds)] I am getting 20 rows (main dataset)
  • Leopoldo Sanczyk
    Leopoldo Sanczyk about 3 years
    @Akib62 I guess you checked it, and those 10 rows in sample_of.. dataset are effectively inside the rest_of... dataset. That's weird, because the line says explicitly to ignore those in the main dataset. Could be the format, some type casting? Did you try to check that, or compare the content of some row in both sets (coding)?
  • somehume
    somehume about 3 years
    Besides not working otherwise, why is it necessary to have the comma after "3)"?
  • quickshiftin
    quickshiftin about 2 years
    A note from ?sample_frac: "[Superseded] ‘sample_n()’ and ‘sample_frac()’ have been superseded in favour of ‘slice_sample()’"