Random subsampling in R

14,022

Solution 1

What the OP is probably looking for here is a way to bootstrap the data for a Hill or Simpson diversity index, which provides some assumptions about the data being sampled:

  • Each row is a site, each column is a species, and each value is a count.
  • Individuals are being sampled for the bootstrap, NOT THE COUNTS.

To do this, bootstrapping programs will often model the counts as a string of individuals. For instance, if we had a record like so:

a  b  c
2  3  4

The record would be modeled as:

aabbbcccc

Then, a sample is usually drawn WITH replacement from the string to create a larger set based on the model set.

Bootstrapping a site: In R, we have a way to do this that is actually quite simple with the 'sample' function. If you select from the column numbers, you can provide probabilities using the count data.

# Test data.
data <- data.frame(a=2, b=3, c=4)

# Sampling from first row of data.
row <- 1
N_samples <- 50

samples <- sample(1:ncol(data), N_samples, rep=TRUE, prob=data[row,])

Converting the sample into the format of the original table: Now we have an array of samples, with each item indicating the column number that the sample belongs to. We can convert back to the original table format in multiple ways, but here is a fairly simple one using a simple counting loop:

# Count the number of each entry and store in a list.
for (i in 1:ncol(data)){
    site_sample[[i]] <- sum(samples==i)
}

# Unlist the data to get an array that represents the bootstrap row.
site_sample <- unlist(site_sample)

Solution 2

Just stumbled upon this thread, and the vegan package has a function called 'rrarify' that does precisely what you're looking to do (and in the same ecological context, too)

Solution 3

Try something like this:

mysample <- mydata[sample(1:nrow(mydata), 50, replace=FALSE),]
Share:
14,022
Paris St
Author by

Paris St

Updated on June 14, 2022

Comments

  • Paris St
    Paris St almost 2 years

    I am new in R, therefore my question might be really simple. I have a 40 sites with abundances of zooplankton.

    My data looks like this (columns are species abundances and rows are sites)

    0   0   0   0   0   2   0   0   0   
    0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   85  0
    0   0   0   0   0   45  5   57  0
    0   0   0   0   0   13  0   3   0
    0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   7   0
    0   3   0   0   12  8   0   57  0
    0   0   0   0   0   0   0   1   0
    0   0   0   0   0   59  0   0   0
    0   0   0   0   4   0   0   0   0
    0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0
    0   105 0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0
    0   0   0   0   1   0   0   100 0
    0   35  0   55  0   0   0   0   0
    1   4   0   0   0   0   0   0   0
    0   0   0   0   0   34  21  0   0
    0   0   0   0   0   9   17  0   0
    0   54  0   0   0   27  5   0   0
    0   1   0   0   0   1   0   0   0
    0   17  0   0   0   54  3   0   0
    

    What I would like to is take a random sub-sample (e.g. 50 individuals) from each site without replacement several times (bootstrap) in order to calculate diversity indexes to the new standardized abundances afterwards.