SQL random sample with groups

20,301

Solution 1

You want a stratified sample. I would recommend doing this by sorting the data by course code and doing an nth sample. Here is one method that works best if you have a large population size:

select d.*
from (select d.*,
             row_number() over (order by coursecode, newid) as seqnum,
             count(*) over () as cnt
      from degree d
     ) d
where seqnum % (cnt / 500) = 1;

EDIT:

You can also calculate the population size for each group "on the fly":

select d.*
from (select d.*,
             row_number() over (partition by coursecode order by newid) as seqnum,
             count(*) over () as cnt,
             count(*) over (partition by coursecode) as cc_cnt
      from degree d
     ) d
where seqnum < 500 * (cc_cnt * 1.0 / cnt)

Solution 2

Add a table for storing population.

I think it should be like this:

SELECT *
FROM (
    SELECT id, coursecode, ROW_NUMBER() OVER (PARTITION BY coursecode ORDER BY NEWID()) AS rn
    FROM degree) t
    LEFT OUTER JOIN
    population p ON t.coursecode = p.coursecode
WHERE
    rn <= p.SampleSize

Solution 3

It is not necessary to partition the population at all.

If you are taking a sample of 1000 from a population among hundreds of course codes, it stands to reason that many of those course codes will not be selected at all in any one sampling.

If the population is uniform (say, a continuous sequence of student IDs), a uniformly-distributed sample will automatically be representative of population weighting by course code. Since newid() is a uniform random sampler, you're good to go out of the box.

The only wrinkle that you might encounter is if a student ID is a associated with multiple course codes. In this case make a unique list (temporary table or subquery) containing a sequential id, student id and course code, sample the sequential id from it, grouping by student id to remove duplicates.

Share:
20,301
Simon
Author by

Simon

Updated on July 09, 2022

Comments

  • Simon
    Simon almost 2 years

    I have a university graduate database and would like to extract a random sample of data of around 1000 records.

    I want to ensure the sample is representative of the population so would like to include the same proportions of courses eg

    enter image description here

    I could do this using the following:

    select top 500 id from degree where coursecode = 1 order by newid()
    union
    select top 300 id from degree where coursecode = 2 order by newid()
    union
    select top 200 id from degree where coursecode = 3 order by newid()
    

    but we have hundreds of courses codes so this would be time consuming and I would like to be able to reuse this code for different sample sizes and don't particularly want to go through the query and hard code the sample sizes.

    Any help would be greatly appreciated