Select a random sample from SQL Server quickly


Solution 1

If pseudo-random sampling is acceptable and you're on SQL Server 2005/2008, take a look at TABLESAMPLE. Here is an example against AdventureWorks2008 on SQL Server 2008 that samples by row count:

USE AdventureWorks2008;
GO

-- TABLESAMPLE runs first; the WHERE filter is applied to the sampled rows only
SELECT FirstName, LastName
FROM Person.Person
TABLESAMPLE (100 ROWS)
WHERE EmailPromotion = 2;

The catch is that TABLESAMPLE isn't truly random at the row level: it picks data pages at random and returns every row on the chosen pages. The row count is therefore approximate, so you may not get back exactly 5000 rows; to get an exact count, sample more than you need and limit the result with TOP. If you're on SQL Server 2000, you'll have to either build a temporary table of primary keys to match against or use a method based on NEWID().
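
A minimal sketch of combining TABLESAMPLE with TOP for the 5000-row case. The table dbo.BigTable, the column IsActive, and the 1 percent figure are hypothetical placeholders; the sample has to be large enough that at least 5000 rows survive the WHERE filter, since the filter only sees the sampled rows.

```sql
-- Hypothetical names: dbo.BigTable and IsActive are placeholders, not from the question.
-- Assumes SQL Server 2005 or later with database compatibility level 90 or higher.
SELECT TOP (5000) *
FROM dbo.BigTable TABLESAMPLE (1 PERCENT)  -- roughly 100,000 of 10 million rows, picked page by page
WHERE IsActive = 1                         -- applied to the sampled rows only
ORDER BY NEWID();                          -- shuffles the small sample, not the whole table
```

Because whole pages are sampled, rows stored next to each other tend to appear together, so this is only as random as the physical layout of the table allows.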

Solution 2

Have you looked into using the TABLESAMPLE clause?

For example:

select *
from HumanResources.Department tablesample (5 percent)

Solution 3

SQL Server 2000 solution, according to Microsoft (an alternative to the slow NEWID() approach on larger tables):

SELECT *
FROM Table1
WHERE (ABS(CAST((BINARY_CHECKSUM(*) * RAND()) AS int)) % 100) < 10

The SQL Server team at Microsoft realized that not being able to take random samples of rows easily was a common problem in SQL Server 2000; so, the team addressed the problem in SQL Server 2005 by introducing the TABLESAMPLE clause. This clause selects a subset of rows by choosing random data pages and returning all of the rows on those pages. However, for those of us who still have products that run on SQL Server 2000 and need backward-compatibility, or who need truly row-level randomness, the BINARY_CHECKSUM query is a very effective workaround.

Explanation can be found here: http://msdn.microsoft.com/en-us/library/cc441928.aspx
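
For the case where TABLESAMPLE is unavailable (for example, compatibility level 80), here is a rough sketch of the same row-level filtering idea trimmed to exactly 5000 rows. It is not the query from the linked article: dbo.BigTable and IsActive are hypothetical names, the 0.5 percent rate is an assumption, and NEWID() is used as the per-row source of randomness in place of RAND().

```sql
-- Hypothetical names: dbo.BigTable and IsActive are placeholders.
-- Uses only SQL Server 2000 syntax (TOP without parentheses, no TABLESAMPLE).
SELECT TOP 5000 *
FROM dbo.BigTable
WHERE IsActive = 1
  AND ABS(BINARY_CHECKSUM(NEWID()) % 1000) < 5  -- keeps roughly 0.5% of rows, decided row by row
ORDER BY NEWID();                               -- sorts only the rows that survived the filter
```

This still scans the table, but the expensive sort runs over tens of thousands of rows instead of millions. If the filter keeps fewer than 5000 rows, raise the threshold.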

Solution 4

Yeah, TABLESAMPLE is your friend (note that it's not random in the statistical sense of the word): TABLESAMPLE on MSDN

Author: Byron Whitlock

Software architect with over 15 years of experience.

Updated on July 09, 2022

Comments

  • Byron Whitlock
    Byron Whitlock almost 2 years

    I have a huge table of more than 10 million rows. I need to efficiently grab a random sample of 5000 from it. I have some constraints that reduce the total rows I am looking for to about 9 million.

    I tried using order by NEWID(), but that query will take too long as it has to do a table scan of all rows.

    Is there a faster way to do this?

    • user2120901
      user2120901 about 15 years
      Are you using PHP, ASP, or anything like that?
    • Byron Whitlock
      Byron Whitlock about 15 years
      Why would it matter? I certainly don't want the app layer to do this!
  • Byron Whitlock
    Byron Whitlock about 15 years
    We are using SQL Server 2005, but our database compatibility level is at 80, so no TABLESAMPLE. :( Any other ideas?
  • Albert
    Albert about 15 years
    select * from customers order by newid()
  • friism
    friism about 15 years
    Wrong, tablesample works by selecting an appropriate number of pages and then returning all the rows found on those pages. The whole point is avoiding hitting all the pages holding the table.
  • K. Brian Kelley
    K. Brian Kelley about 15 years
    Sorry, you are right. I read the algorithm wrong. It determines the number of rows and then includes or skips entire pages to reach approximately that number.
  • Manuel Castro
    Manuel Castro over 11 years
    This issue was such a mess that Microsoft had to make this native TABLESAMPLE implementation and it's the most stable and efficient in all scenarios
  • Marc Wittke
    Marc Wittke almost 7 years
    Sidenote: You're applying the WHERE clause to the already truncated sample, so don't expect it to return matching rows under all circumstances.