"order by newid()" - how does it work?

68,701

Solution 1

I know what NewID() does, I'm just trying to understand how it would help in the random selection. Is it that (1) the select statement will select EVERYTHING from mytable, (2) for each row selected, tack on a uniqueidentifier generated by NewID(), (3) sort the rows by this uniqueidentifier and (4) pick off the top 100 from the sorted list?

Yes. this is pretty much exactly correct (except it doesn't necessarily need to sort all the rows). You can verify this by looking at the actual execution plan.

SELECT TOP 100 * 
FROM master..spt_values 
ORDER BY NEWID()

The compute scalar operator adds the NEWID() column on for each row (2506 in the table in my example query) then the rows in the table are sorted by this column with the top 100 selected.

SQL Server doesn't actually need to sort the entire set from positions 100 down so it uses a TOP N sort operator which attempts to perform the entire sort operation in memory (for small values of N)

Plan

Solution 2

In general it works like this:

  • All rows from mytable is "looped"
  • NEWID() is executed for each row
  • The rows are sorted according to random number from NEWID()
  • 100 first row are selected

Solution 3

The key here is the NEWID function, which generates a globally unique identifier (GUID) in memory for each row. By definition, the GUID is unique and fairly random; so, when you sort by that GUID with the ORDER BY clause, you get a random ordering of the rows in the table. Taking the top 10 percent (or whatever percentage you want) will give you a random sampling of the rows in the table.

NEWID query is proposed; it is simple and works very well for small tables. However, the NEWID query has a big drawback when you use it for large tables. The ORDER BY clause causes all of the rows in the table to be copied into the tempdb database, where they are sorted. This causes two problems: The sorting operation usually has a high cost associated with it. Sorting can use a lot of disk I/O and can run for a long time. In the worst-case scenario, tempdb can run out of space. In the best-case scenario, tempdb can take up a large amount of disk space that never will be reclaimed without a manual shrink command. What you need is a way to select rows randomly that will not use tempdb and will not get much slower as the table gets larger. Here is a new idea on how to do that:

SELECT * FROM master..spt_values
  WHERE (ABS(CAST(
  (BINARY_CHECKSUM(*) *
  RAND()) as int)) % 100) < 10

The basic idea behind this query is that we want to generate a random number between 0 and 99 for each row in the table, and then choose all of those rows whose random number is less than the value of the specified percent. In this example, we want approximately 10 percent of the rows selected randomly; therefore, we choose all of the rows whose random number is less than 10.

Solution 4

as MSDN says:

NewID() Creates a unique value of type uniqueidentifier.

and your table will be sorted by this random values.

Share:
68,701
Tola Odejayi
Author by

Tola Odejayi

Updated on July 09, 2022

Comments

  • Tola Odejayi
    Tola Odejayi almost 2 years

    I know that If I run this query

    select top 100 * from mytable order by newid()
    

    it will get 100 random records from my table.

    However, I'm a bit confused as to how it works, since I don't see newid() in the select list. Can someone explain? Is there something special about newid() here?

  • Tola Odejayi
    Tola Odejayi about 13 years
    Thanks - I know what NewID() does, I'm just trying to understand how it would help in the random selection. Is it that [1] the select statement will select EVERYTHING from mytable, [2] for each row selected, tack on a uniqueidentifier generated by NewID(), [3] sort the rows by this uniqueidentifier and [4] pick off the top 100 from the sorted list?
  • Tola Odejayi
    Tola Odejayi about 13 years
    Got it! And yes, you're right - once I've determined the top 100 rows from the entire set, there's no need to sort the rest.
  • Diego Jancic
    Diego Jancic about 8 years
    Good idea, but it would only work if you are trying to do a TOP N%. If you do a regular TOP 500 for example it won't work. Unless you have a different way to that (which I would love to hear). Thanks.
  • K09P
    K09P about 5 years
    So, is it safe to ensure that no data is written? Since this is a SELECT query, the NEWID() will calculate a randomized identifier just for the query, it won't be updating anything in the database with this new id, right?
  • Martin Smith
    Martin Smith about 5 years
    Yes it won't affect the tables you are selecting from. At least some of the data will be temporarily written to a worktable in tempdb to hold at least the TOP N results but nothing written to the user database