Generating m distinct random numbers in the range [0..n-1]


Solution 1

Pure mathematics:
Let's count the expected number of rand() calls in both cases and compare the results:

Case 1: consider step i = k, when k numbers have already been chosen. A single rand() call yields a number we don't have yet with probability p = (n-k)/n. We need the expected number of calls until such a new number appears.

The probability of getting it on the first call is p; on exactly the second call, q * p, where q = 1 - p. In general, the probability of getting it on exactly the j-th call is q^(j-1) * p. Thus, the expected number of calls is
Sum[ j * q^(j-1) * p ], j = 1 --> INF. This sum is equal to 1/p (it is the mean of a geometric distribution).

So at step i = k you perform, on average, 1/p = n/(n-k) calls of the rand() function.
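The closed form 1/p can also be checked by hand. A short derivation, assuming only the geometric series and q = 1 - p, with j denoting the number of calls made at this step, might look like this:

```latex
\begin{align*}
E &= \sum_{j=1}^{\infty} j\,q^{j-1}p
   = p\,\frac{d}{dq}\sum_{j=0}^{\infty} q^{j}
   = p\,\frac{d}{dq}\,\frac{1}{1-q} \\
  &= \frac{p}{(1-q)^{2}}
   = \frac{p}{p^{2}}
   = \frac{1}{p}, \qquad 0 \le q < 1 .
\end{align*}
```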

Now let's sum over all steps:

Sum[ n/(n - k) ], k = 0 --> m - 1, which equals n * T: the expected number of rand() calls in method 1.
Here T = Sum[ 1/(n - k) ], k = 0 --> m - 1
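As a sanity check on n * T, here is a rough Python sketch that compares the formula with a simulation of method 1. The function names, the fixed seed and the use of a set in place of the result array are all assumptions made for illustration; only the number of random draws is counted.

```python
import random

def expected_calls_method1(n, m):
    """n * T: the expected number of rand() calls for method 1."""
    return sum(n / (n - k) for k in range(m))

def simulated_calls_method1(n, m, rng):
    """Run method 1 once and return how many random draws it needed."""
    seen = set()  # stands in for the result array (illustrative choice)
    calls = 0
    while len(seen) < m:
        calls += 1
        seen.add(rng.randrange(n))  # a repeated value leaves the set unchanged
    return calls

def average_simulated_calls(n, m, trials=2000, seed=12345):
    """Average the draw count over several independent runs."""
    rng = random.Random(seed)
    total = sum(simulated_calls_method1(n, m, rng) for _ in range(trials))
    return total / trials
```

With enough trials, average_simulated_calls(n, m) should land close to expected_calls_method1(n, m).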

Case 2:

Here rand() is called inside random_shuffle n - 1 times (in most implementations).

To choose the method, compare these two values: n * T versus n - 1.
Calculate T as described above. If T < (n - 1)/n, the first method makes fewer rand() calls on average; otherwise use the second method. Note that this counts only rand() calls: the cost of searching the result array in method 1 is not included.
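The comparison above can be sketched in a few lines of Python. The names harmonic_tail and prefer_method1 are hypothetical, and the function looks only at the expected number of rand() calls, exactly as in the derivation:

```python
def harmonic_tail(n, m):
    """T = sum of 1/(n - k) for k = 0 .. m-1."""
    return sum(1.0 / (n - k) for k in range(m))

def prefer_method1(n, m):
    """True when method 1 is expected to make fewer rand() calls than a full shuffle."""
    if n < 1 or not 0 <= m <= n:
        raise ValueError("need n >= 1 and 0 <= m <= n")
    # Method 1 costs n * T calls on average, the shuffle costs n - 1.
    return harmonic_tail(n, m) < (n - 1) / n
```

As a rough guide, for large n the sum T is close to ln(n/(n-m)), so the crossover sits near m ≈ 0.63 n when only rand() calls are counted.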

Solution 2

Check the Wikipedia description of the original Fisher-Yates algorithm. It advocates using essentially your method 1 for up to n/2, and your method 2 for the remainder.

Solution 3

Personally, I would use Method 1, and if M > N/2, choose N-M values and then invert the selection (return the numbers that were not picked). For example, if N is 1000 and you want 950 of them, choose 50 values using Method 1, and then return the other 950.
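A minimal Python sketch of that idea follows. The function name is made up for illustration, and a set replaces the linear search through the result array, which is an assumption rather than part of the original Method 1:

```python
import random

def sample_with_complement(m, n, rng=random):
    """Return m distinct integers from range(n), in no particular random order."""
    if not 0 <= m <= n:
        raise ValueError("need 0 <= m <= n")
    invert = 2 * m > n              # more than half requested: pick the rejects instead
    target = n - m if invert else m
    picked = set()
    while len(picked) < target:     # Method 1: redraw until the value is new
        picked.add(rng.randrange(n))
    if invert:
        # Return everything that was NOT picked (this comes out in ascending order).
        return [x for x in range(n) if x not in picked]
    return list(picked)
```

The result is a uniformly random subset, but its order is not random; shuffle it afterwards if the order matters.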

Edit: Though, if consistent performance is your goal, I would use a modified method 2, which doesn't do the full shuffle, but only shuffles the first M elements of your N length array.

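#include <numeric>
#include <random>
#include <utility>
#include <vector>

std::vector<int> pick(int m, int n, std::mt19937& gen) {
    std::vector<int> arr(n);
    std::iota(arr.begin(), arr.end(), 0);   // arr[i] = i
    for (int i = 0; i < m; ++i) {
        // Pick j with 0 <= j < n-i; any uniform generator works here.
        int j = std::uniform_int_distribution<int>(0, n - i - 1)(gen);
        // j == 0 means don't move arr[i]; otherwise swap with the element j away
        std::swap(arr[i], arr[i + j]);
    }
    arr.resize(m);   // result = first m elements in arr
    return arr;
}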

Solution 4

Here's an algorithm that will work in O(n) memory and O(n) time (where n is the number of returned results, not the size of the set you're selecting from) for any result set. It's in Python for convenience because it uses a hashtable:

import random

def random_elements(num_elements, set_size):
    state = {}
    for i in range(num_elements):
        # Swap state[i] with a random element
        swap_with = random.randint(i, set_size - 1)
        state[i], state[swap_with] = state.get(swap_with, swap_with), state.get(i, i)
    return [state[i] for i in range(num_elements)]  # effectively state[:num_elements] if it were a list/array

This is just a partial Fisher-Yates shuffle, with the array being shuffled implemented as a sparse hashtable: any element that is not present is equal to its index. We shuffle the first num_elements indices and return those values. In the case that num_elements = 1, this is equivalent to picking a random number in the range, and in the case that num_elements = set_size, this is equivalent to a standard Fisher-Yates shuffle.

It's trivial to observe that this is O(n) time, and because each iteration of the loop initializes at most two new indices in the hashtable, it's O(n) space, too.

Solution 5

What about a third method?

int result[m];
for(i = 0; i < m; ++i)
{
   int r;
   r = rand()%(n-i);
   r += (number of items in result <= r)
   result[i] = r;   
}

Edit: the comparison should be <=, and it would actually need additional logic to avoid collisions.
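One way to supply that additional logic, sketched in Python under the assumption that the already chosen values are kept in a sorted list: treat r as a rank among the unused values and walk the sorted list, bumping r past every chosen value it reaches. The name sample_by_skipping is hypothetical, and this is one possible repair, not necessarily what the answer had in mind.

```python
import bisect
import random

def sample_by_skipping(m, n, rng=random):
    """Return m distinct integers from range(n) without ever redrawing."""
    if not 0 <= m <= n:
        raise ValueError("need 0 <= m <= n")
    chosen = []   # values picked so far, kept in ascending order
    result = []   # values in the order they were picked
    for i in range(m):
        r = rng.randrange(n - i)   # rank among the n - i unused values
        for v in chosen:           # ascending, so each skip is final
            if v <= r:
                r += 1             # step over a value that is already taken
            else:
                break
        bisect.insort(chosen, r)
        result.append(r)
    return result
```

Each step makes exactly one random draw, but the scan over chosen makes the whole routine O(m^2) in the worst case.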

This is better: an example using the modern version of the Fisher-Yates shuffle

//C++-ish pseudocode
int arr[n];
for(int i = 0; i < n; ++i)
    arr[i] = i;

for(int i = 0; i < m; ++i)
    swap(arr[n-1-i], arr[rand()%(n-i)]); // the last unfixed slot is n-1-i (n-i would be out of bounds at i = 0)

result = last m elements in arr;
Author by

Armen Tsirunyan

Updated on July 09, 2022

Comments

  • Armen Tsirunyan
    Armen Tsirunyan almost 2 years

    I have two methods of generating m distinct random numbers in the range [0..n-1]

    Method 1:

    //C++-ish pseudocode
    int result[m];
    for(i = 0; i < m; ++i)
    {
       int r;
       do
       {
          r = rand()%n;
       }while(r is found in result array at indices from 0 to i-1)
       result[i] = r;   
    }
    

    Method 2:

    //C++-ish pseudocode
    int arr[n];
    for(int i = 0; i < n; ++i)
        arr[i] = i;
    random_shuffle(arr, arr+n);
    result = first m elements in arr;
    

    The first method is more efficient when n is much larger than m, whereas the second is more efficient otherwise. But "much larger" isn't that strict a notion, is it? :)

    Question: What formula of n and m should I use to determine whether method1 or method2 will be more efficient? (in terms of mathematical expectation of the running time)