best way to pick a random subset from a collection?


Solution 1

Jon Bentley discusses this in either 'Programming Pearls' or 'More Programming Pearls'. You need to be careful with your N of M selection process, but I think the code shown works correctly. Rather than shuffling all M items, you can run the shuffle over only the first N positions, which is a useful saving when N << M.

Knuth also discusses these algorithms. I believe that would be Vol 2, "Seminumerical Algorithms" (the section on random sampling and shuffling), but my set is packed pending a house move, so I can't formally check that.

Solution 2

@Jonathan,

I believe this is the solution you're talking about:

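void genknuth(int m, int n)
{
    // Prints m distinct integers from [0, n) in increasing order.
    // bigrand() is the book's helper returning a large non-negative random integer.
    for (int i = 0; i < n; i++) {
        /* select m of remaining n-i */
        if ((bigrand() % (n-i)) < m) {
            cout << i << "\n";
            m--;
        }
    }
}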

It's on page 127 of Programming Pearls by Jon Bentley and is based on Knuth's implementation.
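Since the question is about Java, here is a minimal sketch of the same selection-sampling idea in Java. The class and method names (`SelectionSampling`, `sampleIndices`) are made up for illustration and are not from the book; it returns the chosen indices instead of printing them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not from Programming Pearls: a Java sketch of selection sampling.
final class SelectionSampling {

    /** Returns m distinct indices from [0, n), in increasing order. */
    static List<Integer> sampleIndices(int m, int n, Random rng) {
        if (m < 0 || m > n) {
            throw new IllegalArgumentException("need 0 <= m <= n");
        }
        List<Integer> chosen = new ArrayList<>(m);
        int remaining = m;
        for (int i = 0; i < n && remaining > 0; i++) {
            // Keep index i with probability remaining / (n - i).
            if (rng.nextInt(n - i) < remaining) {
                chosen.add(i);
                remaining--;
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        // Pick 5 of 100 indices; the output differs from run to run.
        System.out.println(sampleIndices(5, 100, new Random()));
    }
}
```

This still walks the index range once, so it is O(n) in the worst case, but it needs no extra array and never touches the original collection.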

EDIT: I just saw a further modification on page 129:

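void genshuf(int m, int n)
{
    // randint(l, u) is the book's helper returning a random integer in [l, u].
    int i, j;
    int *x = new int[n];
    for (i = 0; i < n; i++)
        x[i] = i;
    // Shuffle only the first m positions.
    for (i = 0; i < m; i++) {
        j = randint(i, n-1);
        int t = x[i]; x[i] = x[j]; x[j] = t;
    }
    sort(x, x+m);
    for (i = 0; i < m; i++)
        cout << x[i] << "\n";
    delete[] x;  // release the scratch array
}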

This is based on the idea that "...we need shuffle only the first m elements of the array..."

Solution 3

If you're trying to select k distinct elements from a list of n, the methods you gave above will be O(n) or O(kn), because removing an element from a Vector will cause an arraycopy to shift all the elements down.

Since you're asking for the best way, it depends on what you are allowed to do with your input list.

If it's acceptable to modify the input list, as in your examples, then you can simply swap k random elements to the beginning of the list and return them in O(k) time like this:

public static <T> List<T> getRandomSubList(List<T> input, int subsetSize)
{
    Random r = new Random();
    int inputSize = input.size();
    for (int i = 0; i < subsetSize; i++)
    {
        int indexToSwap = i + r.nextInt(inputSize - i);
        T temp = input.get(i);
        input.set(i, input.get(indexToSwap));
        input.set(indexToSwap, temp);
    }
    return input.subList(0, subsetSize);
}

If the list must end up in the same state it began, you can keep track of the positions you swapped, and then return the list to its original state after copying your selected sublist. This is still an O(k) solution.
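A minimal sketch of that restore-afterwards variant, assuming a random-access list such as an ArrayList or Vector. The names `RestoringSubset` and `randomSubsetRestoring` are invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper: partial shuffle that puts the list back the way it was.
final class RestoringSubset {

    static <T> List<T> randomSubsetRestoring(List<T> input, int k, Random rng) {
        if (k < 0 || k > input.size()) {
            throw new IllegalArgumentException("need 0 <= k <= input.size()");
        }
        int[] swappedWith = new int[k];
        for (int i = 0; i < k; i++) {
            // Remember which position was swapped into slot i.
            swappedWith[i] = i + rng.nextInt(input.size() - i);
            Collections.swap(input, i, swappedWith[i]);
        }
        // Copy the selection before undoing the swaps.
        List<T> result = new ArrayList<>(input.subList(0, k));
        // Undo the swaps in reverse order to restore the original arrangement.
        for (int i = k - 1; i >= 0; i--) {
            Collections.swap(input, i, swappedWith[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> items = new ArrayList<>(Arrays.asList("a", "b", "c", "d", "e", "f"));
        System.out.println(randomSubsetRestoring(items, 3, new Random()));
        System.out.println(items); // same order as before the call
    }
}
```

The swaps have to be undone in reverse order, because a later swap may have moved an element that an earlier swap placed.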

If, however, you cannot modify the input list at all and k is much less than n (like 5 from 100), it is much better not to remove selected elements each time. Instead, pick a random element, and if it duplicates an earlier pick, toss it out and pick again. This gives you an expected O(kn / (n-k)), which is still close to O(k) when n dominates k. (For example, if k is less than n / 2, it reduces to O(k).)
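A minimal sketch of the pick-and-retry approach, using a HashSet of already-chosen indices to detect duplicates. `RejectionSubset` and `randomSubset` are names invented for this example, and it assumes `input.get(i)` is cheap (a random-access list).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical helper: selects k distinct elements without modifying the input.
final class RejectionSubset {

    static <T> List<T> randomSubset(List<T> input, int k, Random rng) {
        if (k < 0 || k > input.size()) {
            throw new IllegalArgumentException("need 0 <= k <= input.size()");
        }
        Set<Integer> seen = new HashSet<>();
        List<T> result = new ArrayList<>(k);
        while (result.size() < k) {
            int index = rng.nextInt(input.size());
            // add() returns false for an index already picked; just draw again.
            if (seen.add(index)) {
                result.add(input.get(index));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> items = Arrays.asList(10, 20, 30, 40, 50, 60, 70, 80);
        System.out.println(randomSubset(items, 3, new Random()));
    }
}
```

When i distinct indices have already been chosen, the next draw succeeds with probability (n-i)/n, so the expected number of draws is the sum of n/(n-i) for i from 0 to k-1. Each term is at most n/(n-k+1), which is where the kn/(n-k) estimate comes from.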

If k is not dominated by n, and you cannot modify the list, you might as well copy your original list and use your first solution, because O(n) will be just as good as O(k).

As others have noted, if you are depending on strong randomness where every sublist is possible (and unbiased), you'll definitely need something stronger than java.util.Random. See java.security.SecureRandom.
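Because java.security.SecureRandom extends java.util.Random, it can be passed anywhere a Random is expected. The sketch below combines that with the copy-then-partial-shuffle idea for the case where the input must not change; `SecureSubset` and `randomSubset` are names invented for this example.

```java
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper: copies the input, then partially shuffles the copy.
final class SecureSubset {

    static <T> List<T> randomSubset(List<T> input, int k, Random rng) {
        if (k < 0 || k > input.size()) {
            throw new IllegalArgumentException("need 0 <= k <= input.size()");
        }
        List<T> copy = new ArrayList<>(input); // O(n) copy keeps the input untouched
        for (int i = 0; i < k; i++) {
            // Swap a uniformly chosen element from positions [i, n) into slot i.
            Collections.swap(copy, i, i + rng.nextInt(copy.size() - i));
        }
        return new ArrayList<>(copy.subList(0, k));
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("a", "b", "c", "d", "e", "f", "g");
        // SecureRandom is a drop-in replacement because it subclasses Random.
        System.out.println(randomSubset(items, 3, new SecureRandom()));
    }
}
```

SecureRandom is typically slower than Random, so it is worth reaching for only when the quality of the randomness actually matters.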

Solution 4

I wrote an efficient implementation of this a few weeks back. It's in C# but the translation to Java is trivial (essentially the same code). The plus side is that it's also completely unbiased (which some of the existing answers aren't) - a way to test that is here.

It's based on a Durstenfeld implementation of the Fisher-Yates shuffle.

Solution 5

Your second solution of using Random to pick element seems sound, however:

Author by

SimonC

Updated on July 05, 2022

Comments

  • SimonC
    SimonC almost 2 years

    I have a set of objects in a Vector from which I'd like to select a random subset (e.g. 100 items coming back; pick 5 randomly). In my first (very hasty) pass I did an extremely simple and perhaps overly clever solution:

    Vector itemsVector = getItems();
    
    Collections.shuffle(itemsVector);
    itemsVector.setSize(5);
    

    While this has the advantage of being nice and simple, I suspect it's not going to scale very well, i.e. Collections.shuffle() must be O(n) at least. My less clever alternative is

    Vector itemsVector = getItems();
    
    Random rand = new Random(System.currentTimeMillis()); // would make this static to the class    
    
    List subsetList = new ArrayList(5);
    for (int i = 0; i < 5; i++) {
         // be sure to use Vector.remove() or you may get the same item twice
         subsetList.add(itemsVector.remove(rand.nextInt(itemsVector.size())));
    }
    

    Any suggestions on better ways to draw out a random subset from a Collection?

  • SimonC
    SimonC over 15 years
    Thanks for the tip on using a better seed; I'll check out the link you've posted. Completely agree about using ArrayList vs. Vector; however, this is a 3rd-party library returning the Vector and I have no control over the datatype being returned. Thanks!
  • Jonathan Leffler
    Jonathan Leffler over 15 years
    O(5N) === O(N); that's the point of big-O notation. However, when you have two methods, both of O(N), then the constant multiplier and the constant addition terms become significant (and any relevant sub-linear terms).
  • Alexander
    Alexander over 15 years
    +1 for beating me to the answer. I was also writing about performing the random shuffle for the first five steps: choose random number from 1 to M, swap the first element with the element at that index, choose a random number from 2 to M, swap second element, and so forth.
  • Pyrolistical
    Pyrolistical over 15 years
    LOL, I need to fix my shuffle code now...I was using System.nanoTime() as my seed as well! Thanks for the great article.
  • qualidafial
    qualidafial over 15 years
    Great article. One takeaway that I think can be used to improve the code in the original question is to swap elements instead of removing them. This saves the performance penalty from having to collapse the list when the element is removed.
  • Dave L.
    Dave L. over 15 years
    It's sound, but not the best way to do it. It is slower than it needs to be.
  • Dave L.
    Dave L. over 15 years
    These are decent, but not the best way. It can be done in O(k).
  • Tyler
    Tyler over 15 years
    These don't mess with the original array. I haven't seen any solutions that do as well without manipulating the original array.
  • Dave L.
    Dave L. over 15 years
    I've added such a solution above. So long as k is considerably less than n, you're better off just selecting random elements from the list, and throwing out dupes until you get k.
  • Tyler
    Tyler over 15 years
    That is a practically useful algorithm esp if you use a hash set to check for collisions quickly. But from theoretical analysis, the worst-case is actually O(infinity) because you have no guaranteed limit on # of collisions; a nonhashed version still takes O(log k) per collision check=k log k total.
  • Dave L.
    Dave L. over 15 years
    Indeed, you clearly should use a hashed set to check for collisions. Since we're dealing with a randomized algorithm, it's important to analyze the complexity for the worst case over the input, but the expected case over the random values.
  • SimonC
    SimonC over 15 years
    Thanks to everybody for providing all the great info. While they all had great things to add, I'm picking this because it's probably the way I'll refactor the code:
    * set i = 0
    * grab random element r from i to n
    * swap element @ i with element @ r
    * i++
    * repeat until I've got the ones I want
  • Jean-Philippe Pellet
    Jean-Philippe Pellet about 13 years
    It would be great if it didn't have a probabilistic run-time, which increases a lot when n gets closer to the size of the collection…
  • Russia Must Remove Putin
    Russia Must Remove Putin almost 7 years
    This is close to a link-only answer - could someone please update this with the relevant code?
  • Antoine
    Antoine over 5 years
    The link is broken.
  • Jayrassic
    Jayrassic about 4 years
    Thanks to the power of the WaybackMachine, you can find a copy here