Best way to pick a random subset from a collection?

java algorithm collections random subset

37,166

Solution 1

Jon Bentley discusses this in either 'Programming Pearls' or 'More Programming Pearls'. You need to be careful with your N of M selection process, but I think the code shown works correctly. Rather than shuffling all M items, you can shuffle only the first N positions, which is a useful saving when N << M.

Knuth also discusses these algorithms. I believe that would be in Vol 2 "Seminumerical Algorithms" (the section on random sampling and shuffling) rather than Vol 3 "Sorting and Searching", but my set is packed pending a move of house so I can't formally check that.

Solution 2

@Jonathan,

I believe this is the solution you're talking about:

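/* Prints m distinct integers from 0..n-1 in increasing order.
   bigrand() is the book's helper and is assumed to return a large nonnegative random integer. */
void genknuth(int m, int n)
{    for (int i = 0; i < n; i++)
         /* select m of remaining n-i */
         if ((bigrand() % (n-i)) < m) {
             cout << i << "\n";
             m--;
         }
}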

It's on page 127 of Programming Pearls by Jon Bentley and is based on Knuth's implementation.
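Since the question is about Java collections, the same selection-sampling idea could be sketched roughly as follows. This is only an illustration under assumptions: the class name `SelectionSampling` and the method `sample` are hypothetical, and `java.util.Random.nextInt(bound)` stands in for the book's `bigrand()`. Each item is kept with probability (items still needed) / (items still remaining), so the input is read once and never modified.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper class, not from the book or the original answer.
class SelectionSampling {
    // Picks m items from input in a single pass, preserving their original order.
    static <T> List<T> sample(List<T> input, int m, Random rnd) {
        int n = input.size();
        if (m < 0 || m > n) {
            throw new IllegalArgumentException("need 0 <= m <= input.size()");
        }
        List<T> out = new ArrayList<>(m);
        int i = 0; // number of items examined so far
        for (T item : input) {
            if (m == 0) {
                break; // everything we needed has been selected
            }
            // Keep this item with probability m / (n - i).
            if (rnd.nextInt(n - i) < m) {
                out.add(item);
                m--;
            }
            i++;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int v = 0; v < 100; v++) {
            items.add(v);
        }
        System.out.println(sample(items, 5, new Random()));
    }
}
```

Note that this is O(n) regardless of m, so it mainly helps when the input must stay untouched or can only be iterated.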

EDIT: I just saw a further modification on page 129:

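/* randint(l, u) is the book's helper and is assumed to return a random integer in [l, u]. */
void genshuf(int m, int n)
{    int i,j;
     int *x = new int[n];
     for (i = 0; i < n; i++)
         x[i] = i;
     /* shuffle only the first m positions */
     for (i = 0; i < m; i++) {
         j = randint(i, n-1);
         int t = x[i]; x[i] = x[j]; x[j] = t;
     }
     sort(x, x+m);
     for (i = 0; i< m; i++)
         cout << x[i] << "\n";
     delete[] x;   /* release the array, which the listing above never freed */
}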

This is based on the idea that "...we need shuffle only the first m elements of the array..."

Solution 3

If you're trying to select k distinct elements from a list of n, the methods you gave above will be O(n) (the full shuffle) or O(kn) (repeated removal), because removing an element from a Vector will cause an arraycopy to shift all the elements down.

Since you're asking for the best way, it depends on what you are allowed to do with your input list.

If it's acceptable to modify the input list, as in your examples, then you can simply swap k random elements to the beginning of the list and return them in O(k) time like this:

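import java.util.List;
import java.util.Random;

// Moves subsetSize randomly chosen elements to the front of input (a partial
// Fisher-Yates shuffle) and returns a view of them. Modifies input.
public static <T> List<T> getRandomSubList(List<T> input, int subsetSize)
{
    Random r = new Random();
    int inputSize = input.size();
    if (subsetSize < 0 || subsetSize > inputSize)
        throw new IllegalArgumentException("subsetSize must be between 0 and input.size()");
    for (int i = 0; i < subsetSize; i++)
    {
        // Pick uniformly from the not-yet-selected tail [i, inputSize)
        int indexToSwap = i + r.nextInt(inputSize - i);
        T temp = input.get(i);
        input.set(i, input.get(indexToSwap));
        input.set(indexToSwap, temp);
    }
    return input.subList(0, subsetSize);
}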

If the list must end up in the same state it began, you can keep track of the positions you swapped, and then return the list to its original state after copying your selected sublist. This is still an O(k) solution.
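A minimal sketch of that swap-and-restore idea, assuming a random-access list such as ArrayList or Vector (so that each swap is constant time). The class name `RestoringSubset` and the method `sample` are hypothetical names chosen for illustration. The swap partners are recorded in an array and the swaps are undone in reverse order after the selection has been copied out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper class illustrating "swap, copy, then undo".
class RestoringSubset {
    static <T> List<T> sample(List<T> input, int k, Random rnd) {
        int n = input.size();
        if (k < 0 || k > n) {
            throw new IllegalArgumentException("need 0 <= k <= input.size()");
        }
        int[] swappedWith = new int[k]; // partner index used at each step
        for (int i = 0; i < k; i++) {
            int j = i + rnd.nextInt(n - i);
            swappedWith[i] = j;
            Collections.swap(input, i, j);
        }
        // Copy the selection before undoing the swaps.
        List<T> result = new ArrayList<>(input.subList(0, k));
        // Undo in reverse order so the list returns to its original state.
        for (int i = k - 1; i >= 0; i--) {
            Collections.swap(input, i, swappedWith[i]);
        }
        return result;
    }
}
```

The list is temporarily modified during the call, so this sketch is not safe if other threads read the list at the same time.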

If, however, you cannot modify the input list at all and k is much less than n (like 5 from 100), it is much better not to remove selected elements each time. Instead, pick a random element, and if you ever get a duplicate, toss it out and reselect. This gives an expected running time of O(kn / (n-k)), which is still close to O(k) when n dominates k. (For example, if k is less than n / 2, then it reduces to O(k).)
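The pick-and-retry approach might look like the sketch below. It assumes a random-access list and tracks the indices already chosen in a HashSet so that the duplicate check takes expected constant time. The class name `RejectionSubset` and the method `sample` are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical helper class: sample by index and retry on duplicates.
class RejectionSubset {
    static <T> List<T> sample(List<T> input, int k, Random rnd) {
        int n = input.size();
        if (k < 0 || k > n) {
            throw new IllegalArgumentException("need 0 <= k <= input.size()");
        }
        Set<Integer> chosen = new HashSet<>(); // indices picked so far
        List<T> result = new ArrayList<>(k);
        while (result.size() < k) {
            int idx = rnd.nextInt(n);
            // add() returns false if idx was already picked, so we just retry.
            if (chosen.add(idx)) {
                result.add(input.get(idx));
            }
        }
        return result;
    }
}
```

The input is never written to, but the number of retries grows quickly as k approaches n, so this only makes sense when k is small relative to n.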

If k is not dominated by n and you cannot modify the list, you might as well copy your original list and use your first solution, because O(n) will be just as good as O(k).

As others have noted, if you are depending on strong randomness where every sublist is possible (and unbiased), you'll definitely need something stronger than java.util.Random. See java.security.SecureRandom.
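Because java.security.SecureRandom extends java.util.Random, it can be dropped into the same partial-shuffle loop. The sketch below is one way to do that while leaving the caller's list untouched by working on a copy. The class name `SecureSubset` and the method `sample` are hypothetical.

```java
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper class: partial shuffle of a copy, driven by SecureRandom.
class SecureSubset {
    static <T> List<T> sample(List<T> input, int k) {
        int n = input.size();
        if (k < 0 || k > n) {
            throw new IllegalArgumentException("need 0 <= k <= input.size()");
        }
        SecureRandom rnd = new SecureRandom(); // self-seeding, no explicit seed needed
        List<T> copy = new ArrayList<>(input); // O(n) copy keeps the input intact
        for (int i = 0; i < k; i++) {
            Collections.swap(copy, i, i + rnd.nextInt(n - i));
        }
        return new ArrayList<>(copy.subList(0, k));
    }
}
```

SecureRandom is typically slower than java.util.Random, so it is probably only worth it when the selection has to be unpredictable.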

Solution 4

I wrote an efficient implementation of this a few weeks back. It's in C# but the translation to Java is trivial (essentially the same code). The plus side is that it's also completely unbiased (which some of the existing answers aren't) - a way to test that is here.

It's based on a Durstenfeld implementation of the Fisher-Yates shuffle.

Solution 5

Your second solution of using Random to pick elements seems sound, however:

Depending on how sensitive your data is, I suggest using some sort of hashing method to scramble the random number seed. For a good case study, see How We Learned to Cheat at Online Poker (but this link is 404 as of 2015-12-18). Alternative URLs (found via a Google search on the article title in double quotes) include:
- How We Learned to Cheat at Online Poker — apparently the original publisher.
- How We Learned to Cheat at Online Poker
- How We Learned to Cheat at Online Poker

Vector is synchronized. If possible, use ArrayList instead to improve performance.


Author by

SimonC

Updated on July 05, 2022

Comments

SimonC almost 2 years
I have a set of objects in a Vector from which I'd like to select a random subset (e.g. 100 items coming back; pick 5 randomly). In my first (very hasty) pass I did an extremely simple and perhaps overly clever solution:
```
Vector itemsVector = getItems();

Collections.shuffle(itemsVector);
itemsVector.setSize(5);
```
While this has the advantage of being nice and simple, I suspect it's not going to scale very well, i.e. Collections.shuffle() must be O(n) at least. My less clever alternative is
```
Vector itemsVector = getItems();

Random rand = new Random(System.currentTimeMillis()); // would make this static to the class    

List subsetList = new ArrayList(5);
for (int i = 0; i < 5; i++) {
     // be sure to use Vector.remove() or you may get the same item twice
     subsetList.add(itemsVector.remove(rand.nextInt(itemsVector.size())));
}
```
Any suggestions on better ways to draw out a random subset from a Collection?
SimonC over 15 years

Thanks for the tip on using a better seed; I'll check out the link you've posted. Completely agree about using ArrayList vs. Vector; however, this is a 3rd-party library returning the Vector and I have no control over the datatype being returned. Thanks!
Jonathan Leffler over 15 years

O(5N) === O(N); that's the point of big-O notation. However, when you have two methods, both of O(N), then the constant multiplier and the constant addition terms become significant (and any relevant sub-linear terms).
Alexander over 15 years

+1 for beating me to the answer. I was also writing about performing the random shuffle for the first five steps: choose random number from 1 to M, swap the first element with the element at that index, choose a random number from 2 to M, swap second element, and so forth.
Pyrolistical over 15 years

LOL, I need to fix my shuffle code now...I was using System.nanoTime() as my seed as well! Thanks for the great article.
qualidafial over 15 years

Great article. One takeaway that I think can be used to improve the code in the original question is to swap elements instead of removing them. This saves the performance penalty from having to collapse the list when the element is removed.
Dave L. over 15 years

It's sound, but not the best way to do it. It is slower than it needs to be.
Dave L. over 15 years

These are decent, but not the best way. It can be done in O(k).
Tyler over 15 years

These don't mess with the original array. I haven't seen any solutions that do as well without manipulating the original array.
Dave L. over 15 years

I've added such a solution above. So long as k is considerably less than n, you're better off just selecting random elements from the list, and throwing out dupes until you get k.
Tyler over 15 years

That is a practically useful algorithm, especially if you use a hash set to check for collisions quickly. But from theoretical analysis, the worst case is actually O(infinity) because you have no guaranteed limit on the number of collisions; a non-hashed version still takes O(log k) per collision check, or O(k log k) in total.
Dave L. over 15 years

Indeed, you clearly should use a hashed set to check for collisions. Since we're dealing with a randomized algorithm, it's important to analyze the complexity for the worst case over the input, but the expected case over the random values.
SimonC over 15 years

Thanks to everybody for providing all the great info. While they all had great things to add, I'm picking this because it's probably the way I'll refactor the code:
- set i = 0
- grab random element r from i to n
- swap element @ i with element @ r
- i++
- repeat until I've got the ones I want
Jean-Philippe Pellet about 13 years

It would be great if it didn't have a probabilistic run-time, which increases a lot when n gets closer to the size of the collection…
Russia Must Remove Putin almost 7 years

This is close to a link-only answer - could someone please update this with the relevant code?
Antoine over 5 years

The link is broken.
Jayrassic about 4 years

Thanks to the power of the WaybackMachine, you can find a copy here