Pick N distinct items at random from sequence of unknown length, in only one iteration

76,180

Solution 1

Use reservoir sampling. It's a very simple algorithm that works for any N.

Here is one Python implementation, and here is another.

Solution 2

If your sequence is short enough that reading it into memory and randomly sorting it is acceptable, then a straightforward approach would be to just use random.shuffle:

import random
arr=[1,2,3,4]

# In-place shuffle
random.shuffle(arr)

# Take the first 2 elements of the now randomized array
print arr[0:2]
[1, 3]

Depending upon the type of your sequence, you may need to convert it to a list by calling list(your_sequence) on it, but this will work regardless of the types of the objects in your sequence.

Naturally, if you can't fit your sequence into memory or the memory or CPU requirements of this approach are too high for you, you will need to use a different solution.

Solution 3

Simplest I've found is this answer in SO, improved a bit below:

import random

my_list = [1, 2, 3, 4, 5]
how_big = 2

new_list = random.sample(my_list, how_big)

# To preserve the order of the list, you could do:
randIndex = random.sample(range(len(my_list)), how_big)
randIndex.sort()
new_list = [my_list[i] for i in randIndex]

Solution 4

If you have python version of 3.6+ you can use choices

from random import choices

items = range(1, 10)
new_items = choices(items, k = 3)

print(new_items) 
[6, 3, 1]

Solution 5

Following will give you N random items from an array X

import random
list(map(lambda _: random.choice(X), range(N)))
Share:
76,180
akonsu
Author by

akonsu

Updated on June 20, 2021

Comments

  • akonsu
    akonsu almost 3 years

    I am trying to write an algorithm that would pick N distinct items from an sequence at random, without knowing the size of the sequence in advance, and where it is expensive to iterate over the sequence more than once. For example, the elements of the sequence might be the lines of a huge file.

    I have found a solution when N=1 (that is, "pick exactly one element at random from a huge sequence"):

    import random
    items = range(1, 10) # Imagine this is a huge sequence of unknown length
    count = 1
    selected = None
    for item in items:
        if random.random() * count < 1:
            selected = item
        count += 1
    

    But how can I achieve the same thing for other values of N (say, N=3)?

  • Jackson Tale
    Jackson Tale about 9 years
    The array's size is unknown or not possible to know, and it might be huge. For example, randomly choosing n elements from a 100G stream.
  • Mark Amery
    Mark Amery almost 9 years
    I like this - it's trivial to see that it works, since you're just generating a random number for every entry in the sequence and taking the top k. Reservoir sampling, on the other hand, looks at first glance like it probably works but it takes a little thought and calculation to prove it.
  • panda-34
    panda-34 about 8 years
    A quick check of Counter(itertools.chain.from_iterable(sample(iter(range(100)‌​), 5) for x in range(100000))) shows heavy and consistent bias towards the beginning of the range
  • panda-34
    panda-34 about 8 years
    And the culprit is using of randint instead of randrange
  • JesseBuesking
    JesseBuesking about 8 years
    @panda-34 thanks for the heads up! I updated the answer based on your comments to address the issue.
  • Mark Amery
    Mark Amery over 7 years
    Using randrange here would be neater than explicitly subtracting 1 from the end of the range just to make randint behave like randrange.
  • Gábor Nagy
    Gábor Nagy over 7 years
    This will not give distinct elements: >>> x = ["a", "b", "c", "d", "e", "f", "g", "h", "i"] >>> list(map(lambda _: random.choice(x), range(3))) ['c', 'a', 'a']
  • Dan
    Dan about 6 years
    Great answer, but only available in 3.6+.
  • matanster
    matanster almost 6 years
    This should be the top answer I guess
  • Pawan Mishra
    Pawan Mishra over 5 years
    But this may have repetition!
  • Nick K9
    Nick K9 about 4 years
    Unlike the other answers, this also works if k > len(items). Just what I needed, thanks!
  • Catbuilts
    Catbuilts over 3 years
    This should be the selected answer
  • Mehmet Burak Sayıcı
    Mehmet Burak Sayıcı over 3 years
    Elegant solution