Pick N distinct items at random from sequence of unknown length, in only one iteration
Solution 1
Use reservoir sampling. It's a very simple algorithm that works for any N
.
Here is one Python implementation, and here is another.
Solution 2
If your sequence is short enough that reading it into memory and randomly sorting it is acceptable, then a straightforward approach would be to just use random.shuffle
:
import random
arr=[1,2,3,4]
# In-place shuffle
random.shuffle(arr)
# Take the first 2 elements of the now randomized array
print arr[0:2]
[1, 3]
Depending upon the type of your sequence, you may need to convert it to a list by calling list(your_sequence)
on it, but this will work regardless of the types of the objects in your sequence.
Naturally, if you can't fit your sequence into memory or the memory or CPU requirements of this approach are too high for you, you will need to use a different solution.
Solution 3
Simplest I've found is this answer in SO, improved a bit below:
import random
my_list = [1, 2, 3, 4, 5]
how_big = 2
new_list = random.sample(my_list, how_big)
# To preserve the order of the list, you could do:
randIndex = random.sample(range(len(my_list)), how_big)
randIndex.sort()
new_list = [my_list[i] for i in randIndex]
Solution 4
If you have python version of 3.6+ you can use choices
from random import choices
items = range(1, 10)
new_items = choices(items, k = 3)
print(new_items)
[6, 3, 1]
Solution 5
Following will give you N random items from an array X
import random
list(map(lambda _: random.choice(X), range(N)))
akonsu
Updated on June 20, 2021Comments
-
akonsu almost 3 years
I am trying to write an algorithm that would pick N distinct items from an sequence at random, without knowing the size of the sequence in advance, and where it is expensive to iterate over the sequence more than once. For example, the elements of the sequence might be the lines of a huge file.
I have found a solution when N=1 (that is, "pick exactly one element at random from a huge sequence"):
import random items = range(1, 10) # Imagine this is a huge sequence of unknown length count = 1 selected = None for item in items: if random.random() * count < 1: selected = item count += 1
But how can I achieve the same thing for other values of N (say, N=3)?
-
Jackson Tale about 9 yearsThe array's size is unknown or not possible to know, and it might be huge. For example, randomly choosing n elements from a 100G stream.
-
Mark Amery almost 9 yearsI like this - it's trivial to see that it works, since you're just generating a random number for every entry in the sequence and taking the top k. Reservoir sampling, on the other hand, looks at first glance like it probably works but it takes a little thought and calculation to prove it.
-
panda-34 about 8 yearsA quick check of
Counter(itertools.chain.from_iterable(sample(iter(range(100)), 5) for x in range(100000)))
shows heavy and consistent bias towards the beginning of the range -
panda-34 about 8 yearsAnd the culprit is using of
randint
instead ofrandrange
-
JesseBuesking about 8 years@panda-34 thanks for the heads up! I updated the answer based on your comments to address the issue.
-
Mark Amery over 7 yearsUsing
randrange
here would be neater than explicitly subtracting 1 from the end of the range just to makerandint
behave likerandrange
. -
Gábor Nagy over 7 yearsThis will not give distinct elements: >>> x = ["a", "b", "c", "d", "e", "f", "g", "h", "i"] >>> list(map(lambda _: random.choice(x), range(3))) ['c', 'a', 'a']
-
Dan about 6 yearsGreat answer, but only available in 3.6+.
-
matanster almost 6 yearsThis should be the top answer I guess
-
Pawan Mishra over 5 yearsBut this may have repetition!
-
Nick K9 about 4 yearsUnlike the other answers, this also works if
k > len(items)
. Just what I needed, thanks! -
Catbuilts over 3 yearsThis should be the selected answer
-
Mehmet Burak Sayıcı over 3 yearsElegant solution