Select k random elements from a list whose elements have weights


Solution 1

I know this is a very old question, but I think there's a neat trick to do this in O(n) time if you apply a little math!

The exponential distribution has two very useful properties.

  1. Given n samples from different exponential distributions with different rate parameters, the probability that a given sample is the minimum is equal to its rate parameter divided by the sum of all rate parameters.

  2. It is "memoryless". So if you already know the minimum, then the probability that any of the remaining elements is the 2nd-to-min is the same as the probability that if the true min were removed (and never generated), that element would have been the new min. This seems obvious, but I think because of some conditional probability issues, it might not be true of other distributions.

Using fact 1, we know that choosing a single element can be done by generating these exponential distribution samples with rate parameter equal to the weight, and then choosing the one with minimum value.

Using fact 2, we know that we don't have to re-generate the exponential samples. Instead, just generate one for each element, and take the k elements with lowest samples.

Finding the lowest k can be done in O(n): use the Quickselect algorithm to find the element with the k-th smallest sample, then take one more pass and output that element together with every element whose sample is smaller.

A useful note: if you don't have immediate access to a library that generates exponential distribution samples, you can generate one as -ln(rand())/weight (taking care that rand() never returns exactly 0).
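
Here is a minimal Python sketch of this approach (the function name is mine). For brevity it uses heapq.nsmallest, which is O(n log k); swapping in Quickselect as described above brings it down to O(n). The 1 - random() keeps the logarithm away from zero, since random() can return exactly 0 but never 1.

import heapq
import math
import random

def weighted_sample_k(items, k):
    """Pick k distinct values from (weight, value) pairs, weighted."""
    keyed = []
    for w, v in items:
        # Exponential sample with rate = weight; 1 - random() is in (0, 1]
        keyed.append((-math.log(1.0 - random.random()) / w, v))
    # The k smallest exponential samples mark the selected items
    return [v for _, v in heapq.nsmallest(k, keyed, key=lambda t: t[0])]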

Solution 2

If the sampling is with replacement, you can use this algorithm (implemented here in Python):

import random

items = [(10, "low"),
         (100, "mid"),
         (890, "large")]

def weighted_sample(items, n):
    total = float(sum(w for w, v in items))
    i = 0
    w, v = items[0]
    while n:
        x = total * (1 - random.random() ** (1.0 / n))
        total -= x
        while x > w:
            x -= w
            i += 1
            w, v = items[i]
        w -= x
        yield v
        n -= 1

This is O(n + m), where n is the number of samples and m is the number of items.
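
As a quick sanity check (my own test, not part of the original answer), the output frequencies over many runs should come out roughly proportional to the weights 10 : 100 : 890:

from collections import Counter

counts = Counter()
for _ in range(10000):
    counts.update(weighted_sample(items, 3))
print(counts)  # expect roughly 300 'low', 3000 'mid', 26700 'large'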

Why does this work? It is based on the following algorithm:

def n_random_numbers_decreasing(v, n):
    """Like reversed(sorted(v * random() for i in range(n))),
    but faster because we avoid sorting."""
    while n:
        v *= random.random() ** (1.0 / n)
        yield v
        n -= 1

The function weighted_sample is just this algorithm fused with a walk of the items list to pick out the items selected by those random numbers.

This in turn works because the probability that n random numbers 0..v will all happen to be less than z is P = (z/v)^n. Solve for z, and you get z = v * P^(1/n). Substituting a random number for P picks the largest number with the correct distribution; and we can just repeat the process to select all the other numbers.

If the sampling is without replacement, you can put all the items into a binary heap, where each node caches the total of the weights of all items in that subheap. Building the heap is O(m). Selecting a random item from the heap, respecting the weights, is O(log m). Removing that item and updating the cached totals is also O(log m). So you can pick n items in O(m + n log m) time.

(Note: "weight" here means that every time an element is selected, the remaining possibilities are chosen with probability proportional to their weights. It does not mean that elements appear in the output with a likelihood proportional to their weights.)

Here's an implementation of that, plentifully commented:

import random

class Node:
    # Each node in the heap has a weight, value, and total weight.
    # The total weight, self.tw, is self.w plus the weight of any children.
    __slots__ = ['w', 'v', 'tw']
    def __init__(self, w, v, tw):
        self.w, self.v, self.tw = w, v, tw

def rws_heap(items):
    # h is the heap. It's like a binary tree that lives in an array.
    # It has a Node for each pair in `items`. h[1] is the root. Each
    # other Node h[i] has a parent at h[i>>1]. Each node has up to 2
    # children, h[i<<1] and h[(i<<1)+1].  To get this nice simple
    # arithmetic, we have to leave h[0] vacant.
    h = [None]                          # leave h[0] vacant
    for w, v in items:
        h.append(Node(w, v, w))
    for i in range(len(h) - 1, 1, -1):  # total up the tws
        h[i>>1].tw += h[i].tw           # add h[i]'s total to its parent
    return h

def rws_heap_pop(h):
    gas = h[1].tw * random.random()     # start with a random amount of gas

    i = 1                     # start driving at the root
    while gas >= h[i].w:      # while we have enough gas to get past node i:
        gas -= h[i].w         #   drive past node i
        i <<= 1               #   move to first child
        if gas >= h[i].tw:    #   if we have enough gas:
            gas -= h[i].tw    #     drive past first child and descendants
            i += 1            #     move to second child
    w = h[i].w                # out of gas! h[i] is the selected node.
    v = h[i].v

    h[i].w = 0                # make sure this node isn't chosen again
    while i:                  # fix up total weights
        h[i].tw -= w
        i >>= 1
    return v

def random_weighted_sample_no_replacement(items, n):
    heap = rws_heap(items)              # just make a heap...
    for i in range(n):
        yield rws_heap_pop(heap)        # and pop n items off it.
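
A quick usage sketch (the example data is mine):

items = [(10, "low"), (100, "mid"), (890, "large")]
print(list(random_weighted_sample_no_replacement(items, 2)))
# e.g. ['large', 'mid'] -- no value repeats within a single call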

Solution 3

If the sampling is with replacement, use the roulette-wheel selection technique (often used in genetic algorithms):

  1. sort the weights
  2. compute the cumulative weights
  3. pick a random number in [0,1]*totalWeight
  4. find the interval into which this number falls
  5. select the element corresponding to that interval
  6. repeat k times

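Here is a minimal Python sketch of the with-replacement version (function and variable names are mine). As the comments below note, step 1's sort is not actually required; a binary search over the cumulative weights handles step 4 in O(log m) per draw:

import bisect
import random

def roulette_pick(items, weights, k):
    cumulative = []
    total = 0.0
    for w in weights:                    # step 2: cumulative weights
        total += w
        cumulative.append(total)
    picks = []
    for _ in range(k):                   # step 6: repeat k times
        r = random.random() * total      # step 3: random in [0, total)
        i = bisect.bisect_right(cumulative, r)  # step 4: find the interval
        picks.append(items[i])           # step 5: select that element
    return picks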

If the sampling is without replacement, you can adapt the above technique: remove the selected element from the list after each iteration, then re-normalize the weights so that they sum to 1 (a valid probability distribution).

Solution 4

I've done this in Ruby:

https://github.com/fl00r/pickup

require 'pickup'
pond = {
  "selmon"  => 1,
  "carp" => 4,
  "crucian"  => 3,
  "herring" => 6,
  "sturgeon" => 8,
  "gudgeon" => 10,
  "minnow" => 20
}
pickup = Pickup.new(pond, uniq: true)
pickup.pick(3)
#=> [ "gudgeon", "herring", "minnow" ]
pickup.pick
#=> "herring"
pickup.pick
#=> "gudgeon"
pickup.pick
#=> "sturgeon"

Solution 5

If you want to pick x elements from a weighted set without replacement such that elements are chosen with a probability proportional to their weights:

import random

def weighted_choose_subset(weighted_set, count):
    """Return a random sample of count elements from a weighted set.

    weighted_set should be a sequence of tuples of the form 
    (item, weight), for example:  [('a', 1), ('b', 2), ('c', 3)]

    Each element from weighted_set shows up at most once in the
    result, and the relative likelihood of two particular elements
    showing up is equal to the ratio of their weights.

    This works as follows:

    1.) Line up the items along the number line from [0, the sum
    of all weights) such that each item occupies a segment of
    length equal to its weight.

    2.) Randomly pick a number "start" in the range [0, total
    weight / count).

    3.) Find all the points "start + n/count" (for all integers n
    such that the point is within our segments) and yield the set
    containing the items marked by those points.

    Note that this implementation may not return each possible
    subset.  For example, with the input ([('a', 1), ('b', 1),
    ('c', 1), ('d', 1)], 2), it may only produce the sets ['a',
    'c'] and ['b', 'd'], but it will do so such that the weights
    are respected.

    This implementation only works for nonnegative integral
    weights.  The highest weight in the input set must be less
    than the total weight divided by the count; otherwise it would
    be impossible to respect the weights while never returning
    that element more than once per invocation.
    """
    if count == 0:
        return []

    total_weight = 0
    max_weight = 0
    borders = []
    for item, weight in weighted_set:
        if weight < 0:
            raise RuntimeError("All weights must be nonnegative integers")
        # Scale up weights so dividing total_weight / count doesn't truncate:
        weight *= count
        total_weight += weight
        borders.append(total_weight)
        max_weight = max(max_weight, weight)

    step = total_weight // count  # exact; the weights were scaled up above

    if max_weight > step:
        raise RuntimeError(
            "Each weight must not exceed total weight / count")

    next_stop = random.randint(0, step - 1)

    results = []
    current = 0
    for _ in range(count):
        while borders[current] <= next_stop:
            current += 1
        results.append(weighted_set[current][0])
        next_stop += step

    return results
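
A quick usage sketch: with the docstring's example input and count=2, 'c' occupies exactly one full step on the number line (its weight equals total weight / count), so it appears in every sample, and 'b' fills the other slot twice as often as 'a':

print(weighted_choose_subset([('a', 1), ('b', 2), ('c', 3)], 2))
# 'c' every time; the other element is 'b' with probability 2/3, 'a' with 1/3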

Comments

  • ldasilva
    ldasilva almost 2 years

    Selecting without any weights (equal probabilities) is beautifully described here.

    I was wondering if there is a way to convert this approach to a weighted one.

    I am also interested in other approaches as well.

    Update: Sampling without replacement

  • ldasilva
    ldasilva over 14 years
    Let's say I have 4 elements in the list with weights {2, 1, 1, 1}, and I am going to choose 3 from this list. According to your formula, for the first element: 3 * 2/5 = 1.2, which is > 1. What am I missing here?
  • Kyle Butt
    Kyle Butt over 14 years
    now with {2,1,1,1} choose 3, the probability of picking the first element is 1 - (1 - (3/5))/2 = 1 - (2/5)/2 = 1 - 1/5 = 4/5, as expected.
  • ldasilva
    ldasilva over 14 years
    I believe there is something wrong with your formula. I don't have time to write it up now; I will when I have time. If you apply the formula to the first two elements in different orders, you will see it doesn't produce the same results. It should produce the same results regardless of the order.
  • ldasilva
    ldasilva over 14 years
    say there are 4 elements with weights {7, 1, 1, 1} and we are going to pick 2. Lets calculate the chance of picking first 2 elements: P(1st)*P(2nd) = (1-(1-2/10)/7)*(1-(1-1/3)) = 31/105. Let's change the list to {1, 7, 1, 1}, probability of picking the first 2 elements should remain same P(1st)*P(2nd) = (1-(1-2/10))*(1-(1-1/9)/7) = 11/63. They are not same.
  • Oak
    Oak over 14 years
    The way I see it, the question is about selecting multiple items in a single pass (see the linked question). Your approach requires a pass for each selection.
  • Sparr
    Sparr over 14 years
    +1 for awesome use of Python control structure variations that I haven't seen before
  • Darius Bacon
    Darius Bacon over 14 years
    See my answer to another question for a Python implementation of the binary-tree method: stackoverflow.com/questions/526255/…
  • Jason Orendorff
    Jason Orendorff over 14 years
    Darius Bacon: Upvoted! While I was fiddling with this I found that you don't need a tree. It can be done with a heap. So I added an implementation of my own to this answer.
  • Jason Orendorff
    Jason Orendorff over 14 years
    +1, this wins big on clarity. But note that the roulette-wheel algorithm takes O(n log m + m) time, where n is the number of samples and m is the number of items. That's if you omit the sorting, which is unnecessary, and do a binary search in step 4. Also, it requires O(m) space for the cumulative weights. In my answer there's a 14-line function that does the same thing in O(n + m) time and O(1) space.
  • ldasilva
    ldasilva over 14 years
    If I have to remove the selected element, I need to copy the whole list (I am assuming we are not allowed to modify the input list), which is expensive.
  • Jason Orendorff
    Jason Orendorff over 14 years
    Even simpler, say there are 2 elements with weights {1/2, 1/2} and we are going to pick 2. The probability should be 1; we have to take both. But the formula gives 1 - (1 - (2/1))/(1/2) = 3.
  • Jason Orendorff
    Jason Orendorff over 14 years
    Unfortunately I don't see an easy way to simplify calculating the real probability, which is P(a, n) = if n == 0 then 0 else 1 - sum i=1..|a| of (weight(a[i]) / total_weight(a) * (1 - P(a omitting a[i], n - 1))).
  • Darius Bacon
    Darius Bacon about 14 years
    Nice idea with the heap! I found the name 'heap' misleading for what you did, though, since that's usually taken to mean a structure that obeys the heap invariant on the weights. Here the heap invariant does apply, but to the .tw field, and it's incidental to how your code works, if I understand it right.
  • Darius Bacon
    Darius Bacon about 14 years
    BTW it looks like you could avoid leaving a 0 in the array by moving the array's last entry into its place, updating the .tw's back up to the root from the old and new positions. This doesn't appear worth it for this static problem, but it would for a dynamic one like my post was about. I'll update that post to point here.
  • Jason Orendorff
    Jason Orendorff about 14 years
    Yes, I agree with both those comments. (I couldn't think of a better name for it than "heap", though. I guess it might be better call it a binary tree, and treat the array as an implementation detail. But that's confusing too!)
  • Darius Bacon
    Darius Bacon over 13 years
    There's a standard structure something like your heap known as the Fenwick tree: en.wikipedia.org/wiki/Fenwick_tree -- from a quick skim it doesn't sound the same, but I didn't really look into it.
  • Admin
    Admin over 13 years
    Do you need to sort the weights? Is it necessary?
  • Amro
    Amro over 13 years
    @Zenikoder: I believe you're right that sorting is not needed; it's just that it might make the search in step 4 easier...
  • Tim Gilbert
    Tim Gilbert about 12 years
    I am confused, if each node in the tree has one child at h[i>>1] and one at h[(i<<1)+1], why does rws_heap() only calculate the total weights for h[i>>1]? Isn't it only getting the total weights for the right-hand children?
  • Tim Gilbert
    Tim Gilbert about 12 years
    Oh, per my earlier comment, I now see that h[i<<1] is a node's child and h[i>>1] is its parent - I mistook the direction of the bit-shift. I'll add a note to the comment about that. (Edit: and now I see the information is already in there, I just missed it. Sorry!)
  • sw.
    sw. about 12 years
    Do you think that using a Fenwick tree can help here?
  • Pawel
    Pawel about 12 years
    The without-replacement method proposed here is not good. Google for "systematic sampling with unequal probabilities". There's an O(n) algorithm and no need to recompute weights.
  • chrtan
    chrtan almost 12 years
    Is there any way to implement the Sampling without Replacement to not go through the O(m) path of constructing the heap when using weighted random sampling multiple times? (So I select 5 items from the heap, and then I want to select another 5 items from the same heap.) Right now, I have code that has the original values of the heap in a dictionary by index, "repairing" all indexes that have been modified. However, I want to avoid this step entirely. I tried copy.deepcopy for the heap but it yields much poorer performance than my current "repair" code.
  • Jason Orendorff
    Jason Orendorff almost 12 years
    The "repair" strategy is probably the fastest. You can speed that up by storing the original values in each Node, rather than a separate dictionary.
  • chrtan
    chrtan almost 12 years
    @JasonOrendorff Ahh. Got so excited about this working that I completely forgot to thank you! So thank you Jason for the speed up and the sample code.
  • Gabriel
    Gabriel over 11 years
    I tried to port this solution to C# but didn't get the expected results. Even when picking a single card the distribution is not respected: stackoverflow.com/questions/11775946/…
  • Antonin
    Antonin over 11 years
    I implemented this for items being a set of objects. One attribute of the object is a weight. I modified "for w, v in items: h.append(Node(w, v, w))" into "for object in items: h.append(Node(object.weight, object, object.weight))". Is it the correct way to do it?
  • Antonin
    Antonin over 11 years
    I have an error: "if gas > h[i].tw:" / "IndexError: list index out of range". This error only happens when I'm using n=20, and doesn't appear for n=2... Any idea why?
  • krlmlr
    krlmlr about 11 years
    The "without replacement" strategy seems to be the same as suggested in the 1980 paper by Wong and Easton: An Efficient Method for Weighted Sampling without Replacement. Or did I miss something?
  • John
    John about 11 years
    JavaScript port (in case people want it): gist.github.com/seejohnrun/5291246
  • John
    John about 11 years
    Awesome answer btw - NOTE: you need to check the number of elements before selecting > items.length elements
  • ech
    ech over 9 years
    If you want to pick x elements from a weighted set without replacement such that elements are chosen with a probability proportional to their weights, this random_weighted_sample_no_replacement() is not correct. Consider an input of {'a': 1, 'b': 2, 'c': 3}, pick 2. In this case, 'c' should be as likely as 'a' and 'b' combined, which means 'c' must be chosen on every invocation of the function. But random_weighted_sample_no_replacement() won't return a 'c' every time.
  • Jason Orendorff
    Jason Orendorff over 9 years
    @ErikHaugen The way I interpreted the problem was that each time an element is selected, the remaining possibilities are chosen with probability proportional to their weights. Otherwise many inputs would be invalid: for example {'a': 1, 'b': 1, 'c': 4}, pick 2. It is impossible to produce 'c' twice as often as 'a' and 'b' combined.
  • ech
    ech over 9 years
    @JasonOrendorff—You're right that some inputs don't work if we want weights to be proportional to likelihood of being chosen. But it seems fairly likely that someone wanting a "choose a subset of k weighted elements" algorithm means this by "weighted".
  • ech
    ech about 9 years
    @Jason Orendorff — What exactly does "weight" mean in random_weighted_sample_no_replacement()? In other words, if a client of this function cared about the exact behavior of it, what requirements might they have that would cause them to want this behavior? What properties does it have? I'm asking because I can't think of a meaningful answer, and if there isn't one then I think it might be a good idea to not leave this implementation here for people to find.
  • Jason Orendorff
    Jason Orendorff about 9 years
    @ech I would understand your concern better if you explained why the no-replacement algorithm bothers you and the with-replacement algorithm does not. Both use the same interpretation of "weight" (i.e. it's a measure of the probability of being picked in any given selection, not the probability of being picked overall after n selections)
  • Jason Orendorff
    Jason Orendorff about 9 years
    I think you can eliminate the correlation between selected elements by making a copy of weighted_set at the beginning and shuffling it.
  • Jason Orendorff
    Jason Orendorff about 9 years
    On reflection I'm not sure how I would go about proving that.
  • ech
    ech about 9 years
    Same here on both counts :)
  • ech
    ech about 9 years
    Your with-replacement algorithm seems fine because "weight" means "relative probability of being chosen", which I think is what people mean by "weight" in this context. Your no-replacement algorithm seems problematic to me because "weight" doesn't mean that. I'm not sure if "weight" has a useful meaning in your no-replacement algorithm, which is why I'm asking if there is one.
  • Jason Orendorff
    Jason Orendorff about 9 years
    OK but we've been over this. The meaning of "weight" is that an item's probability of being picked, on a given selection, is proportional to its weight relative to the weight of other items still remaining to be selected.
  • Jason Orendorff
    Jason Orendorff about 9 years
    This has nice per-selection properties (e.g. if you draw a sample of N items, then draw a sample of M items from the leftovers, you get exactly the same distribution as if you pulled N+M items in the first place). Yours has end-to-end properties instead. Both seem useful to me.
  • ech
    ech about 9 years
    My point is that the probability of being picked being proportional to the weight relative to the remaining items seems like an implementation detail. I can't figure out why someone might want that behavior. The "per-selection" property you outline seems interesting, but it's hard to think in those terms when "the same distribution" isn't something that anyone wants in the first place. (Unless it is...) Perhaps an example of when a client might want this would help me?
  • ech
    ech about 9 years
    I think you ought to at least include a comment there clarifying what "weight" means in that example, since it isn't what people normally mean by that term.
  • Joe K
    Joe K almost 9 years
    Building the heap should be O(m log m), no? You need to sum up O(log m) items for each element. So overall O((m+n) log m)?
  • Jason Orendorff
    Jason Orendorff almost 9 years
    @JoeK No, it's O(m). The algorithm in rws_heap() looks linear, right? Normally building a heap would be O(m log m), but this is an unusual use of a heap. Here the nodes do not need to be sorted at all. So we don't do any comparisons or swaps when building the heap. Rather, each node is assigned a number, the total weight tw, that's naturally greater than its child nodes' total weight. So the heap property is true "by construction", if that makes sense.
  • Jason Orendorff
    Jason Orendorff almost 9 years
    @JoeK A previous commenter said "I found the name 'heap' misleading for what you did", and I had to agree. I don't know a better word for it, though!
  • Joe K
    Joe K almost 9 years
    @JasonOrendorff ah, you're right. I read the description and only skimmed the code (as I'm not a python guy) and assumed that you would be updating the log(m) parent values as you insert new values. But doing it from the bottom up at the end in one pass is of course O(m) (and very clever!)
  • ldasilva
    ldasilva almost 9 years
    I think this is the only correct answer here after so many years. I once found this correct approach myself but never remembered to come back here. I will read your answer thoroughly and accept it.
  • robbat2
    robbat2 over 8 years
    This version returns wrong answers compared to Jason Orendorff's posting. Specifically, on weights like pick(4, unique) from [1,1,1,1,9996], the results of the low-weight items are NOT even.
  • Jason Orendorff
    Jason Orendorff about 8 years
    @Widdershins I backed out your edit changing > to >= in two places. Note that gas is a random floating-point number, so it's very unlikely to matter. But if we do run out of gas exactly on the boundary between two regions, I think we should probably round down. random.random() returns a number 0 <= n < 0.5 with probability 1/2 and a number 0.5 <= n < 1 with probability 1/2; note that the exact value 0.5 is in the first set, not the second.
  • Mumbleskates
    Mumbleskates about 8 years
    @Jason Orendorff You should absolutely round up. 100% certainty. Re-read your statement about sets from 0.0->0.5->1.0 again very carefully, it's patently incorrect.
  • Mumbleskates
    Mumbleskates about 8 years
    @JasonOrendorff Just because you don't notice fencepost errors when using floating point numbers doesn't mean they are not there.
  • Jason Orendorff
    Jason Orendorff about 8 years
    Heh! You are totally right. 0.5 is in the second set. OK, I'm cool with >=.
  • Mumbleskates
    Mumbleskates about 8 years
    @JasonOrendorff Yeah, it took me like an hour trying to figure out why some elements were never returning. My data set is integer-weighted, and the algorithm works the same with integer weights but produces more obvious bugs. I wrote up tests for JS, Python, and pure-SQL and eventually wrapped my head around the whole algorithm. For things like this, I feel it's best to start out making it for integers, which will always generalize readily into floating point since they are largely isomorphic.
  • Royi
    Royi over 5 years
    Is there a reference for this method? I found Weighted Random Sampling, which has a similar idea. While I see the logic in it, what I miss is a reference to show this is the correct distribution.
  • Joe K
    Joe K over 5 years
    No, sorry, I don't have one; this was something I came up with on my own. So you're right to be skeptical. But I'd be surprised if someone smarter than I hadn't discovered it before and given it a more rigorous treatment than a stack overflow post.
  • gaspar
    gaspar over 4 years
    I implemented this in JavaScript but it does not give the expected results: jsfiddle.net/gasparl/kamub5rq/18 - any ideas why?
  • Stef
    Stef over 2 years
    "If the sampling is without replacement, you can adapt the above technique by removing the selected element from the list after each iteration, then re-normalizing the weights so that their sum is 1 (valid probability distribution function)" This is a tempting algorithm, but if you try to prove that it's correct, you'll find out it isn't: it actually skews the distribution. However, there is a very clever way to adapt the roulette-wheel selection to sampling without replacement.
  • Stef
    Stef over 2 years
    Sampling k items without replacement: (1) Normalize the weights w_i so that sum w_i = k. (2) Compute the cumulative weights (sorting isn't necessary). (3) Pick a random number x in [0,1]. (4) Select the k elements corresponding to the k numbers: x, x+1, x+2, ..., x+k-1.
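
A minimal sketch of the procedure in that last comment (function and names are mine; like weighted_choose_subset above, it assumes no single weight exceeds the total weight divided by k):

import random

def systematic_sample(items, k):
    """items: (value, weight) pairs. Returns k values, without replacement."""
    total = float(sum(w for _, w in items))
    scale = k / total            # step 1: normalize weights to sum to k
    x = random.random()          # step 3: one random start in [0, 1)
    picks, cum = [], 0.0
    for v, w in items:           # step 2: walk the cumulative weights
        cum += w * scale
        while x < cum:           # step 4: collect the points x, x+1, ..., x+k-1
            picks.append(v)
            x += 1.0
    return picks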