How to choose keys from a python dictionary based on weighted probability?


Solution 1

import random

def weighted_random_by_dct(dct):
    # Assumes the values of dct are normalized weights that sum to 1.
    rand_val = random.random()
    total = 0
    for k, v in dct.items():
        total += v
        # Return the first key whose cumulative weight covers rand_val.
        if rand_val <= total:
            return k
    assert False, 'unreachable'

This should do the trick: it walks through the keys, keeping a running sum of the weights, and returns the first key whose cumulative slot contains the random value (drawn uniformly from [0, 1)). Note that it assumes the weights are normalized so they sum to 1; see the example below.
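For example, using the dictionary from the question (a quick sketch; the Counter tally is only there to eyeball that the empirical frequencies match the weights):

from collections import Counter

d = {'a': 0.0625, 'c': 0.625, 'b': 0.3125}

# Draw 10,000 samples and tally how often each key comes up.
counts = Counter(weighted_random_by_dct(d) for _ in range(10000))
print(counts)  # roughly 625 'a's, 3125 'b's, 6250 'c's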

Solution 2

Starting in Python 3.6, you can use the built-in random.choices() instead of having to use NumPy.

So then, if we want to sample (with replacement) 20 keys from your dictionary, where the values are the weights/probabilities of being sampled, we can simply write:

import random
random.choices(list(my_dict.keys()), weights=my_dict.values(), k=20)

This outputs a list of sampled keys:

['c', 'b', 'c', 'b', 'b', 'c', 'c', 'c', 'b', 'c', 'b', 'c', 'b', 'c', 'c', 'c', 'c', 'c', 'a', 'b']

If you just want one key, set k to 1 and extract the single element from the list that random.choices returns:

random.choices(list(my_dict.keys()), weights=my_dict.values(), k=1)[0]

(If you don't convert my_dict.keys() to a list, you'll get a TypeError about how it's not subscriptable.)
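Putting it together with the dictionary from the question (a minimal sketch; my_dict here stands in for the d defined in the question):

import random

my_dict = {'a': 0.0625, 'c': 0.625, 'b': 0.3125}

# One weighted draw: ~6.25% 'a', ~31.25% 'b', ~62.5% 'c'.
key = random.choices(list(my_dict.keys()), weights=my_dict.values(), k=1)[0]
print(key)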

Here's the relevant snippet from the docs:

random.choices(population, weights=None, *, cum_weights=None, k=1)

Return a k sized list of elements chosen from the population with replacement. If the population is empty, raises IndexError.

If a weights sequence is specified, selections are made according to the relative weights. Alternatively, if a cum_weights sequence is given, the selections are made according to the cumulative weights (perhaps computed using itertools.accumulate()). For example, the relative weights [10, 5, 30, 5] are equivalent to the cumulative weights [10, 15, 45, 50]. Internally, the relative weights are converted to cumulative weights before making selections, so supplying the cumulative weights saves work.

If neither weights nor cum_weights are specified, selections are made with equal probability. If a weights sequence is supplied, it must be the same length as the population sequence. It is a TypeError to specify both weights and cum_weights.

The weights or cum_weights can use any numeric type that interoperates with the float values returned by random() (that includes integers, floats, and fractions but excludes decimals). Weights are assumed to be non-negative.

For a given seed, the choices() function with equal weighting typically produces a different sequence than repeated calls to choice(). The algorithm used by choices() uses floating point arithmetic for internal consistency and speed. The algorithm used by choice() defaults to integer arithmetic with repeated selections to avoid small biases from round-off error.

According to the comments at https://stackoverflow.com/a/39976962/5139284, random.choices is faster for small arrays, and numpy.random.choice is faster for big arrays. numpy.random.choice also provides an option to sample without replacement, whereas the Python standard library has no built-in function for weighted sampling without replacement.
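To illustrate the cum_weights parameter mentioned in the docs above, here is a quick sketch showing that precomputed cumulative weights (via itertools.accumulate) describe the same distribution as the relative weights:

import random
from itertools import accumulate

population = ['a', 'b', 'c', 'd']
weights = [10, 5, 30, 5]
cum_weights = list(accumulate(weights))  # [10, 15, 45, 50]

# These two calls sample from the same distribution; the second just
# skips the internal accumulation step.
random.choices(population, weights=weights, k=5)
random.choices(population, cum_weights=cum_weights, k=5)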

Solution 3

If you're planning to do this a lot, you could use NumPy to select your keys from a list with weighted probabilities using np.random.choice(). The example below picks from your keys 10,000 times with the given weighted probabilities.

import numpy as np

probs = [0.0625, 0.625, 0.3125]  # must sum to 1
keys = ['a', 'c', 'b']

choice_list = np.random.choice(keys, 10000, replace=True, p=probs)
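To sanity-check the output, you can tally the samples (a quick sketch using collections.Counter):

from collections import Counter

print(Counter(choice_list))
# roughly 625 'a's, 6250 'c's, 3125 'b's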

Solution 4

Not sure what your use case is here, but you can check out the frequency distribution/probability distribution classes in the NLTK package, which handle all the nitty-gritty details.

FreqDist is an extension of collections.Counter, and it can be passed to a ProbDistI implementation. The ProbDistI interface exposes a generate() method, which can be used to sample from the distribution, as well as a prob(sample) method that returns the probability of a given key.

For your case you'd want to use Maximum Likelihood Estimation, so the MLEProbDist. If you want to smooth the distribution, you could try LaplaceProbDist or SimpleGoodTuringProbDist.

For example:

from nltk.probability import FreqDist, MLEProbDist

d = {'a': 6.25, 'c': 62.5, 'b': 31.25}
freq_dist = FreqDist(d)
prob_dist = MLEProbDist(freq_dist)

print(prob_dist.prob('a'))
print(prob_dist.prob('b'))
print(prob_dist.prob('c'))
print(prob_dist.prob('d'))

will print 0.0625, 0.3125, 0.625, and 0.0 (one value per line).

To generate a new sample, you can use:

prob_dist.generate()
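For example, drawing many samples and tallying them (a quick sketch building on the prob_dist defined above):

from collections import Counter

samples = [prob_dist.generate() for _ in range(10000)]
print(Counter(samples))  # roughly 625 'a's, 3125 'b's, 6250 'c's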

Solution 5

If you are able to use numpy, you can use the numpy.random.choice function, like so:

import numpy as np

d = {'a': 0.0625, 'c': 0.625, 'b': 0.3125}

def pick_by_weight(d):
    # Collect the keys and their probabilities into parallel lists,
    # then let np.random.choice make the weighted draw.
    d_choices = []
    d_probs = []
    for k, v in d.items():
        d_choices.append(k)
        d_probs.append(v)
    return np.random.choice(d_choices, 1, p=d_probs)[0]

choice = pick_by_weight(d)
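Note that np.random.choice requires the probabilities in p to sum to 1. If your weights are unnormalized, a small helper like the hypothetical normalize below (not part of the original answer) can be applied first:

def normalize(d):
    # Scale the weights so they sum to 1, as np.random.choice's `p` requires.
    total = sum(d.values())
    return {k: v / total for k, v in d.items()}

choice = pick_by_weight(normalize({'a': 1, 'b': 5, 'c': 10}))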
Author: Joseph

Updated on June 26, 2022

Comments

  • Joseph almost 2 years

    I have a Python dictionary where keys represent some item and values represent some (normalized) weighting for said item. For example:

    d = {'a': 0.0625, 'c': 0.625, 'b': 0.3125}
    # Note that sum(d.values()) == 1 for all `d`
    

    Given this correlation of items to weights, how can I choose a key from d such that 6.25% of the time the result is 'a', 31.25% of the time the result is 'b', and 62.5% of the time the result is 'c'?