Create constrained random numbers?

python algorithm statistics probability

10,172

Solution 1

Suppose you need n_1 in [10,30], n_2 in [20,40], n_3 in [30,50] and n1+n2+n3=90

If you need each possible triplet (n_1, n_2, n_3) to be equally-likely, that's going to be difficult. The number of triples of the form (20, n_2, n_3) is greater than the number of triples (10, n_2, n_3), so you can't just pick n_1 uniformly.

The incredibly slow but accurate way is to generate the all 5 randoms in the correct ranges and reject the whole group if the sum is not correct.

. . . AHA!

I found a way to parametrize the choice effectively. First, though, for simplicity note that the sum of the low bounds is the minimum possible sum. If subtract the sum of the low bounds from the target number and subtract the low bound from each generated number, you get a problem where each number is in the interval [0, max_k-min_k]. That simplifies the math and array (list) handling. Let n_k be the 0-based choice with 0<=n_k<=max_k-min_k.

The order of the sums is lexicographic, with all sums beginning with n_1=0 (if any) first, then n_1==1 sums, etc. Sums are sorted by n_2 in each of those groups, then by n_3, and so on. If you know how many sums add to the target (call that T), and how many sums start with n_1=0, 1, 2, ... then you can find the starting number n1 of sum number S in in that list. Then you can reduce the problem to adding n_2+n_3+... to get T-n_1, finding sum number S - (number original sums starting with number less than n_1).

Let pulse(n) be a list of n+1 ones: (n+1)*[1] in Python terms. Let max_k,min_k be the limits for the k'th choice, and m_k = max_k-min_k be the upper limit for 0-based choices. Then there are 1+m_1 different "sums" from the choice of the first number, and pulse(m_k) gives the distribution: 1 was to make each sum from 0 to m_1. For the first two choices, there are m_1+m_+1 different sums. It turns out that the convolution of pulse(m_1) with pulse(m_2) gives the distribution.

Time to stop for some code:

    def pulse(width, value=1):
        ''' Returns a vector of (width+1) integer ones. '''
        return (width+1)*[value]

    def stepconv(vector, width):
        ''' Computes the discrete convolution of vector with a "unit"
            pulse of given width.

            Formula: result[i] = Sum[j=0 to width] 1*vector[i-j]
            Where 0 <= i <= len(vector)+width-1, and the "1*" is the value
            of the implied unit pulse function: pulse[j] = 1 for 0<=j<=width.
        '''
        result = width*[0] + vector;
        for i in range(len(vector)):
            result[i] = sum(result[i:i+width+1])
        for i in range(len(vector), len(result)):
            result[i] = sum(result[i:])
        return result

That's coded specifically for only doing convolutions with a "pulse" array, so every linear combination in the convolution is just a sum.

Those are used only in the constructor of the final class solution:

class ConstrainedRandom(object):
    def __init__(self, ranges=None, target=None, seed=None):
        self._rand = random.Random(seed)
        if ranges != None: self.setrange(ranges)
        if target != None: self.settarget(target)

    def setrange(self, ranges):
        self._ranges = ranges
        self._nranges = len(self._ranges)
        self._nmin, self._nmax = zip(*self._ranges)
        self._minsum = sum(self._nmin)
        self._maxsum = sum(self._nmax)
        self._zmax = [y-x for x,y in self._ranges]
        self._rconv = self._nranges * [None]
        self._rconv[-1] = pulse(self._zmax[-1])
        for k in range(self._nranges-1, 0, -1):
            self._rconv[k-1] = stepconv(self._rconv[k], self._zmax[k-1])

    def settarget(self, target):
        self._target = target

    def next(self, target=None):
        k = target if target != None else self._target
        k = k - self._minsum;
        N = self._rconv[0][k]
        seq = self._rand.randint(0,N-1)
        result = self._nranges*[0]
        for i in range(len(result)-1):
            cv = self._rconv[i+1]
            r_i = 0
            while k >= len(cv):
                r_i += 1
                k -= 1
            while cv[k] <= seq:
                seq -= cv[k]
                r_i += 1
                k -= 1
            result[i] = r_i
        result[-1] = k # t
        return [x+y for x,y in zip(result, self._nmin)]
    
    # end clss ConstrainedRandom

Use that with:

ranges = [(low, high), (low, high), ...]
cr = ConstrainedRandom(ranges, target)
seq = cr.next();
print(seq)
assert sum(seq)==target

seq = cr.next(); # get then get the next one.

...etc. The class could be trimmed down a bit, but the main space overhead is in the _rconv list, which has the stored convolutions. That's roughly N*T/2, for O(NT) storage.

The convolutions only use the ranges, with a lot of randoms generated with the same constraints, the table construction time "amortizes away" to zero. The time complexity of .next() is roughly T/2 on average and O(T), in terms of the number of indexes into the _rconv lists.

To see how the algorithm works, assume a sequence of 3 zero-based choices, with max values (5,7,3), and a 0-based target T=10. Define or import the pulse and stepconv functions in an Idle session, then:

>>> pulse(5)
[1, 1, 1, 1, 1, 1]
>>> K1 = pulse (5)
>>> K2 = stepconv(K1, 7)
>>> K3 = stepconv(K2, 3)
>>> K1
[1, 1, 1, 1, 1, 1]
>>> K2
[1, 2, 3, 4, 5, 6, 6, 6, 5, 4, 3, 2, 1]
>>> K3
[1, 3, 6, 10, 14, 18, 21, 23, 23, 21, 18, 14, 10, 6, 3, 1]
>>> K3[10]
18
>>> sum(K3)
192
>>> (5+1)*(7+1)*(3+1)
192

K3[i] shows the number of different choice n_1, n_2, n_3 such that 0 <= n_k <= m_k and Σ n_k = i. Letting * mean convolution when applied to two of these lists. Then pulse(m_2)*pulse(m_3) is gives the distribution of sums of n_2 and n_3:

>>> R23 = stepconv(pulse(7),3)
>>> R23
[1, 2, 3, 4, 4, 4, 4, 4, 3, 2, 1]
>>> len(R23)
11

Every value from 0 to T=10 is (barely) possible, so any choice is possible for the first number and there are R23[T-n_1] possible triplets adding to T=10 that start with N1. So, once you've found that there are 18 possible sums adding to 10, generate a random number S = randint(18) and count down through the R23[T:T-m_1-1:-1] array:

>>> R23[10:10-5-1:-1]
[1, 2, 3, 4, 4, 4]
>>> sum(R23[10:10-5-1:-1])
18

Note the sum of that list is the total computed in K3[10] above. A sanity check. Anyway, if S==9 was the random choice, then find how many leading terms of that array can be summed without exceeding S. That's the value of n_1. In this case 1+2+3 <= S but 1+2+3+4 > S, so n_1 is 3.

As described above, you can then reduce the problem to find n_2. The final number (n_3 in this example) will be uniquely determined.

Solution 2

First, define your ranges:

ranges = [range(11,30), range(6,20), range(11,25)]

Then enumerate all possibilities:

def constrainedRandoms(ranges):
    answer = []
    for vector in itertools.product(*ranges):
        if sum(ranges) == 100:
            answer.append(vector)
    return answer

Equivalent one-liner:

answer = [v for v in itertools.product(*ranges) if sum(v)==100]

Then pick a random element from them:

myChoice = random.choice(answer)

EDIT:

The cartesian product (computed with itertools.product) itself doesn't eat much RAM (this is because itertools.product returns a generator, which uses O(1) space, but a lot of time as you pointed out). Only computing the list (answer) does. The bad news is that in order to use random.choice, you do need a list. If you really don't want to use a list, then you might need to come up with a decaying probability function. Here's a very simple probability function that you could use:

def constrainedRandoms(ranges):
    choices = (v for v in itertools.product(*ranges) if sum(v)==100) # note the parentheses. This is now a generator as well

    prob = 0.5
    decayFactor = 0.97 # set this parameter yourself, to better suit your needs
    try:
        answer = next(choices)
    except StopIteration:
        answer = None
    for choice in choices:
        answer = choice
        if random.uniform(0,1) >= prob:
            return answer
        prob *= decayFactor
    return answer

The decaying probability allows you increase the probability with which you will select the next vector that fits your constraints. Think about it this way: you have a bunch of constraints. Chances are, that you'll have a relatively small number of vectors that satisfy these constraints. This means that every time you decide not to use a vector, the probability that there is another such vector decreases. Therefore, over time, you need to be progressively more biased towards returning the current vector as "the randomly selected vector". Of course, it is still possible to go through the entire loop structure without ever returning a vector. This is why the code starts with a default of the first vector that fits the constraints (assuming one exists) and returns the last such vector if the decaying probability never allows for a vector to be returned.
Note that this decaying probability idea allows you to not have to iterate through all candidate vectors. With time, the probability of the code returning the current vector under consideration increases, thus reducing the probability of it continuing on and considering other vectors. This idea therefore helps mitigate your concerns about the runtime as well (though, I'm embarrassed to add, not by very much)

Solution 3

Try this:

import random
a = random.randint(10, 30)
b = random.randint(5, 20)
c = random.randint(10, 25)
d = random.randint(5, 15)
e = 100 - (a+b+c+d)

EDIT:

Here's how you'd generate a list of n random numbers, given n range constraints and the desired target sum:

def generate(constraints, limit):
    ans = [random.choice(c) for c in constraints]
    return ans if sum(ans) == limit else generate(constraints, limit)

This is how you'd use it:

generate([range(10,31),range(5,21),range(10,26),range(5,16),range(10,26)], 100)
=> [25, 20, 25, 12, 18]

Be aware, though: if the constraints don't guarantee that the sum is eventually reached, you'll get an infinite loop and a stack overflow error, for example:

generate([range(1,11), range(10, 21)], 100)
=> RuntimeError: maximum recursion depth exceeded while calling a Python object

10,172

Author by

Orvar Korvar

Updated on June 14, 2022

Comments

Orvar Korvar almost 2 years
CLEANED UP TEXT:

How can I create m=5 random numbers that add upp to, say n=100. But, the first random number is say, 10 < x1 < 30, the second random nr is 5 < x2 < 20, the third random nr is 10 < x3 < 25, etc. So these five random numbers add up to 100. How can I create these constrained five numbers?

.

[[

Related problem A1): The standard way to create five random numbers that add up to 100, is to sample four numbers between [0,100], and add the boundaries 0 and 100, and then sort these six numbers [0,x1,x2,x3,x4,100]. The five random numbers I seek, are the deltas. That is,
```
100 - x[4] = delta 5
x[4]- x[3] = delta 4
x[3]- x[2] = delta 3
x[2]- x[1] = delta 2
x[1] - 0   = delta 1
```
These five deltas will now add up to 100. For instance, they might be 0,1,2,7,90. Here is some code that solves this problem:
```
total_sum = 100
n = 5
v = numpy.random.multinomial(total_sum, numpy.ones(n)/n)
```
]]

.

For my problem, I can not allow wide intervals to occur, the largest spread above is 90-7 = 83 which is too wide. So, I have to specify a tighter spread, say [10,30]. This means the largest random number is 30, which disallows large spreads such as 83.

.

[[

Related problem A2): A partial solution to create five numbers with identical boundaries, 10 < x_i < 30, that adds up to 100 is like this: Just do like in A1) but add the lower boundary 10, to the deltas. So I get the five random numbers that I seek like this:
```
100 - x[4] = delta 5 + 10
x[4]- x[3] = delta 4 + 10
x[3]- x[2] = delta 3 + 10
x[2]- x[1] = delta 2 + 10
x[1] - 0   = delta 1 + 10
```
Basically, I do exactly like in A1) but do not start from 0, but start from 10. Thus, each number has the lower boundary 10, but they dont have an upper boundary, it can be large, too large. How to limit the upper boundary to 30? Here the problem is how to limit the upper boundary

]]

.

To recapitulate, the type of the problem I try to solve looks like this: I need five random numbers adding up to 100 and I need to specify the boundaries separately for each number, say [10,30] for the first random number, and then [5,10] for the second random number, and [15,35] for the third random number, etc. And they must all add up to 100.

But the real data I am using, has ~100 numbers x_i (m=50), all of them adding up to say ~400,000. And the range is typically [3000,5000] for a number x_i. These numbers are not really accurate, I am only trying to convey something about the problem size. The purpose is to do a MCMC simulation so these numbers need to be quickly generated. People have suggested very elegant solutions that really do work, but they take too long time, so I can not use them. The problem is still unsolved. Ideally I would like an O(m) solution and O(1) memory solution.

This problem should not be NP-hard, it doesnt feel like it. There should be a polynomial time solution, right?
BrenBarn over 10 years

I think this is the right way to get started, but note that this doesn't guarantee that e satisfies whatever constraints might exist on it (e.g., he might have a constraint that says 10 < e < 20, but with some random choices for the others, e could be forced to be 70).
Óscar López over 10 years

@BrenBarn I'm aware of that, but all the constraints must be explicitly specified (no "etc." as in the question). With that, it'd be possible to keep generating the 5 numbers in a loop until all constraints are satisfied
JeremyFromEarth over 10 years

What is the asterisk in (*ranges) about?
inspectorG4dget over 10 years

it does list and tuple unpacking. So in this case, it sends the sublists contained in ranges as individual arguments to itertools.product as opposed to sending ranges by itself, which is a list of lists
JeremyFromEarth over 10 years

Nice, that is a great feature. Thanks.
Óscar López over 10 years

@BrenBarn I updated my answer with a more general solution addressing your concerns
Orvar Korvar over 10 years

Your suggestion solves the problem I asked for, in a very neat way. Thank you. However, the time complexity is quite bad. Basically, you are creating the cartesian product (all possible combinations) and if one of the combination adds up to 100, it is a valid answer. I have many numbers so the cartesian product would be quite large, eating up RAM. Anyway, if I cant find a more efficient way, I will use your solution. At last I have a solution! Thanks! :o)
Orvar Korvar over 10 years

Thanks for your 2nd edited suggestion. It solves the problem too. As I understand it, you are basically generating random numbers and if they add up to 100, you are done. Otherwise, you generate some new numbers and try again. That might take time though. I would prefer a more efficient way. But I might use your solution if I cant find a more efficient way. Thanks! :)
Orvar Korvar over 10 years

A question, you are specifying four constraints, but your solution produce five random numbers. So the fifth number, how does it look like, what are it's constraints? I need to specify the fifth number too.
Óscar López over 10 years

@OrvarKorvar for a few numbers (say, 10? 100?) it's perfectly fine to retry until you find a set of numbers that match the given constraints. Granted, it's a brute-force approach, but it might be quick enough for your needs. Use it, profile it and if it's good enough, leave it as it is. Only optimize if it truly becomes a bottleneck, remember - early optimization is the root of all evil!
Orvar Korvar over 10 years

Thank you for your explanation on itertools.product returns a generator so it does not eat RAM. But, it costs much to create a cartesian product of 200 constraints looking like [-400, 600], right? This is what I have, I am going to do a MCMC simulation to find the optimum, so it needs to be quite quick. So, yes, you have a very neat solution, but costly. PS. I did not understand the decaying... something bit, could you explain it more? I dont mind using a list, so I dont care about decaying functions, but it made me curious.
Orvar Korvar over 10 years

Ok, I get it. You are basically doing: x5 = 100 - (x1+x2+x3+x4), so the value of x5 is what is leftover. Right? So, basically, x1, x2,x3, x4 could all have very low values, which makes x5 very large, maybe take the value 90? So this solution allows x5 to be very large. So I cant use your solution, but thanks for your suggestion! :)
Orvar Korvar over 10 years

Argh, I like your answer too, but it might run for a long time before finding random numbers that fulfill all constraints, theoretically forever. So I accept the above answer, which gives me an answer deterministically. Anyway, thank you! :)
Orvar Korvar over 10 years

Thank you. I tried this, and it took too much time and killed my PC. I am doing 100 numbers, with typical range(3000,6000). This takes too much time. I need a quicker solution. :(
inspectorG4dget over 10 years

@OrvarKorvar: The quickest solution I can think of is parallelization - get a bunch of parallel threads to return a vector from the generator. I suspect however, that a pruned, search approach might work better. However, I am unable to think of one off the top of my head at the moment. Will keep thinking about it and post later tonight if I think of something that might help. In the meantime, please clean up your question and add a formal problem description, which will help me understand the problem better and better communicate my solution
JeremyFromEarth over 10 years

x5 is only going to be as large as you allow it to be based on the constraints given. In the example, the largest x5 could be is 70. If that is too high just change the lowest number in each constraint.
inspectorG4dget over 10 years

@OrvarKorvar: We should clean up the comments (delete the comments that are no longer relavent). I think only the last two and first two comments are still relevant
Orvar Korvar over 10 years

It seems that there are some constraints in your solution? Like, the spans can not be arbitrary but must be descending in order by width? My problems need arbitrary spans. :(
Orvar Korvar over 10 years

Wow! This was really diligent. I dont really understand your solution yet, as it seems very advanced. I am impressed! :) Three questions: A) Could you please provide a high level overview to your approach, so I can read some and be able to understand your solution? B) Could you please provide a manual how to use your solution? An example with some constraints and how to get the solution. C) Does your solution have any constraints?
Orvar Korvar over 10 years

Clarification regarding A) something like "I am using discrete convolutions, because they have the property that adding two numbers will produce a third looking like this... "
James Waldby - jwpat7 over 10 years

Orvar, spans can be arbitrary but if they are out of width-order then results are returned out of order. There are half a dozen ways to deal with the problem. The simplest is to create a function that re-orders the result.
Mike Housky over 10 years

@OrvarKorvar: The "manual" is already in there, under "Use that with:" after the class definition. Type/paste the two functions and class into constrainedrandom.py, then from contrainedrandom import *, then try a few samples.
Mike Housky over 10 years

@orvarKorvar: I use discrete convolutions because they compute the distribution of the sum of two randoms. If random A has distribution D_A and random B has distribution D_B, then the distribution of the random sum (A+B) is the convolution of D_A and D_B. All of the material below the horizontal line, starting with "To see how the algorithm works" is my attempt at showing exactly what tables are needed and why. Import the pulse and stepconv functions and step through that.