How to generate random numbers to satisfy a specific mean and median in python?

13,133

Solution 1

One way to get a result really close to what you want is to generate two separate random ranges with length 100 that satisfies your median constraints and includes all the desire range of numbers. Then by concatenating the arrays the mean will be around 12 but not quite equal to 12. But since it's just mean that you're dealing with you can simply generate your expected result by tweaking one of these arrays.

In [162]: arr1 = np.random.randint(2, 7, 100)    
In [163]: arr2 = np.random.randint(7, 40, 100)

In [164]: np.mean(np.concatenate((arr1, arr2)))
Out[164]: 12.22

In [166]: np.median(np.concatenate((arr1, arr2)))
Out[166]: 6.5

Following is a vectorized and very much optimized solution against any other solution that uses for loops or python-level code by constraining the random sequence creation:

import numpy as np
import math

def gen_random(): 
    arr1 = np.random.randint(2, 7, 99)
    arr2 = np.random.randint(7, 40, 99)
    mid = [6, 7]
    i = ((np.sum(arr1 + arr2) + 13) - (12 * 200)) / 40
    decm, intg = math.modf(i)
    args = np.argsort(arr2)
    arr2[args[-41:-1]] -= int(intg)
    arr2[args[-1]] -= int(np.round(decm * 40))
    return np.concatenate((arr1, mid, arr2))

Demo:

arr = gen_random()
print(np.median(arr))
print(arr.mean())

6.5
12.0

The logic behind the function:

In order for us to have a random array with that criteria we can concatenate 3 arrays together arr1, mid and arr2. arr1 and arr2 each hold 99 items and the mid holds 2 items 6 and 7 so that make the final result to give as 6.5 as the median. Now we an create two random arrays each with length 99. All we need to do to make the result to have a 12 mean is to find the difference between the current sum and 12 * 200 and subtract the result from our N largest numbers which in this case we can choose them from arr2 and use N=50.

Edit:

If it's not a problem to have float numbers in your result you can actually shorten the function as following:

import numpy as np
import math

def gen_random(): 
    arr1 = np.random.randint(2, 7, 99).astype(np.float)
    arr2 = np.random.randint(7, 40, 99).astype(np.float)
    mid = [6, 7]
    i = ((np.sum(arr1 + arr2) + 13) - (12 * 200)) / 40
    args = np.argsort(arr2)
    arr2[args[-40:]] -= i
    return np.concatenate((arr1, mid, arr2))

Solution 2

Here, you want a median value lesser than the mean value. That means that a uniform distribution is not appropriate: you want many little values and fewer great ones.

Specifically, you want as many value lesser or equal to 6 as the number of values greater or equal to 7.

A simple way to ensure that the median will be 6.5 is to have the same number of values in the range [ 2 - 6 ] as in [ 7 - 40 ]. If you choosed uniform distributions in both ranges, you would have a theorical mean of 13.75, which is not that far from the required 12.

A slight variation on the weights can make the theorical mean even closer: if we use [ 5, 4, 3, 2, 1, 1, ..., 1 ] for the relative weights of the random.choices of the [ 7, 8, ..., 40 ] range, we find a theorical mean of 19.98 for that range, which is close enough to the expected 20.

Example code:

>>> pop1 = list(range(2, 7))
>>> pop2 = list(range(7, 41))
>>> w2 = [ 5, 4, 3, 2 ] + ( [1] * 30)
>>> r1 = random.choices(pop1, k=2500)
>>> r2 = random.choices(pop2, w2, k=2500)
>>> r = r1 + r2
>>> random.shuffle(r)
>>> statistics.mean(r)
12.0358
>>> statistics.median(r)
6.5
>>>

So we now have a 5000 values distribution that has a median of exactly 6.5 and a mean value of 12.0358 (this one is random, and another test will give a slightly different value). If we want an exact mean of 12, we just have to tweak some values. Here sum(r) is 60179 when it should be 60000, so we have to decrease 175 values which were neither 2 (would go out of range) not 7 (would change the median).

In the end, a possible generator function could be:

def gendistrib(n):
    if n % 2 != 0 :
        raise ValueError("gendistrib needs an even parameter")
    n2 = n//2     # n / 2 in Python 2
    pop1 = list(range(2, 7))               # lower range
    pop2 = list(range(7, 41))              # upper range
    w2 = [ 5, 4, 3, 2 ] + ( [1] * 30)      # weights for upper range
    r1 = random.choices(pop1, k=n2)        # lower part of the distrib.
    r2 = random.choices(pop2, w2, k=n2)    # upper part
    r = r1 + r2
    random.shuffle(r)                      # randomize order
    # time to force an exact mean
    tot = sum(r)
    expected = 12 * n
    if tot > expected:                     # too high: decrease some values
        for i, val in enumerate(r):
            if val != 2 and val != 7:
                r[i] = val - 1
                tot -= 1
                if tot == expected:
                    random.shuffle(r)      # shuffle again the decreased values
                    break
    elif tot < expected:                   # too low: increase some values
        for i, val in enumerate(r):
            if val != 6 and val != 40:
                r[i] = val + 1
                tot += 1
                if tot == expected:
                    random.shuffle(r)      # shuffle again the increased values
                    break
    return r

It is really fast: I could timeit gendistrib(10000) at less than 0.02 seconds. But it should not be used for small distributions (less than 1000)

Solution 3

Ok, you're looking at the distribution which has no less than 4 parameters - two of those defining range and two responsible for required mean and median.

I could think about two possibilities from the top of my head:

  1. Truncated normal distribution, look here for details. You have already range defined, and have to recover μ and σ from mean and median. It will require solving couple of nonlinear equation, but quite doable in python. Sampling could be done using https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.truncnorm.html

  2. 4-parameters Beta distribution, see here for details. Again, recovering α and β in Beta distribution from mean and median will require solving couple of non-linear equations. Knowing them sampling would be easy via https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.beta.html

UPDATE

Here how you could do it for truncated normal going from mean to mu: Truncated normal with a given mean

Share:
13,133
MWH
Author by

MWH

Updated on June 17, 2022

Comments

  • MWH
    MWH almost 2 years

    I would like to generate n random numbers e.g., n=200, where the range of possible values is between 2 and 40 with a mean of 12 and median is 6.5.

    I searched everywhere and i could not find a solution for this. I tried the following script by it works for small numbers such as 20, for big numbers it takes ages and result is returned.

    n=200
    x = np.random.randint(0,1,size=n) # initalisation only
    while True:
            if x.mean() == 12 and np.median(x) == 6.5:
                break
            else:
                x=np.random.randint(2,40,size=n)
    

    Could anyone help me by improving this to get a quick result even when n=5000 or so?

  • Mazdak
    Mazdak about 6 years
    @Graipher Because that's how the Python's random.randint()'s distribution function works. It's sort of a normal distribution that act in a way so that the mean is around range/2. Which range is the range of numbers you pass to random.randint. In this case one array gives 6/2 = 3 and the other one 34/2 = 17 and the median between these two is around 10. But this is a general speculation and since at the end the median relates to all the numbers it gives something more than 10.
  • Graipher
    Graipher about 6 years
    The Beta distribution has no closed form for its median (except for a few very simple edge cases), so it is in general impossible to recover the values for a and b. An approximate expression for a, b > 1 exists, but it is not applicable in this case (assuming the expression, the resulting a, b are < 1 for mu = 12 and median = 6.5).
  • Severin Pappadeux
    Severin Pappadeux about 6 years
    @Graipher while this is true, it is not quite relevant - on top of couple non-linear equations there would be inverse expression for median. Function itself is available in python, I believe docs.scipy.org/doc/scipy-0.14.0/reference/generated/…
  • Graipher
    Graipher about 6 years
    But you would have to solve incomplete beta integral(a,b) == 1/2 (and mu = a/(a+b)) to find the values of a, b. While the second one is easy, the first is not, since you would need the inverse of the incomplete beta integral. You could probably do it numerically, though.
  • Severin Pappadeux
    Severin Pappadeux about 6 years
    @Graipher Sure, will do it numerically. For truncated normal numerically as well.
  • MWH
    MWH about 6 years
    Thank you! it works fine and very fast. I just added random.shuffle(arr) at the end to shuffle the whole array.
  • Mazdak
    Mazdak about 6 years
    @MWH Yes, shuffle will make it even better.
  • MWH
    MWH about 6 years
    Just i noticed the number generated are all integer because of randint(), can we modify it to generate float (decimal ) numbers e.g., 2.4 and 30.01?
  • Mazdak
    Mazdak about 6 years
    @MWH Definitely, In that case the normalization function will even be more simplified because you can simply subtract i from top 40 numbers. You can just replace all the lines between i=... and return .. with arr2[args[-40:]] -= i.
  • MWH
    MWH about 6 years
    I cannot get it. tried to remove all lines between. got error as args is not defined!. could you please clarify what you mean?
  • Mazdak
    Mazdak about 6 years
    @MWH Yes I missed that. You should leave the line args = np.argsort(arr2). Just remove one line above it and two lines beneath it and arr2[args[-40:]] -= i.
  • mikuszefski
    mikuszefski almost 5 years
    It is a nice answer. Based on this, I'd take 2 independent distributions with support on a bound interval. scale them accordingly to have the above intervals. Chose the shape factors such that the means add up to the required mean.
  • mikuszefski
    mikuszefski almost 5 years
    Just noticed that it is also in the direction of Severin Pappadeux's solution.