Generating non-repeating random numbers in Python

30,015

Solution 1

This is a neat problem, and I've been thinking about it for a while (with solutions similar to Sjoerd's), but in the end, here's what I think:

Use your point 1) and stop worrying.

Assuming real randomness, the probability that a random number has already been chosen before is the count of previously chosen numbers divided by the size of your pool, i.e. the maximal number.

If you say you only need a billion numbers, i.e. nine digits: Treat yourself to 3 more digits, so you have 12-digit serial numbers (that's three groups of four digits – nice and readable).

Even when you're close to having chosen a billion numbers previously, the probability that your new number is already taken is still only 0,1%.

Do step 1 and draw again. You can still check for an "infinite" loop, say don't try more than 1000 times or so, and then fallback to adding 1 (or something else).

You'll win the lottery before that fallback ever gets used.

Solution 2

You could use Format-Preserving Encryption to encrypt a counter. Your counter just goes from 0 upwards, and the encryption uses a key of your choice to turn it into a seemingly random value of whatever radix and width you want.

Block ciphers normally have a fixed block size of e.g. 64 or 128 bits. But Format-Preserving Encryption allows you to take a standard cipher like AES and make a smaller-width cipher, of whatever radix and width you want (e.g. radix 10, width 9 for the parameters of the question), with an algorithm which is still cryptographically robust.

It is guaranteed to never have collisions (because cryptographic algorithms create a 1:1 mapping). It is also reversible (a 2-way mapping), so you can take the resulting number and get back to the counter value you started with.

AES-FFX is one proposed standard method to achieve this.

I've experimented with some basic Python code for AES-FFX--see Python code here (but note that it doesn't fully comply with the AES-FFX specification). It can e.g. encrypt a counter to a random-looking 7-digit decimal number. E.g.:

0000000   0731134
0000001   6161064
0000002   8899846
0000003   9575678
0000004   3030773
0000005   2748859
0000006   5127539
0000007   1372978
0000008   3830458
0000009   7628602
0000010   6643859
0000011   2563651
0000012   9522955
0000013   9286113
0000014   5543492
0000015   3230955
...       ...

For another example in Python, using another non-AES-FFX (I think) method, see this blog post "How to Generate an Account Number" which does FPE using a Feistel cipher. It generates numbers from 0 to 2^32-1.

Solution 3

With some modular arithmic and prime numbers, you can create all numbers between 0 and a big prime, out of order. If you choose your numbers carefully, the next number is hard to guess.

modulo = 87178291199 # prime
incrementor = 17180131327 # relative prime

current = 433494437 # some start value
for i in xrange(1, 100):
    print current
    current = (current + incrementor) % modulo

Solution 4

If you don't need something cryptographically secure, but just "sufficiently obfuscated"...

Galois Fields

You could try operations in Galois Fields, e.g. GF(2)32, to map a simple incrementing counter x to a seemingly random serial number y:

x = counter_value
y = some_galois_function(x)
  • Multiply by a constant
    • Inverse is to multiply by the reciprocal of the constant
  • Raise to a power: xn
  • Reciprocal x-1
    • Special case of raising to power n
    • It is its own inverse
  • Exponentiation of a primitive element: ax

Many of these operations have an inverse, which means, given your serial number, you can calculate the original counter value from which it was derived.

As for finding a library for Galois Field for Python... good question. If you don't need speed (which you wouldn't for this) then you could make your own. I haven't tried these:

Matrix multiplication in GF(2)

Pick a suitable 32×32 invertible matrix in GF(2), and multiply a 32-bit input counter by it. This is conceptually related to LFSR, as described in S.Lott's answer.

CRC

A related possibility is to use a CRC calculation. Based on the remainder of long-division with an irreducible polynomial in GF(2). Python code is readily available for CRCs (crcmod, pycrc), although you might want to pick a different irreducible polynomial than is normally used, for your purposes. I'm a little fuzzy on the theory, but I think a 32-bit CRC should generate a unique value for every possible combination of 4-byte inputs. Check this. It's quite easy to experimentally check this, by feeding the output back into the input, and checking that it produces a complete cycle of length 232-1 (zero just maps to zero). You may need to get rid of any initial/final XORs in the CRC algorithm for this check to work.

Solution 5

If they don't have to be random, but just not obviously linear (1, 2, 3, 4, ...), then here's a simple algorithm:

Pick two prime numbers. One of them will be the largest number you can generate, so it should be around one billion. The other should be fairly large.

max_value = 795028841
step = 360287471
previous_serial = 0
for i in xrange(0, max_value):
    previous_serial += step
    previous_serial %= max_value
    print "Serial: %09i" % previous_serial

Just store the previous serial each time so you know where you left off. I can't prove mathmatically that this works (been too long since those particular classes), but it's demonstrably correct with smaller primes:

s = set()
with open("test.txt", "w+") as f:
    previous_serial = 0
    for i in xrange(0, 2711):
        previous_serial += 1811
        previous_serial %= 2711
        assert previous_serial not in s
        s.add(previous_serial)

You could also prove it empirically with 9-digit primes, it'd just take a bit more work (or a lot more memory).

This does mean that given a few serial numbers, it'd be possible to figure out what your values are--but with only nine digits, it's not likely that you're going for unguessable numbers anyway.

Share:
30,015
bigredbob
Author by

bigredbob

Updated on November 30, 2020

Comments

  • bigredbob
    bigredbob over 3 years

    Ok this is one of those trickier than it sounds questions so I'm turning to stack overflow because I can't think of a good answer. Here is what I want: I need Python to generate a simple a list of numbers from 0 to 1,000,000,000 in random order to be used for serial numbers (using a random number so that you can't tell how many have been assigned or do timing attacks as easily, i.e. guessing the next one that will come up). These numbers are stored in a database table (indexed) along with the information linked to them. The program generating them doesn't run forever so it can't rely on internal state.

    No big deal right? Just generate a list of numbers, shove them into an array and use Python "random.shuffle(big_number_array)" and we're done. Problem is I'd like to avoid having to store a list of numbers (and thus read the file, pop one off the top, save the file and close it). I'd rather generate them on the fly. Problem is that the solutions I can think of have problems:

    1) Generate a random number and then check if it has already been used. If it has been used generate a new number, check, repeat as needed until I find an unused one. Problem here is that I may get unlucky and generate a lot of used numbers before getting one that is unused. Possible fix: use a very large pool of numbers to reduce the chances of this (but then I end up with silly long numbers).

    2) Generate a random number and then check if it has already been used. If it has been used add or subtract one from the number and check again, keep repeating until I hit an unused number. Problem is this is no longer a random number as I have introduced bias (eventually I will get clumps of numbers and you'd be able to predict the next number with a better chance of success).

    3) Generate a random number and then check if it has already been used. If it has been used add or subtract another randomly generated random number and check again, problem is we're back to simply generating random numbers and checking as in solution 1.

    4) Suck it up and generate the random list and save it, have a daemon put them into a Queue so there are numbers available (and avoid constantly opening and closing a file, batching it instead).

    5) Generate much larger random numbers and hash them (i.e. using MD5) to get a smaller numeric value, we should rarely get collisions, but I end up with larger than needed numbers again.

    6) Prepend or append time based information to the random number (i.e. unix timestamp) to reduce chances of a collision, again I get larger numbers than I need.

    Anyone have any clever ideas that will reduce the chances of a "collision" (i.e. generating a random number that is already taken) but will also allow me to keep the number "small" (i.e. less than a billion (or a thousand million for your europeans =)).

    Answer and why I accepted it:

    So I will simply go with 1, and hope it's not an issue, however if it is I will go with the deterministic solution of generating all the numbers and storing them so that there is a guarentee of getting a new random number, and I can use "small" numbers (i.e. 9 digits instead of an MD5/etc.).

  • bigredbob
    bigredbob over 14 years
    They need to be human readable (so serial numbers are better than (68123erghdqwoc which would let me express say MD5 in only 16 characters). If I arbitrarily shorten SHA1 I get collisions just as regularly as I would for random numbers and run into the problem in solution 1. Like I said I'd like to keep the numbers between 1 and a billion.
  • James
    James over 14 years
    +1 Neat, don't have to store a list of numbers, and would be very satisfying for anyone who did reverse-engineer it! :)
  • Omran
    Omran over 14 years
    If someone had three serials which are issued right after each other, wouldn't it be easy to deduce the logic?
  • Glenn Maynard
    Glenn Maynard over 14 years
    (Ugh. Why is this site making well-established users randomly enter captchas? It's absurd.)
  • Glenn Maynard
    Glenn Maynard over 14 years
    You'd need to populate a table with a billion rows, which is a pretty inefficient. Don't forget that you need to remove each number as you use it.
  • Aspen van Wood
    Aspen van Wood over 14 years
    That's what I would advocate too.
  • Aspen van Wood
    Aspen van Wood over 14 years
    There is no guarantee that SHA hashes (especially shortened) are unique.
  • Aspen van Wood
    Aspen van Wood over 14 years
    Since he is talking about serial numbers, and has a database table storing them, I assume collisions would be pretty annoying (as in "mixing up two different customers" perhaps).
  • ghostdog74
    ghostdog74 over 14 years
    I think you should watch your tone. using a long hash is definitely a viable solution. If you are so clever , try to make SHA256 hash break then.
  • Andrew McGregor
    Andrew McGregor over 14 years
    Sure, but if they're more like software keys, then that may be a mere annoyance, as opposed to a serious bug.
  • Aspen van Wood
    Aspen van Wood over 14 years
    I'm sorry but suggesting to "take first 3/4/5 characters of the hash" when you want a genuinely unique number is stupid. Not to mention you have to take the hash of something anyway, so that something itself still has to be unique...
  • Khelben
    Khelben over 14 years
    Just add other 3 digits more if you want to be EXTRA sure! ;-)
  • jbochi
    jbochi over 14 years
    @Johan: Not if the prime numbers are really large. All modern cryptographic algorithms that I know rely on the fact that primes are hard to factorize.
  • Sjoerd
    Sjoerd over 14 years
    @Johan: You are right. If you have three or more serials the next number would be fairly easy to guess, no matter what your primes are. I was mistaken with another algoritm.
  • balpha
    balpha over 14 years
    @jbochi: That has nothing to do with this very algorithm. Also: I'll factorize any prime number in O(1), without a computer. But I guess we know what you meant :)
  • Sjoerd
    Sjoerd over 14 years
    With a multiplicative ring, the next number would be harder to predict. E.g. use 999998971 as prime, 1 as start, and 43 as primitive root. For a serial, multiply the previous serial by 43 modulo 999998971.
  • Sudhir Jonathan
    Sudhir Jonathan over 14 years
    Hmmm. Not to get into a flame war here, but when what you want is a to be able to assign a different key to a lot of items, you could do it. It might also be worth looking into URL shortening services and Git, both of which have been doing this for quite some time now.
  • bigredbob
    bigredbob over 14 years
    The annoyance is that if I create a list, shuffle and store then it's deterministic with 0 chance of using an already used number, plus I don't have to use overly long numbers.
  • balpha
    balpha over 14 years
    @bigredbob: I get that. But 1) 12 digits isn't really overly long, and 2) I'm just saying that the drawbacks of the simplest solution aren't nearly as big as you think. Don't forget that the probabilities multiply.
  • Hans-Peter Störr
    Hans-Peter Störr over 12 years
    Why should cryptographically strong exclude zero collisions? If you, for instance, use a sequence and encrypt each value with an algorithm that has output range 0-1000000000, you have both. It's not quite easy to find a strong algorithm that does this, though.
  • cjh
    cjh over 11 years
    Please provide more information about the nature of your solution.