Generating non-repeating random numbers in Python

python random numbers

30,015

Solution 1

This is a neat problem, and I've been thinking about it for a while (with solutions similar to Sjoerd's), but in the end, here's what I think:

Use your point 1) and stop worrying.

Assuming real randomness, the probability that a random number has already been chosen before is the count of previously chosen numbers divided by the size of your pool, i.e. the maximal number.

If you say you only need a billion numbers, i.e. nine digits: Treat yourself to 3 more digits, so you have 12-digit serial numbers (that's three groups of four digits – nice and readable).

Even when you're close to having chosen a billion numbers previously, the probability that your new number is already taken is still only 0,1%.

Do step 1 and draw again. You can still check for an "infinite" loop, say don't try more than 1000 times or so, and then fallback to adding 1 (or something else).

You'll win the lottery before that fallback ever gets used.

Solution 2

You could use Format-Preserving Encryption to encrypt a counter. Your counter just goes from 0 upwards, and the encryption uses a key of your choice to turn it into a seemingly random value of whatever radix and width you want.

Block ciphers normally have a fixed block size of e.g. 64 or 128 bits. But Format-Preserving Encryption allows you to take a standard cipher like AES and make a smaller-width cipher, of whatever radix and width you want (e.g. radix 10, width 9 for the parameters of the question), with an algorithm which is still cryptographically robust.

It is guaranteed to never have collisions (because cryptographic algorithms create a 1:1 mapping). It is also reversible (a 2-way mapping), so you can take the resulting number and get back to the counter value you started with.

AES-FFX is one proposed standard method to achieve this.

I've experimented with some basic Python code for AES-FFX--see Python code here (but note that it doesn't fully comply with the AES-FFX specification). It can e.g. encrypt a counter to a random-looking 7-digit decimal number. E.g.:

0000000   0731134
0000001   6161064
0000002   8899846
0000003   9575678
0000004   3030773
0000005   2748859
0000006   5127539
0000007   1372978
0000008   3830458
0000009   7628602
0000010   6643859
0000011   2563651
0000012   9522955
0000013   9286113
0000014   5543492
0000015   3230955
...       ...

For another example in Python, using another non-AES-FFX (I think) method, see this blog post "How to Generate an Account Number" which does FPE using a Feistel cipher. It generates numbers from 0 to 2^32-1.

Solution 3

With some modular arithmic and prime numbers, you can create all numbers between 0 and a big prime, out of order. ~~If you choose your numbers carefully, the next number is hard to guess.~~

modulo = 87178291199 # prime
incrementor = 17180131327 # relative prime
current = 433494437 # some start value
for i in xrange(1, 100):
    print current
    current = (current + incrementor) % modulo

Solution 4

If you don't need something cryptographically secure, but just "sufficiently obfuscated"...

Galois Fields

You could try operations in Galois Fields, e.g. GF(2)³², to map a simple incrementing counter x to a seemingly random serial number y:

x = counter_value
y = some_galois_function(x)

Multiply by a constant
- Inverse is to multiply by the reciprocal of the constant
Raise to a power: xⁿ
Reciprocal x^-1
- Special case of raising to power n
- It is its own inverse
Exponentiation of a primitive element: a^x
- Note that this doesn't have an easily-calculated inverse (discrete logarithm)
- Ensure a is a primitive element, aka generator

Many of these operations have an inverse, which means, given your serial number, you can calculate the original counter value from which it was derived.

As for finding a library for Galois Field for Python... good question. If you don't need speed (which you wouldn't for this) then you could make your own. I haven't tried these:

NZMATH
Finite field Python package
Sage, although it's a whole environment for mathematical computing, much more than just a Python library

Matrix multiplication in GF(2)

Pick a suitable 32×32 invertible matrix in GF(2), and multiply a 32-bit input counter by it. This is conceptually related to LFSR, as described in S.Lott's answer.

CRC

A related possibility is to use a CRC calculation. Based on the remainder of long-division with an irreducible polynomial in GF(2). Python code is readily available for CRCs (crcmod, pycrc), although you might want to pick a different irreducible polynomial than is normally used, for your purposes. I'm a little fuzzy on the theory, but I think a 32-bit CRC should generate a unique value for every possible combination of 4-byte inputs. Check this. It's quite easy to experimentally check this, by feeding the output back into the input, and checking that it produces a complete cycle of length 2³²-1 (zero just maps to zero). You may need to get rid of any initial/final XORs in the CRC algorithm for this check to work.

Solution 5

If they don't have to be random, but just not obviously linear (1, 2, 3, 4, ...), then here's a simple algorithm:

Pick two prime numbers. One of them will be the largest number you can generate, so it should be around one billion. The other should be fairly large.

max_value = 795028841
step = 360287471
previous_serial = 0
for i in xrange(0, max_value):
    previous_serial += step
    previous_serial %= max_value
    print "Serial: %09i" % previous_serial

Just store the previous serial each time so you know where you left off. I can't prove mathmatically that this works (been too long since those particular classes), but it's demonstrably correct with smaller primes:

s = set()
with open("test.txt", "w+") as f:
    previous_serial = 0
    for i in xrange(0, 2711):
        previous_serial += 1811
        previous_serial %= 2711
        assert previous_serial not in s
        s.add(previous_serial)

You could also prove it empirically with 9-digit primes, it'd just take a bit more work (or a lot more memory).

This does mean that given a few serial numbers, it'd be possible to figure out what your values are--but with only nine digits, it's not likely that you're going for unguessable numbers anyway.

View more solutions

30,015

Author by

bigredbob

Updated on November 30, 2020

Comments

bigredbob over 2 years

Ok this is one of those trickier than it sounds questions so I'm turning to stack overflow because I can't think of a good answer. Here is what I want: I need Python to generate a simple a list of numbers from 0 to 1,000,000,000 in random order to be used for serial numbers (using a random number so that you can't tell how many have been assigned or do timing attacks as easily, i.e. guessing the next one that will come up). These numbers are stored in a database table (indexed) along with the information linked to them. The program generating them doesn't run forever so it can't rely on internal state.

No big deal right? Just generate a list of numbers, shove them into an array and use Python "random.shuffle(big_number_array)" and we're done. Problem is I'd like to avoid having to store a list of numbers (and thus read the file, pop one off the top, save the file and close it). I'd rather generate them on the fly. Problem is that the solutions I can think of have problems:

1) Generate a random number and then check if it has already been used. If it has been used generate a new number, check, repeat as needed until I find an unused one. Problem here is that I may get unlucky and generate a lot of used numbers before getting one that is unused. Possible fix: use a very large pool of numbers to reduce the chances of this (but then I end up with silly long numbers).

2) Generate a random number and then check if it has already been used. If it has been used add or subtract one from the number and check again, keep repeating until I hit an unused number. Problem is this is no longer a random number as I have introduced bias (eventually I will get clumps of numbers and you'd be able to predict the next number with a better chance of success).

3) Generate a random number and then check if it has already been used. If it has been used add or subtract another randomly generated random number and check again, problem is we're back to simply generating random numbers and checking as in solution 1.

4) Suck it up and generate the random list and save it, have a daemon put them into a Queue so there are numbers available (and avoid constantly opening and closing a file, batching it instead).

5) Generate much larger random numbers and hash them (i.e. using MD5) to get a smaller numeric value, we should rarely get collisions, but I end up with larger than needed numbers again.

6) Prepend or append time based information to the random number (i.e. unix timestamp) to reduce chances of a collision, again I get larger numbers than I need.

Anyone have any clever ideas that will reduce the chances of a "collision" (i.e. generating a random number that is already taken) but will also allow me to keep the number "small" (i.e. less than a billion (or a thousand million for your europeans =)).

Answer and why I accepted it:

So I will simply go with 1, and hope it's not an issue, however if it is I will go with the deterministic solution of generating all the numbers and storing them so that there is a guarentee of getting a new random number, and I can use "small" numbers (i.e. 9 digits instead of an MD5/etc.).
bigredbob over 13 years

They need to be human readable (so serial numbers are better than (68123erghdqwoc which would let me express say MD5 in only 16 characters). If I arbitrarily shorten SHA1 I get collisions just as regularly as I would for random numbers and run into the problem in solution 1. Like I said I'd like to keep the numbers between 1 and a billion.
James over 13 years

+1 Neat, don't have to store a list of numbers, and would be very satisfying for anyone who did reverse-engineer it! :)
Omran over 13 years

If someone had three serials which are issued right after each other, wouldn't it be easy to deduce the logic?
Glenn Maynard over 13 years

(Ugh. Why is this site making well-established users randomly enter captchas? It's absurd.)
Glenn Maynard over 13 years

You'd need to populate a table with a billion rows, which is a pretty inefficient. Don't forget that you need to remove each number as you use it.
Aspen van Wood over 13 years

That's what I would advocate too.
Aspen van Wood over 13 years

There is no guarantee that SHA hashes (especially shortened) are unique.
Aspen van Wood over 13 years

Since he is talking about serial numbers, and has a database table storing them, I assume collisions would be pretty annoying (as in "mixing up two different customers" perhaps).
ghostdog74 over 13 years

I think you should watch your tone. using a long hash is definitely a viable solution. If you are so clever , try to make SHA256 hash break then.
Andrew McGregor over 13 years

Sure, but if they're more like software keys, then that may be a mere annoyance, as opposed to a serious bug.
Aspen van Wood over 13 years

I'm sorry but suggesting to "take first 3/4/5 characters of the hash" when you want a genuinely unique number is stupid. Not to mention you have to take the hash of something anyway, so that something itself still has to be unique...
Khelben over 13 years

Just add other 3 digits more if you want to be EXTRA sure! ;-)
jbochi over 13 years

@Johan: Not if the prime numbers are really large. All modern cryptographic algorithms that I know rely on the fact that primes are hard to factorize.
Sjoerd over 13 years

@Johan: You are right. If you have three or more serials the next number would be fairly easy to guess, no matter what your primes are. I was mistaken with another algoritm.
balpha over 13 years

@jbochi: That has nothing to do with this very algorithm. Also: I'll factorize any prime number in O(1), without a computer. But I guess we know what you meant :)
Sjoerd over 13 years

With a multiplicative ring, the next number would be harder to predict. E.g. use 999998971 as prime, 1 as start, and 43 as primitive root. For a serial, multiply the previous serial by 43 modulo 999998971.
Sudhir Jonathan over 13 years

Hmmm. Not to get into a flame war here, but when what you want is a to be able to assign a different key to a lot of items, you could do it. It might also be worth looking into URL shortening services and Git, both of which have been doing this for quite some time now.
bigredbob over 13 years

The annoyance is that if I create a list, shuffle and store then it's deterministic with 0 chance of using an already used number, plus I don't have to use overly long numbers.
balpha over 13 years

@bigredbob: I get that. But 1) 12 digits isn't really overly long, and 2) I'm just saying that the drawbacks of the simplest solution aren't nearly as big as you think. Don't forget that the probabilities multiply.
Hans-Peter Störr over 11 years

Why should cryptographically strong exclude zero collisions? If you, for instance, use a sequence and encrypt each value with an algorithm that has output range 0-1000000000, you have both. It's not quite easy to find a strong algorithm that does this, though.
cjh almost 11 years

Please provide more information about the nature of your solution.