Generate unique hashes for Django models

Solution 1

I do not like this bit:

uuid = uuid[:5]

In the best case (the values are uniformly distributed) you will get a collision with probability close to 0.5 after only about a thousand elements!

This is because of the birthday problem. In brief, the probability of a collision reaches 0.5 once the number of elements is on the order of the square root of the number of possible labels (about 1.18 times the square root, to be more precise).

Five hex characters give 16^5 = 0x100000 ≈ 10^6 labels (different numbers), so after roughly 1,000 generated values you should expect to start seeing collisions.
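
To put numbers on the birthday-problem estimate, here is a small sketch using the standard approximation p ≈ 1 - exp(-n(n-1)/(2N)). The function name collision_probability is made up for illustration, and the labels are assumed to be uniformly distributed.

```python
import math

def collision_probability(n_items, n_labels):
    """Approximate chance that at least two of n_items uniform random labels coincide."""
    # Standard birthday-problem approximation: 1 - exp(-n(n-1) / (2N))
    return 1.0 - math.exp(-n_items * (n_items - 1) / (2.0 * n_labels))

labels = 16 ** 5  # five hex characters, as in uuid[:5]
for n in (500, 1000, 1250, 2000):
    print(n, round(collision_probability(n, labels), 3))
```

For five hex characters this puts the 50% point a little above 1,200 values, which is the same order of magnitude as the square-root rule of thumb.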

Even if you set length to -1, so that almost the whole string is kept, you still have a problem here:

str(random.random())[2:]

You will start having collisions after about 3 * 10^6 values (the same calculation applies).

I think your best bet is to use a UUID, which is far more likely to be unique. Here is an example:

>>> import uuid
>>> uuid.uuid1().hex
'7e0e52d0386411df81ce001b631bdd31'

Update: If you do not trust the math, just run the following sample to see a collision:

>>> import hashlib
>>> len(set(hashlib.sha256(str(i).encode()).hexdigest()[:5] for i in range(0, 2000)))
1999  # it would print 2000 if there were no collisions

Solution 2

The ugly:

import random

From the documentation:

This module implements pseudo-random number generators for various distributions.

If anything, please use os.urandom. From its documentation:

Return a string of n random bytes suitable for cryptographic use.

This is how I use it in my models:

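import os
from binascii import hexlify

from django.db import models

def _createId():
    # hexlify returns bytes on Python 3; decode so the CharField gets a 32-character str
    return hexlify(os.urandom(16)).decode("ascii")

class Book(models.Model):
    id_book = models.CharField(max_length=32, primary_key=True, default=_createId)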
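
As a possible alternative on Python 3.6 and later (an assumption about your environment, not part of the original answer), the standard library's secrets module also draws on the operating system's randomness source and returns a str directly. A minimal sketch, with create_id as a hypothetical helper name:

```python
import secrets

def create_id():
    # 16 random bytes rendered as 32 hexadecimal characters (a str, not bytes)
    return secrets.token_hex(16)

new_id = create_id()
print(len(new_id))
```

A function like this could be passed as the default of a 32-character CharField in the same way as _createId.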

Solution 3

Django 1.8+ has a built-in UUIDField. Here's the suggested implementation, using the standard library's uuid module, from the docs:

import uuid
from django.db import models

class MyUUIDModel(models.Model):
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    # other fields

For older Django versions, you can use the django-uuidfield package.

Solution 4

Use your database engine's UUID support instead of making up your own hash. Almost everything beyond SQLite supports UUIDs, so there's little reason not to use them.

Author: lprsd

I am a curious learner! You should follow me on twitter as @lprsd_

Updated on July 09, 2022

Comments

  • lprsd
    lprsd almost 2 years

    I want to use unique hashes for each model rather than ids.

    I implemented the following function to use it across the board easily.

    import random, hashlib
    from base64 import urlsafe_b64encode

    def set_unique_random_value(model_object, field_name='hash_uuid', length=5, use_sha=True, urlencode=False):
        while True:
            uuid_number = str(random.random())[2:]
            # hashlib needs bytes on Python 3, so encode before hashing
            uuid = hashlib.sha256(uuid_number.encode()).hexdigest() if use_sha else uuid_number
            uuid = uuid[:length]
            if urlencode:
                uuid = urlsafe_b64encode(uuid.encode()).decode()[:-1]
            hash_id_dict = {field_name: uuid}
            try:
                model_object.__class__.objects.get(**hash_id_dict)
            except model_object.__class__.DoesNotExist:
                setattr(model_object, field_name, uuid)
                return
    

    I'm seeking feedback: how else could I do it? How can I improve it? What is good, bad, and ugly about it?

    • nikola
      nikola over 14 years
      Could you please clarify: Do you need random hashes, or unique-across-time-and-space hashes? I ask because often users just want the former but use the terms "unique" and "random" interchangeably.
    • Thomas Schwärzl
      Thomas Schwärzl over 11 years
      Could you post an update of how you fixed it? I'm searching for a solution.
    • dalore
      dalore over 9 years
      That's crazy to generate a unique id. If doing uuid just generate a uuid and don't change the length or anything. Set your model to use that field as the primary key. Just generate a uuid on save, don't even worry about collisions.
  • nikola
    nikola over 14 years
    The birthday problem actually applies to random number generation. However, Python's uuid package is not specifically about random-number generation. In fact, uuid1() from your example is nowhere near random in the cryptographically secure sense. Just pointing this out in case someone gets the idea to equate Python's uuid package with random number generation.
  • Piotr Czapla
    Piotr Czapla over 14 years
    One thing to note is that urandom is much slower than the pseudo-random generator, so if you don't need it for cryptographic reasons it may not be worth using. On my Mac OS X machine it is about 21 times slower. Consider:
    >>> timeit.Timer('import random; random.random()').timeit(100000)
    0.1538231372833252
    >>> timeit.Timer('import os; os.urandom(2)').timeit(100000)
    3.1858959197998047
  • Piotr Czapla
    Piotr Czapla over 14 years
    I've just checked that uuid is even slower :)
  • sherbang
    sherbang over 11 years
    This is dangerous! If you have a key collision then your new record will silently overwrite your existing record.
  • nikola
    nikola over 11 years
    No, primary keys are required to be unique, so in case of a collision you'll get an IntegrityError. Nothing silent about that one.
  • Noe Nieto
    Noe Nieto over 11 years
    It would be nice to see an example of how to do that with Django's ORM.
  • Tan Nguyen
    Tan Nguyen over 10 years
    Well, it is not a "Django" thing. In PostgreSQL you have this: postgresql.org/docs/8.3/static/datatype-uuid.html. I think other databases have something similar.
  • Flavian Hautbois
    Flavian Hautbois over 9 years
    You should use the uuid package. At least the implementations of uuid1, uuid3, uuid4 and uuid5 follow a published standard (RFC 4122).
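
On the key-collision exchange in the comments above, here is a minimal sketch of what happens at the database level when the same primary key is inserted twice. It uses the standard library's sqlite3 module; the book table and its columns are hypothetical, chosen only to mirror the Book model. Whether an ORM turns a save into an INSERT or an UPDATE is a separate question that is worth checking for your Django version.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (id_book TEXT PRIMARY KEY, title TEXT)")

key = uuid.uuid4().hex
conn.execute("INSERT INTO book VALUES (?, ?)", (key, "first"))

try:
    # A second INSERT with the same primary key violates the uniqueness constraint
    conn.execute("INSERT INTO book VALUES (?, ?)", (key, "second"))
    collided = False
except sqlite3.IntegrityError:
    collided = True

# The original row is untouched after the failed insert
titles = [row[0] for row in conn.execute("SELECT title FROM book")]
print(collided, titles)
```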