How to choose between UUIDs, autoincrement/sequence keys and sequence tables for database primary keys?

24,718

UUIDs

Unless these are generated "in increasing monotonic sequence" they can drastically hurt/fragment indexes. Support for UUID generation varies by system. While usable, I would not use a UUID as my primary clustered index/PK in most cases. If needed I would likely make it a secondary column, perhaps indexed, perhaps not.

Some people argue that UUIDs can be used to safely generate/merge records from an arbitrary number of systems. While a UUID (depending upon method) generally has an astronomically small chance of collision, it is possible to -- at least with some outside input or very bad luck :) -- generate collisions. I am of the belief that only a true PK should be transmitted between systems, which I would argue is not (or should not be) a database-generated UUID in most cases.

autoincrement/sequence keys and sequence tables

This really depends on what the database supports well. Some databases support sequences which are more flexible that a simple "auto-increment". This may or may not be desirable (or may be the only way for this kind of task simply, even). Sequence tables are generally more flexible yet, but if this kind of "flexibility" is needed I would be tempted to go back and visit the design-pattern, especially if it involves the use of triggers. While I dislike "limiting ORMs", that may also make a difference in choosing the "simpler" auto-increment or sequence types/database support.

Regardless of the method used, when using surrogate primary keys, the true primary key should still be identified and encoded into the schema.

In addition, I argue that "security compromises through exposing an auto-sequence PK" are a result of incorrectly exposing an internal database property. While a very simple way to handle CRUD operation, I believe there is a distinction between the internal keys and the exposed keys (e.g. pretty customer number).

Just my two cents.

Edit, additional replies to Tim:

I think the generated vs. true PK question is a very good one and one I need to consider also. I'd like UUIDs in general to the points you make. My hesitation was in size vs. an int/long. Was not aware of potential indexing de-optimizations, which is a much bigger concern for me.

I wouldn't really worry about the size -- if a UUID is best, then it's best. If it's not, then it's not. In the overall scheme the extra 12bytes over an int likely won't make much of a difference. SQL Server 2005+ supports the newsequentialid UUID generation function to avoid the fragmentation associated with normal UUID generation. The page discusses it some. I am sure that other databases have similar solutions.

And by "encoded into the schema", do you mean more than adding a uniqueness constraint?

Yes. The primary key doesn't have to be the only [unique] constraint. Just using a surrogate PK doesn't mean the database model should be compromised :-) Additional indexes can also be used to cover, etc.

And by "distinction between", are you saying that surrogate primary keys never leak out?

The wording in my initial post was a tad hard. It's not "never" so much as "if they do and it matters then that's another problem". Often times people complain of insecurity through guessable numbers -- e.g. if your order is 23 then there is likely an order 22 and 24, etc. If this is your "protection" and/or can leak sensitive information then the system is already flawed. (Separating internal and external ids does not inherently fix this issue and authentication/authorization is still required. However, it is one issue raised against using "sequential ids" -- I find encoding a nonce into distributed URLs handles this for my use-case rather well.)

More to what I really wanted to get across: Just because the surrogate PK id happens to be 8942 doesn't mean that it's order 8942. That is, keeping with the "some fields are internal only to db" design, the order "number" might be entirely unrelated on the surface (but fully supported in the DB model), such as "#2010-42c" or whatever makes sense for the business requirement(s). It is this external number that should be exposed in most cases.

I feel that sometimes the generated key is really the true primary key as other fields are mutable (eg. user may change email and username).

This may be the case within a database and I will not argue this statement. However, once again holding that the surrogate PK's are internal to the database, just make sure to only export/import tuples that can be well-identified. If the username/email may change, then this might very well include a UUID assigned upon account creation -- and could very well be the surrogate PK itself.

Of course, as with everything, remain open and fit the model to the problem, not the problem to the model :-) For a service like twitter, for instance, they use their own number generation schema. See Twitter's new ID generation. Unlike [some] UUID generation, the approach by twitter (assuming that all the servers are correctly setup) guarantees that none of the distributed machines/processes will ever generate a duplicate ID, requires only 64-bits, and maintains rough ordering (most significant bits are time-stamp). (The number of records generated by twitter may be in no way related to local requirements ;-)

Happy coding.

Share:
24,718
Tim
Author by

Tim

Updated on October 15, 2020

Comments

  • Tim
    Tim over 3 years

    I'm looking at the pros and cons of these three primary methods of coming up with primary keys for database rows.

    So assuming I am using a database that supports more than one of these methods, is there a simple heuristic to determine what the best option would be for me?

    How do considerations such a distributed/multiple masters, performance requirements, ORM use, security and testing have on the choice?

    Any unexpected drawbacks that one might run into?

    • Hans Westerbeek
      Hans Westerbeek over 13 years
      what kind of language / orm provider do you use?
    • Tim
      Tim over 13 years
      Java. Either Hibernate or Eclipselink. Though I'm really looking mostly for more generic points as to their pluses and minuses.
    • RC_Cleland
      RC_Cleland over 13 years
      The language or access method should not have any bearing on the primary key. The primary key will uniquly identify the object, entity, person, ect. If the primary key is wide, navrchar(25) for me, then use an integer subsitute as an identity column. think about the data which will be stored and model the storage for the data to provide the best OLAP, transaction speed, or reporting speed one needs.
    • Admin
      Admin over 13 years
      @RC_Cleland I used to be more of a "purist", even using compound PKs. Now I view the PK as internal to the database (e.g. an auto-int sequence is fine - GUID is overkill most cases) and impose the relationship requirements as separate unique indexes -- however, it is only the "proper PK" (what would be the PK if there was no auto-column) that I believe should be transmitted between systems, whatever it happens to be. Yet, unlike some, I don't view this as an "absolute" rule and accept (actually, quite enjoy) the different dynamics in varying databases. I also favor immutable records.
    • Hans Westerbeek
      Hans Westerbeek over 13 years
      yes it does have something to do with it... The debate on object identity in the ORM world is lengthy.
    • Tim
      Tim over 13 years
      @RC_Cleland I think there are also some use cases when using an ORM that might result in certain advantages/disadvantages to the options given, but I could be wrong.
  • Tim
    Tim over 13 years
    I think the generated vs. true PK question is a very good one and one I need to consider also. I'd like UUIDs in general to the points you make. My hesitation was in size vs. an int/long. Was not aware of potential indexing de-optimizations, which is a much bigger concern for me. And by "encoded into the schema", do you mean more than adding a uniqueness constraint? And by "distinction between", are you saying that surrogate primary keys never leak out? I feel that sometimes the generated key is really the true primary key as other fields are mutable (eg. user may change email and username).
  • Rick James
    Rick James over 7 years
  • Tianzhou
    Tianzhou almost 4 years