What are the best practices for using a GUID as a primary key, specifically regarding performance?

187,963

Solution 1

GUIDs may seem to be a natural choice for your primary key - and if you really must, you could probably argue to use it for the PRIMARY KEY of the table. What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to.

You really need to keep two issues apart:

  1. the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.

  2. the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.

By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way! I've personally seen massive performance gains when breaking up the previous GUID-based Primary / Clustered Key into two separate key - the primary (logical) key on the GUID, and the clustering (ordering) key on a separate INT IDENTITY(1,1) column.

As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.

Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so.

Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT with 2+ billion rows should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.

Quick calculation - using INT vs. GUID as Primary and Clustering Key:

  • Base Table with 1'000'000 rows (3.8 MB vs. 15.26 MB)
  • 6 nonclustered indexes (22.89 MB vs. 91.55 MB)

TOTAL: 25 MB vs. 106 MB - and that's just on a single table!

Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.

PS: of course, if you're dealing with just a few hundred or a few thousand rows - most of these arguments won't really have much of an impact on you. However: if you get into the tens or hundreds of thousands of rows, or you start counting in millions - then those points become very crucial and very important to understand.

Update: if you want to have your PKGUID column as your primary key (but not your clustering key), and another column MYINT (INT IDENTITY) as your clustering key - use this:

CREATE TABLE dbo.MyTable
(PKGUID UNIQUEIDENTIFIER NOT NULL,
 MyINT INT IDENTITY(1,1) NOT NULL,
 .... add more columns as needed ...... )

ALTER TABLE dbo.MyTable
ADD CONSTRAINT PK_MyTable
PRIMARY KEY NONCLUSTERED (PKGUID)

CREATE UNIQUE CLUSTERED INDEX CIX_MyTable ON dbo.MyTable(MyINT)

Basically: you just have to explicitly tell the PRIMARY KEY constraint that it's NONCLUSTERED (otherwise it's created as your clustered index, by default) - and then you create a second index that's defined as CLUSTERED

This will work - and it's a valid option if you have an existing system that needs to be "re-engineered" for performance. For a new system, if you start from scratch, and you're not in a replication scenario, then I'd always pick ID INT IDENTITY(1,1) as my clustered primary key - much more efficient than anything else!

Solution 2

I've been using GUIDs as PKs since 2005. In this distributed database world, it is absolutely the best way to merge distributed data. You can fire and forget merge tables without all the worry of ints matching across joined tables. GUIDs joins can be copied without any worry.

This is my setup for using GUIDs:

  1. PK = GUID. GUIDs are indexed similar to strings, so high row tables (over 50 million records) may need table partitioning or other performance techniques. SQL Server is getting extremely efficient, so performance concerns are less and less applicable.

  2. PK Guid is NON-Clustered index. Never cluster index a GUID unless it is NewSequentialID. But even then, a server reboot will cause major breaks in ordering.

  3. Add ClusterID Int to every table. This is your CLUSTERED Index... that orders your table.

  4. Joining on ClusterIDs (int) is more efficient, but I work with 20-30 million record tables, so joining on GUIDs doesn't visibly affect performance. If you want max performance, use the ClusterID concept as your primary key & join on ClusterID.

Here is my Email table...

CREATE TABLE [Core].[Email] (
    [EmailID]      UNIQUEIDENTIFIER CONSTRAINT [DF_Email_EmailID] DEFAULT (newsequentialid()) NOT NULL,        
    [EmailAddress] NVARCHAR (50)    CONSTRAINT [DF_Email_EmailAddress] DEFAULT ('') NOT NULL,        
    [CreatedDate]  DATETIME         CONSTRAINT [DF_Email_CreatedDate] DEFAULT (getutcdate()) NOT NULL,      
    [ClusterID] INT NOT NULL IDENTITY,
    CONSTRAINT [PK_Email] PRIMARY KEY NonCLUSTERED ([EmailID] ASC)
);
GO

CREATE UNIQUE CLUSTERED INDEX [IX_Email_ClusterID] ON [Core].[Email] ([ClusterID])
GO

CREATE UNIQUE NONCLUSTERED INDEX [IX_Email_EmailAddress] ON [Core].[Email] ([EmailAddress] Asc)

Solution 3

I am currently developing an web application with EF Core and here is the pattern I use:

All my classes (tables) have an int PK and FK. I then have an additional column of type Guid (generated by the C# constructor) with a non clustered index on it.

All the joins of tables within EF are managed through the int keys while all the access from outside (controllers) are done with the Guids.

This solution allows to not show the int keys on URLs but keep the model tidy and fast.

Solution 4

This link says it better than I could and helped in my decision making. I usually opt for an int as a primary key, unless I have a specific need not to and I also let SQL server auto-generate/maintain this field unless I have some specific reason not to. In reality, performance concerns need to be determined based on your specific app. There are many factors at play here including but not limited to expected db size, proper indexing, efficient querying, and more. Although people may disagree, I think in many scenarios you will not notice a difference with either option and you should choose what is more appropriate for your app and what allows you to develop easier, quicker, and more effectively (If you never complete the app what difference does the rest make :).

https://web.archive.org/web/20120812080710/http://databases.aspfaq.com/database/what-should-i-choose-for-my-primary-key.html

P.S. I'm not sure why you would use a Composite PK or what benefit you believe that would give you.

Solution 5

If you use GUID as primary key and create clustered index then I suggest use the default of NEWSEQUENTIALID() value for it.

Share:
187,963
VAAA
Author by

VAAA

Updated on July 20, 2022

Comments

  • VAAA
    VAAA almost 2 years

    I have an application that uses GUID as the Primary Key in almost all tables and I have read that there are issues about performance when using GUID as Primary Key. Honestly, I haven't seen any problem, but I'm about to start a new application and I still want to use the GUIDs as the Primary Keys, but I was thinking of using a Composite Primary Key (The GUID and maybe another field.)

    I'm using a GUID because they are nice and easy to manage when you have different environments such as "production", "test" and "dev" databases, and also for migration data between databases.

    I will use Entity Framework 4.3 and I want to assign the Guid in the application code, before inserting it in the database. (i.e. I don't want to let SQL generate the Guid).

    What is the best practice for creating GUID-based Primary Keys, in order to avoid the supposed performance hits associated with this approach?

  • VAAA
    VAAA over 11 years
    Totally agree!! But that means that if I have a GUID as PK or a Composite PK with GUID and other field is going to be the same right?
  • Matt
    Matt over 11 years
    The PK (index) would be made up of the two columns, but unless you have some business specific reason for doing this, it seems unnecessary.
  • Matt
    Matt over 11 years
    BTW this question is one of the most polarizing and debated questions out there and therefore extremely difficult to get an answer for that you will feel 100% comfortable with. Either method comes with trade-offs, so good luck :)
  • Andrew Theken
    Andrew Theken about 10 years
    This is a great answer, one thing I'd mention is that being able to generate the key before insert is frequently useful. Using "newsequentialid()" can help with the clustering, but that requires an additional round-trip to SQL. So another benefit of the "surrogate key" approach is that you can generate new ids, client-side, with fewer index fragmentation concerns.
  • Fred Lackey
    Fred Lackey almost 10 years
    Just curious. Would storing the GUID as a char(32) or char(36) PK solve this? Why / why not?
  • marc_s
    marc_s almost 10 years
    @FredLackey: no - same problem - since the key is totally random, excessive index fragmentation will happen. Just don't do it.
  • pinkfloydx33
    pinkfloydx33 over 9 years
    The way I read this is that having both a non clustered uniqueidentifier column and the int identity column, FK's should also be uniqueidentifier? If you do that, when would you actually use the identity column directly, or would you not?
  • A_L
    A_L about 9 years
    @marc_s If my GUID pk is non-clustered and I use this to join my tables (for portability across databases) then the clustered int index is redundant right? Do you foresee any problem with having no unique clustered index and doing as I describe?
  • marc_s
    marc_s about 9 years
    @A_L: yes, a table without a clustering index is a heap - and that's really really bad for many reasons. Don't just toss your clustered index - it's important on so many levels!
  • Nicolas Belley
    Nicolas Belley almost 9 years
    Little question, should the GUID now be used on joins, or the int id? My instinct tells me the GUID should be used, but I fail to see a technical problem using the int id...
  • marc_s
    marc_s almost 9 years
    @NicolasBelley: the int is probably a bit more efficient, since it's 4x smaller in sheer size ...
  • Nicolas Belley
    Nicolas Belley almost 9 years
    @marc_s but in a replication scenario, if the int column is identity, shouldn't we use the GUID since the int column can repeat itself across devices?
  • Derek Greer
    Derek Greer almost 9 years
    Great information, but as with most things, the right choice depends upon the needs of your application. I do however feel like the article would be more balanced with discussion of the GuidComb strategy. The "cheap space isn't the point" article also has some great information, but I'm not sure the test scenarios are optimal for application developers. Using Identity over generating the key app side invariably leads to round-trips. I think comparisons around typical DDD object graph persistence scenarios would help give a more accurate picture to help with the decision process.
  • Nick.McDermaid
    Nick.McDermaid almost 8 years
    This is an old thread, but might I add: don't just use a useless arbitrary INT as the clustering key. Use something useful like an incerementing date that is actually searched on, that has some relation to the data you're storing. You only get one clustering key, and if you choose the right one you'll get good performance
  • Phil
    Phil over 6 years
    Could you explain the PK_Email constraint? Why you have ... NonClustered(EmailID ASC) instead of ...Nonclustered(ClusterID ASC) ?
  • Robert J. Good
    Robert J. Good over 6 years
    You bet. Two main things going on with indexes: 1. Clustered on ClusterID - Orders your table on disk (0% fragmentation). 2. NonClustered on EmailID - Indexes the EmailID field to speed up GUID ID lookups. A GUID field lookup behaves string-ish, so a EmailID lookup would be slow without the index.
  • Kip ei
    Kip ei over 6 years
    @marc_s although outdated, I totally agree with Nick.McDemermaid, but maybe I am missing something?! I am very curious about you're opinion an what Nick.McDemermaid has to say!!
  • marc_s
    marc_s over 6 years
    @Kipei: the main issues is the I-F you have such a natural value - then yes, you can use it as a primary key. BUT: values like DATETIME for instance are NOT useful for a clustering key, since they have a 3.33ms accuracy only, and thus duplicates can exist. So in such a case, you *still need an INT IDENTITY instead - therefore, I typically use that by default, since frmo my 20+ years of experience, a really usable natural key hardly ever really exists ....
  • Anyname Donotcare
    Anyname Donotcare almost 6 years
    I currently work on EF6 Code first web application (sql server db 2012 0r 2017) and want to apply DDD concepts so I need a unique key in advance before inserting in DB, Many recommends UUID But I'm afraid of the performance issue, Could u help me please to take the right decision. Should I use GUID instead of auto increments key
  • marc_s
    marc_s almost 6 years
    @AnynameDonotcare: I'd still strongly recommend using an auto-increment INT or BIGINT. What makes you think you must know the ID value before saving?? I fail to see how DDD or any other design approach mandates this......
  • Anyname Donotcare
    Anyname Donotcare almost 6 years
    I've learned that I should keep my model in a valid state and when creating a new instance of a class it's recomended to put all required attributes as constructor parameters. So I converted all my IDs to GUID instead of int, Should I use both of them ? one as auto increment to solve the technical issue and the GUID to solve the DDD issue ?
  • Allen Wang
    Allen Wang almost 6 years
    Is there anything you need to do to configure the integer pK as clustered, like data annotations, or is it just automatically configured?
  • Trong Phan
    Trong Phan almost 5 years
    What the name of the property do you use for Guid one?
  • Dale K
    Dale K almost 5 years
    @RobertJ.Good I've seen this method discussed before i.e. adding a surrogate int key to cluster on. But I can't find anywhere which shows the performance gain in having a surrogate key clustered index over using a heap. Do you have any links to benchmark data?
  • Dale K
    Dale K almost 5 years
    @marc_s I've seen this method of adding a surrogate int key to cluster on discussed before. But I can't find anywhere which shows the performance gain in having a surrogate key clustered index over using a heap. Do you have any links to benchmark data? I know everyone says a heap is bad and performs worse, but also it seems there are many opinions out there.
  • marc_s
    marc_s almost 5 years
    @DaleBurrell: check out sqlskills.com/blogs/kimberly/… - and any blog post by Kimberly Tripp for that matter - she's the "Queen of indexing" and I'm pretty sure there's performance testing numbers in her blog somewhere to show this very clearly
  • Dale K
    Dale K almost 5 years
    Thanks @marc_s - I have read most of her stuff and didn't see data to back that up. But will look again. on your link she says "Oh – and if you arbitrarily add a column to use for clustering (maybe not as the primary key) that can help" - but doesn't expand on that.
  • Robert J. Good
    Robert J. Good over 4 years
    Hi @DaleBurrell, the clustered index is to prevent table fragmentation. Performance gain happens as the table naturally grows in order on disk, with low fragmentation.
  • dariol
    dariol over 4 years
    @RobertJ.Good Is that a web application? What are you using in urls/hrefs? guid or int?
  • Robert J. Good
    Robert J. Good over 4 years
    @dariol There are security implications, so drop the newsequentialid() and expose a NewId() Guid if no other choice (definitely not the Int.) I'd recommend a claims based and/or token approach, or even brute-force encryption for any identifiers. In short, avoid exposing any Ids, and avoid any value that can be guessed, or worse +1 to find the next record.
  • jonnarosey
    jonnarosey over 4 years
    Can you provide any logic or evidence to back up this claim? I'm struggling to see how a sequential id might compromise security.
  • Zonus
    Zonus over 4 years
    Sure, if you know ID numbers are integer you can guess sequentially records in a DB. So if you query a single item, you can say that the next item is pk + 1. If you have random GUIDS, it will not follow a pattern. It would be nearly impossible to query other records than the one you previously queried (And know the PK).
  • jfrobishow
    jfrobishow over 4 years
    @RobertJ.Good when you mention "In this distributed database world, it is absolutely the best way to merge distributed data." do you mean you eventually merge the records to a master database? Wondering what happens the the clusterID then, how do you handle duplicates once you merge the "source"?
  • jonnarosey
    jonnarosey over 4 years
    If a hacker can query your database you're already compromised, I fail to see how sequential id's make the situation worse.
  • Zonus
    Zonus over 4 years
    No. That is not true. I do a lot with pen testing and am well known for catching hackers. Do I always use GUIDs and not int, no. But if I need to protect data I will rely on data techniques to protect as well as programming and there's a LOT of reasons to do this. Take this older example. When you see a URL like www.domain.com/edit/1012 that makes me cringe. The 1012 is the key of the record and can be switched out. With reactive sites this changes a bit but can still be seen. And it's complex to protect records at that point. Know your data and protect what needs to be protected.
  • jonnarosey
    jonnarosey over 4 years
    If a user can switch out 1012 for another number and see data they shouldn't then there is a very serious security issue, that issue isn't caused by the primary key choice but it is exacerbated by it. I do take your point, thank you for spelling it out.
  • Zonus
    Zonus over 4 years
    I highlight this as a requirement for any app that deals with hypersensitive data like HIPAA or SOX. Relying on only programming as a security restraint is dangerous. It's best to use multiple methods when protecting sensitive data.
  • Panos Roditakis
    Panos Roditakis over 4 years
    You may use a GUID to locate a record at the web page, that is not the PK of the table. Using query parameter in a website should not define how you structure your DB schema. The PK has nothing to do with input and parameters in UI or backend system.
  • genuinefafa
    genuinefafa almost 4 years
    why would you do that?
  • Cesar Alvarado Diaz
    Cesar Alvarado Diaz over 3 years
    If you receive the Guid in the Controller, how do you access it if you don't know the associated int? do you do the sequential search in the Guid column?
  • Argeman
    Argeman about 3 years
    While your answer is technically correct, it is not an answer to the question.
  • Jeff Moden
    Jeff Moden almost 3 years
    I mostly agree but "It Depends". The thing is that even IDENTITY and Date Columns used for Clustered Indexes have sometimes even worse problems with fragmentation because of the horrible habit of doing an INSERT and then following that with an "ExpAnsive" update on the rows just inserted. Massive fragmentation is guaranteed and instantaneous. People have to design correctly even if they avoid Random GUIDs. Oddly enough, Random GUID Clustered Indexes about the Insert/Update fragmentation problem for months at a time instead of it being instantaneous.
  • Dave Cousineau
    Dave Cousineau over 2 years
    @jfrobishow I would guess that the clusterID would just be an auto-generated identity both on clients and on the server. the actual value is basically meaningless so does not need to match between sides. (he mentions joining on it, which I think should still work as long as you are not joining across the client/server boundary).
  • INF.TA
    INF.TA over 2 years
    @dale_k. Sequential clustering index helps during insert because inserts occur at only the end of the linked list of leaf nodes in the storage B-tree, and helps during index scans because the logical ordering of the records matches the physical ordering of the data on disk. Ref. If your scenarios don't involve these operations, you may not benefit much from an ever-increasing clustered index key.
  • variable
    variable over 2 years
    What purpose does creating second index that's defined as CLUSTERED serve?
  • Caius Jard
    Caius Jard over 2 years
    This is "security by obscurity", which is not an adequate substitute for proper segregation of data security bounds. A correctly written system won't allow the owner of record 1012 to access record 1013 if they don't own it so the sequentiality matters not.
  • Quinton Smith
    Quinton Smith almost 2 years
    This split will unfortunately not work if you require a spatial index on the table too. From MS docs - Spatial indexes require the table to have a clustered primary key.
  • Norbert Norbertson
    Norbert Norbertson almost 2 years
    I don't really understand the clusterId thing if the point is to make like easier for merges. Won't you get conflicts when you merge tables that have this Id field?