Do clustered indexes have to be unique?

sql database tsql sql-server-2008 clustered-index

76,550

Solution 1

They don't have to be unique but it certainly is encouraged.
I haven't encountered a scenario yet where I wanted to create a CI on a non-unique column.

What happens if you create a CI on a non-unique column

If the clustered index is not a unique index, SQL Server makes any duplicate keys unique by adding an internally generated value called a uniqueifier

Does this lead to bad performance?

Adding a uniqueifier certainly adds some overhead in calculating and in storing it.
If this overhead will be noticable depends on several factors.

How much data the table contains.
What is the rate of inserts.
How often is the CI used in a select (when no covering indexes exist, pretty much always).

Edit
as been pointed out by Remus in comments, there do exist use cases where creating a non-unique CI would be a reasonable choice. Me not having encountered one off those scenarios merely shows my own lack of exposure or competence (pick your choice).

Solution 2

I like to check out what The Queen of Indexing, Kimberly Tripp, has to say on the topic:

I'm going to start with my recommendation for the Clustering Key - for a couple of reasons. First, it's an easy decision to make and second, making this decision early helps to proactively prevent some types of fragmentation. If you can prevent certain types of base-table fragmentation then you can minimize some maintenance activities (some of which, in SQL Server 2000 AND less of which, in SQL Server 2005) require that your table be offline. OK, I'll get to the rebuild stuff later.....

Let's start with the key things that I look for in a clustering key:

* Unique
* Narrow
* Static

Why Unique? A clustering key should be unique because a clustering key (when one exists) is used as the lookup key from all non-clustered indexes. Take for example an index in the back of a book - if you need to find the data that an index entry points to - that entry (the index entry) must be unique otherwise, which index entry would be the one you're looking for? So, when you create the clustered index - it must be unique. But, SQL Server doesn't require that your clustering key is created on a unique column. You can create it on any column(s) you'd like. Internally, if the clustering key is not unique then SQL Server will “uniquify” it by adding a 4-byte integer to the data. So if the clustered index is created on something which is not unique then not only is there additional overhead at index creation, there's wasted disk space, additional costs on INSERTs and UPDATEs, and in SQL Server 2000, there's an added cost on a clustereD index rebuild (which because of the poor choice for the clustering key is now more likely).

Source: Ever-increasing clustering key debate - again!

Solution 3

Do clustered indexes have to be unique?

They don't, and there are times where it's better if they're not.

Consider a table with a semi-random, unique EmployeeId, and a DepartmentId for each employee: if your select statement is

SELECT * FROM EmployeeTable WHERE DepartmentId=%DepartmentValue%

then it's best for performance if the DepartmentId is the clustered index even though (or even especially because) it's not the unique index (best for performance because it ensures that all the records within a given DepartmentId are clustered).

Do you have any references?

There's Clustered Index Design Guidelines for example, which says,

With few exceptions, every table should have a clustered index defined on the column, or columns, that offer the following:

Can be used for frequently used queries.

Provide a high degree of uniqueness.

Can be used in range queries.

My understanding of "high degree of uniqueness" for example is that it isn't good to choose "Country" as the clusted index if most of your queries want to select the records within a given town.

76,550

Author by

littlegreen

Updated on March 18, 2021

Comments

littlegreen about 3 years

What happens if a clustered index is not unique? Can it lead to bad performance because inserted rows flow to an "overflow" page of some sorts?

Is it "made" unique and if so how? What is the best way to make it unique?

I am asking because I am currently using a clustered index to divide my table in logical parts, but the performance is so-so, and recently I got the advice to make my clustered indexes unique. I'd like a second opinion on that.
Remus Rusanu over 13 years

+1 cause everything you say is correct, but just wanted to add: non-unique CI are quite common when range scans on particular (non-unique) column is the prevalent access pattern.
Lieven Keersmaekers over 13 years

@Remus Rusanu: I was thinking about adding a disclaimer to my scenario statement like but that doesn't mean anything. Thanks for pointing out a scenario where it could be usefull.
littlegreen over 13 years

Yeah that's what I thought until now, but I also get the exact opposite advice, so I wonder which is true. Do you have any references?
littlegreen over 13 years

@Remus: so you mean the niche situation where you have a non-unique column like 'Departmentid' where you query something like 'DepartmentId BETWEEN 1 and 100'? edit ah i see what you mean, yes a date column in a logging table is a good example too.
littlegreen over 13 years

A question though, the Queen recommends a newsequentialid to uniquify the data, but SQL Server generates its own uniquifier if you don't specify it. Is there then still any reason to add your own sequential id?
marc_s over 13 years

@littlegreen: she says if you insist on using GUID's (which are really really bad for use in an clustering index), then at least use newsequentialid() to get a almost sequentialized GUID. But yes: if you add your own unique ID (I always prefer INT IDENTITY), then you have that value at hand, and you can use it (e.g. to establish a FK relationship). The SQL Server added uniquefiers are invisible to you and thus they're only overhead you can't make use of.
ChrisW over 13 years

@littlegreen I edited my answer to try to answer your question.
littlegreen over 13 years

Thanks. Yeah okay, I see your point. But if you are regularly inserting a whole country at once, a clustered index on (country, town) would seem cumbersome to me since it requires sorting the data. On the other hand, a sort before insert wouldn't be that much of trouble...
littlegreen over 13 years

I see. Well that would be an argument in favour of a (CompanyID, DepartmentID, id INT IDENTITY) clustered index instead of just the first two. Thanks!
marc_s over 13 years

@littlegreen: better yet - make your clustered index only on (ID INT IDENTITY) and put the other fields - if needed - into a separate, non-clustered index. The clustered index should be as small as possible - after all, the clustered index columns are being added to each and every entry of each and every non-clustered index on that table, too - so don't waste your bytes with a wide clustered index!
littlegreen over 13 years

Yes but then I lose the benefit of all my department data being grouped, and me being able to insert/delete/retrieve a whole department at once. My data will become scattered and inserts/deletes on whole departments or even whole companies will be slow. My queries only ever run on a single company at once and frequently a whole dataset needs to be updated.
marc_s over 13 years

@littlegreen: true - I guess you'll just have to do some performance testing and see which option works out the best for you.
littlegreen over 13 years

I'll do that. I was also considering table partitioning.. but now I have some new input to think about. Thanks!
Admin over 11 years

Surely in your example, a unique clustered index on {DepartmentID, EmployeeID} would be preferable? Why have the system create a uniqueifier when your existing field would provide uniqueness with less overhead (probably a four-byte INT) and may let you run a few more queries within the index alone?
Shayan C over 5 years

Hey, I've got an event stream table where multiple rows exist with same "AggregateId" which is a column fo type GUID. The only queries performed on table are to get all events for a given AggregateId. I'm wondering should this be a clustered index or a non clustered index?
Lieven Keersmaekers over 5 years

@ShayanC - If performance of retrieval is your primary goal, I'd make it a CI to likely save on IO when retrieving all rows for a given ID. As with all performance scenario's though, the only sure way is to measure.
Triynko almost 4 years

I use a non-unique clustering key for fetching user-boundary data for millions of users. Queries often fetch a handful (1-100) of records for a single user, so the data is clustered on {business id, user id}. That keeps each user's data clustered together on a single data page, making the fetches very i/o efficient when fetching all records for a single user. Incidentally, this also eliminates the need for any kind of secondary indexes for any kind of user data query, because all the data for a user is on a single data page, such that any scan takes place entirely in-memory.