Are Surrogate Primary Keys needed on a Fact table in a Data Warehouse?

13,605

Solution 1

Database table without primary key seems like a poor design choice and making lots of room for different types of anomalies i.e. how would you delete or update single record in such table?

Solution 2

A data warehouse is not necessarily a relational data store, although you may choose to make it one, so relational definitions don't necessarily apply.

A primary key is only required if you want to do something with the data that requires a unique identifier (like trace it to a source, but that's not always required or necessary or even possible anyway); and data in a data warehouse can often be used in ways that don't require primary keys. Specifically, you may not need to distinguish rows from each other. Most often for constructing aggregate values.

Time is not a required dimension in constructing data warehouse tables.

It may be psychologically uncomfortable, and wasted space is a trivial issue, but your colleague is correct - PKs aren't necessary.

Solution 3

An identity type column is a "surrogate" key that replaces one of your "candidate" keys (simply put). Adding a surrogate key columns adds nothing if you can't identify a row without it. Which requires a candidate key.

Solution 4

You should at least have a natural key on the fact table so you can identify rows and reconcile them against source or track changes where this is necessary.

On SQL Server an identity column gives you a surrogate key for free and on other systems using sequences (e.g. Oracle) it can be added fairly easily. Surrogate fact table keys can be useful for various different reasons. Some possible applications are:

Some tools like to have numeric keys on fact tables, preferably monotonically increasing ones. An example of this is MS SQL Server Analysis Services, which really likes to have a numeric, monotonically increasing key for fact tables used to populate measure groups. This is especially required for incremental loads.
If you have any relationships between fact tables (for example a written - earned premium breakdown for those familiar with Insurance) then a synthetic key is helpful here.
If you have dimensions living in a M:M relationship with a fact table (e.g. ICD codes) then a numeric key on the fact table simplifies this.
If you have any self-join requirements for transactions (e.g. certain transactions being corrections to others) then a synthetic key will simplify working with these.
If you do contra-restate operations within your data warehouse (i.e. handle changes to transactional data by generating reversals and re-stating the row) then you can have multiple fact table rows for the same natural key.

Otherwise, if you won't have anything joining to your fact table in a 1:M relationship then a synthetic key probably won't be used for anything.

Solution 5

I would agree with you.

"I was told that there is no set of columns in the table that would uniquely identify a record, even if all the columns were selected." - this seems to break something fundamental about relational databases as I understand them.

A fact consists of additive values plus foreign keys to dimensions. Time is an obvious dimension that is common to every dimensional model that I know. If nothing else, a composite key that contains timestamp would certainly be unique enough.

I wonder if your DBAs have much knowledge about dimensional modeling. It's a different way of thinking from the normal relational, transactional style.

View more solutions

13,605

Author by

Chad

Updated on June 04, 2022

Comments

Chad almost 2 years

When I asked our DB designers why our Fact table do not have a PK, I was told that there is no set of columns in the table that would uniquely identify a record, even if all the columns were selected. Whenb I suggested that we an an identity column in that case I was told that "I'd just be wasting space and that it wasn't needed."

My feeling is that every table in the source system should have a PK, even if it is an identity column. Given that the data warehouse (DW) is a recipient of data from other system-how would I otherwise be able to ensure that the data in the DW accurately reflects what is in the source system if there is no way to tie individual records? If you have a runaway load program that screws up data and has run for a week, how would you reconcile the differences with a live transaction source system w/o some sort of unique constraint to compare?
duffymo almost 15 years

Time isn't required, but I haven't seen many that aren't concerned with history.
dkretz almost 15 years

Even for the ones that are, the resolution of the timestamp varies - often only as precise as a year - with many legitimate identical data points having the same value.
Chad almost 15 years

I agree with le dorfier. Hence, time doesn't make transactions necessarily unique.
Chad almost 15 years

"The only thing it would do is protect you against a rogue ETL process" Isn't that consideration alone enough to warrant having a PK?
Walter Mitty almost 15 years

A surrogate autogenerated PK doesn't protect against a rogue ETL process. Some warehouse DBAs trust the ETL programs more than they trust random application programs. That trust might be misplaced, so your point is worth considering.
adolf garlic almost 15 years

Fact table is dimensional database, not relational database
adolf garlic almost 15 years

Sorry, I meant in the sense of having an "ID" column.
duffymo almost 15 years

To be precise about it, a dimensional schema is a particular relational style that uses a combination of dimension and fact tables in a star or snowflake arrangement to represent data.
Chad almost 15 years

I think I agree. That suggests to me that the source transaction system should be passing the logical key (the keys that the end user views as uniquely identifying a record.) How else can you reconcile differences between your Warehouse and source transaction system?
Pondlife over 11 years

But in this case how would you know that that the rows were actually valid data or not? If all attributes are unknown, then what use is the data?
Erwin Smout over 11 years

Fact : that link you provided gives me nothing else than http 404.
David Aldridge about 11 years

In Oracle you'd use the ROWID, so the relevance of that argument varies by RDBMS
David Aldridge about 11 years

Maybe true, but in some cases the uniqueness might rely on a load or source system timestamp as well, and the dimension keys are not guaranteed unique.
David Aldridge about 11 years

You might have cases where all the values are known but you still have duplicates simply because it is more efficient to insert a new record than it is to search for the existence of a record with the same set of n dimensional values and update it.
David Aldridge about 11 years

"You should at least have a natural key on the fact table so you can identify rows and reconcile them against source or track changes where this is necessary" ... That natural key not of course always being unique in the fact table.
David Aldridge about 11 years

This might be true, but the dimension keys might not form a candidate key.
David Aldridge about 11 years

This implies that when a new row is to be added to the fact table then a search should be made to see whether there already exists a row with the same set of unique dimensional values, which could be a tremendously costly business unless a single hashed value based on the keys is also calculated, stored and used for the lookup. I've never really felt to urge to do this, unless it was absolutely required that the row be updated, which is generally not the case as the metrics in fact tables tend to be additive, and a new row (with duplicate key values of the existing row) can be safely used.
David Aldridge about 11 years

The tracking of source system rows to the fact table is surely a different matter to the provision of a unique key on the fact table itself.
Bill Karwin about 11 years

@DavidAldridge, every insert to a non-DW table in an RDBMS checks the table's primary key to ensure against inserting a duplicate row. Primary keys are commonly implemented as indexes, so the check is O(log n). Are you saying DW tables don't do this? Or that it is somehow more costly?
David Aldridge about 11 years

It's more costly than not checking for the presence of a duplicate. It's also often not required on a DW fact table because the ETL process can check to see if an entire batch of records (eg. "Sales for Region 3 for 2012-01-03") has previously been loaded or not. You therefore wouldn't have to check every record. Also, having records with the same set of dimension values is not a problem if every query against the fact table is an aggregation.
Bill Karwin about 11 years

@DavidAldridge, fair enough. It would be interesting to see what the real difference in performance is. It is a good justification for using a specialized storage engine for DW, because a storage engine designed for OLTP may assume storage as an index-organized table, which would incur the cost of the PK regardless.
adolf garlic about 11 years

web.archive.org/web/20100203025032/http://www.ralphkimball.c‌om/…