SQL/SSIS Data Warehouse fact table loading: best practices?
Looks fine. There are options if you start to run into performance issues, but if this process is stable (it finishes within your data-loading window, the source systems aren't being drained of resources, etc.), then I see no reason to change it.
Some potential issues to keep an eye on...
- Having 20+ full-cache Lookup transforms may pose a problem if your dimensions grow in size, due to memory constraints on the SSIS server...but since they are Type 1 (and presumably small), I wouldn't worry.
- Full-cache lookups "hydrate" pre-execution, i.e. each one loads its entire reference set into memory before the data flow starts...having 20+ of them can slow package startup.
A common alternative (to what you have above) is to extract the fact data from the source system, land it in a staging area, and then resolve all the dimension keys with a single SQL statement. Some teams even keep a set of dimension key mapping tables in the staging area specifically for this purpose. This reduces locking/blocking on the source system, which matters if each load moves a lot of data and would otherwise hold the source while the rows stream through those 20+ Lookup transforms.
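A minimal sketch of that single-statement key resolution, assuming hypothetical staging and dimension table names (`stg.FactSales`, `dim.Customer`, `dim.Product`) and surrogate key 0 reserved for the unknown member:

```sql
-- Resolve dimension surrogate keys in one set-based statement
-- instead of 20+ row-by-row Lookup transforms.
-- All table and column names here are illustrative.
INSERT INTO dw.FactSales (CustomerSK, ProductSK, SaleAmount, SaleDate)
SELECT
    ISNULL(c.CustomerSK, 0) AS CustomerSK,  -- 0 = unknown member on no match
    ISNULL(p.ProductSK, 0)  AS ProductSK,
    s.SaleAmount,
    s.SaleDate
FROM stg.FactSales AS s
LEFT JOIN dim.Customer AS c
    ON c.CustomerBusinessKey = s.CustomerBusinessKey
LEFT JOIN dim.Product AS p
    ON p.ProductBusinessKey = s.ProductBusinessKey;
```

The `LEFT JOIN` plus `ISNULL(..., 0)` reproduces the "ignore failure, default to unknown member" behavior of the SSIS lookups, but as one set-based operation inside the staging database.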
Having a good staging area strategy becomes more important when you have a large amount of data, large dimensions, complex key mappings (usually due to multiple source systems), and short data-loading time windows.
Admin
Updated on June 04, 2022

Comments
-
Admin, almost 2 years ago
I am building my first data warehouse in SQL Server 2008/SSIS and I am looking for some best practices around loading the fact tables.
Currently my DW has about 20 dimensions (Offices, Employees, Products, Customers, etc.), all Type 1 SCDs. There are a few things I have already applied in my DW structure:
- No NULLs (replaced with a blank for text or 0 for numerics during staging)
- Unknown member populated in each dimension (surrogate key 0)
- UPSERT for SCD Type 1 loading from stage to production table
- SELECT DISTINCT for my loading of dimensions
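The Type 1 UPSERT from stage to production can be expressed as a single `MERGE` statement in SQL Server 2008. A sketch with hypothetical table and column names:

```sql
-- Type 1 SCD load: overwrite changed attributes in place,
-- insert rows for new business keys. Names are illustrative.
MERGE dim.Customer AS tgt
USING (SELECT DISTINCT CustomerBusinessKey, CustomerName, City
       FROM stg.Customer) AS src
    ON tgt.CustomerBusinessKey = src.CustomerBusinessKey
WHEN MATCHED THEN                       -- existing member: Type 1 overwrite
    UPDATE SET tgt.CustomerName = src.CustomerName,
               tgt.City         = src.City
WHEN NOT MATCHED BY TARGET THEN         -- new member: insert
    INSERT (CustomerBusinessKey, CustomerName, City)
    VALUES (src.CustomerBusinessKey, src.CustomerName, src.City);
```

The `SELECT DISTINCT` in the source query mirrors the dimension-loading practice listed above and also protects `MERGE` from failing on duplicate source keys.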
In my fact-loading SSIS package, my current method is a chain of 20+ Lookup transforms, one per dimension, that resolve the surrogate keys before populating the fact table.
For my lookups I set:
- Full Cache
- Ignore Failures for "no matching entries"
- A Derived Column transformation with "ISNULL(surrogate_idkey) ? 0 : surrogate_idkey" for each SK, so that failed lookups default to SK 0 (the unknown member)
- Some of my dimension lookups have more than one business key
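For the lookups with more than one business key, the same default-to-unknown pattern applies; the join just has to match on every key column. A hypothetical sketch (assumed columns `OfficeCode` and `SourceSystem` forming the composite key):

```sql
-- Composite business key lookup: every key column must match,
-- otherwise the row falls back to the unknown member (SK 0).
SELECT
    ISNULL(o.OfficeSK, 0) AS OfficeSK   -- 0 if either key column misses
FROM stg.FactSales AS s
LEFT JOIN dim.Office AS o
    ON  o.OfficeCode   = s.OfficeCode     -- part 1 of composite key
    AND o.SourceSystem = s.SourceSystem;  -- part 2 of composite key
```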
Is this the best approach? Pictures attached to help with my description above.
-
Admin, over 11 years ago: Thanks Banton. Currently we are loading a full dump of 4M records with about 200 columns, plus about 2k new rows each day; the loading stage is fairly quick. Thanks for the feedback.
-
bonCodigo, almost 10 years ago: Please FOLLOW, USE AND SHARE the initiative for a dedicated BI site. I first raised this question on Meta when there were no BI site proposals.