SQL/SSIS Data Warehouse fact table loading: best practices?
Looks fine. There are options if you start to run into performance issues, but if this process is stable (it finishes within your data-loading window, the source systems aren't being drained of resources, etc.), then I see no reason to change it.
Some potential issues to keep an eye on...
- Having 20+ full-cache Lookup transforms may pose a problem if your dimensions grow in size, due to memory constraints on the SSIS server...but since they are Type 1 (and presumably small), I wouldn't worry.
- Full-cache lookups "hydrate" pre-execution, i.e. each one loads its entire reference set into memory before the data flow starts...having 20+ of them can slow package startup.
A common alternative (to what you have above) is to extract the fact data from the source system, land it in a staging area, and then resolve all the dimension keys with a single SQL statement. Some teams even keep a set of dimension key mapping tables in the staging area specifically for this purpose. This reduces locking/blocking on the source system, which matters if each load moves a lot of data and would otherwise hold the source while the rows stream through those 20+ Lookup transforms.
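A minimal sketch of that single-statement key resolution, assuming hypothetical staging and dimension table names (`stg.FactSales`, `dim.Customer`, `dim.Product`) and surrogate key 0 reserved for the unknown member:

```sql
-- Resolve dimension surrogate keys in one set-based statement
-- instead of 20+ row-by-row Lookup transforms.
-- All table and column names here are illustrative.
INSERT INTO dw.FactSales (CustomerSK, ProductSK, SaleAmount, SaleDate)
SELECT
    ISNULL(c.CustomerSK, 0) AS CustomerSK,  -- 0 = unknown member on no match
    ISNULL(p.ProductSK, 0)  AS ProductSK,
    s.SaleAmount,
    s.SaleDate
FROM stg.FactSales AS s
LEFT JOIN dim.Customer AS c
    ON c.CustomerBusinessKey = s.CustomerBusinessKey
LEFT JOIN dim.Product AS p
    ON p.ProductBusinessKey = s.ProductBusinessKey;
```

The `LEFT JOIN` plus `ISNULL(..., 0)` reproduces the "ignore failure, default to unknown member" behavior of the SSIS lookups, but as one set-based operation inside the staging database.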
Having a good staging area strategy becomes more important when you have a large amount of data, large dimensions, complex key mappings (usually due to multiple source systems), and short data-loading time windows.
Admin
Updated on June 04, 2022

Comments
-
Admin, almost 2 years ago
I am building my first data warehouse in SQL Server 2008/SSIS and I am looking for some best practices around loading the fact tables.
Currently my DW has about 20 dimensions (Offices, Employees, Products, Customers, etc.), all Type 1 SCDs. There are a few things I have already applied in my DW structure:
- No NULLs (replaced with a blank for text or 0 for numerics during staging)
- Unknown member populated in each dimension (surrogate key 0)
- UPSERT for SCD Type 1 loading from stage to production table
- SELECT DISTINCT for my loading of dimensions
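The Type 1 UPSERT from stage to production can be expressed as a single `MERGE` statement in SQL Server 2008. A sketch with hypothetical table and column names:

```sql
-- Type 1 SCD load: overwrite changed attributes in place,
-- insert rows for new business keys. Names are illustrative.
MERGE dim.Customer AS tgt
USING (SELECT DISTINCT CustomerBusinessKey, CustomerName, City
       FROM stg.Customer) AS src
    ON tgt.CustomerBusinessKey = src.CustomerBusinessKey
WHEN MATCHED THEN                       -- existing member: Type 1 overwrite
    UPDATE SET tgt.CustomerName = src.CustomerName,
               tgt.City         = src.City
WHEN NOT MATCHED BY TARGET THEN         -- new member: insert
    INSERT (CustomerBusinessKey, CustomerName, City)
    VALUES (src.CustomerBusinessKey, src.CustomerName, src.City);
```

The `SELECT DISTINCT` in the source query mirrors the dimension-loading practice listed above and also protects `MERGE` from failing on duplicate source keys.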
In my fact-loading SSIS package, my current method is a chain of 20+ Lookup transforms, one per dimension, that resolve the surrogate keys before populating the fact table.
For my lookups I set:
- Full Cache
- Ignore Failures for "no matching entries"
- A Derived Column transformation with "ISNULL(surrogate_idkey) ? 0 : surrogate_idkey" for each SK, so that failed lookups default to SK 0 (the unknown member)
- Some of my dimension lookups have more than one business key
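For the lookups with more than one business key, the same default-to-unknown pattern applies; the join just has to match on every key column. A hypothetical sketch (assumed columns `OfficeCode` and `SourceSystem` forming the composite key):

```sql
-- Composite business key lookup: every key column must match,
-- otherwise the row falls back to the unknown member (SK 0).
SELECT
    ISNULL(o.OfficeSK, 0) AS OfficeSK   -- 0 if either key column misses
FROM stg.FactSales AS s
LEFT JOIN dim.Office AS o
    ON  o.OfficeCode   = s.OfficeCode     -- part 1 of composite key
    AND o.SourceSystem = s.SourceSystem;  -- part 2 of composite key
```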
Is this the best approach? Pictures attached to help with my description above.
-
Admin, over 11 years ago: Thanks Banton. Currently we are loading a full dump of 4M records with about 200 columns, plus about 2k new rows each day; the loading stage is fairly quick. Thanks for the feedback.
-
bonCodigo, almost 10 years ago: Please FOLLOW, USE AND SHARE the initiative for a dedicated BI site. I first raised this question on Meta when there were no BI site proposals.