SCD - Slowly changing dimensions
Slowly changing dimensions (SCD) determine how the historical changes in the dimension tables are handled.
Implementing the SCD mechanism enables users to know to which category an item belonged to in any given date.
Types of Slowly Changing Dimensions in the Data Warehouse architectures:
- Type 0 SCD is not used frequently, as it is classified as when no effort has been made to deal with the changing dimensions issues. So, some dimension data may be overwritten and other may stay unchanged over the time and it can result in confusing end-users.
- Type 1 SCD DW architecture applies when no history is kept in the database. The new, changed data simply overwrites old entries. This approach is used quite often with data which change over the time and it is caused by correcting data quality errors (misspells, data consolidations, trimming spaces, language specific characters). Type 1 SCD is easy to maintain and used mainly when losing the ability to track the old history is not an issue.
- In the Type 2 SCD model the whole history is stored in the database. An additional dimension record is created and the segmenting between the old record values and the new (current) value is easy to extract and the history is clear. The fields 'effective date' and 'current indicator' are very often used in that dimension.
- Type 3 SCD - only the information about a previous value of a dimension is written into the database. An 'old 'or 'previous' column is created which stores the immediate previous attribute. In Type 3 SCD users are able to describe history immediately and can report both forward and backward from the change.
However, that model can't track all historical changes, such as when a dimension changes twice or more. It would require creating next columns to store historical data and could make the whole data warehouse schema very complex. - Type 4 SCD idea is to store all historical changes in a separate historical data table for each of the dimensions.
- SCD Type 5 involves adding a historical record with a new surrogate key and merging old records into a single historical table, thus preserving the history in a separate table.
- SCD Type 6 combines elements of Types 1, 2, and 3, where historical changes are tracked using additional columns to store current and historical values, often including a flag to indicate the current record.
In order to manage Slowly Changing Dimensions properly and easily it is highly recommended to use Surrogate Keys in the Data Warehouse tables.
A Surrogate Key is a technical key added to a fact table or a dimension table which is used instead of a business key (like product ID or customer ID).
Surrogate keys are always numeric and unique on a table level which makes it easy to distinguish and track values changed over time.
Technically key hash and data hash are used to manage and track changes to dimension records efficiently:
Key hash is a hash value generated from the primary key of a record. It helps in quickly locating records and ensuring that each key is unique, often used in indexing and partitioning strategies.
Data hash is a hash value computed from the entire set of data in a record. It is used to detect changes or verify data integrity by comparing hash values to identify if any data has been altered.
In practice, in big production Data Warehouse environments, mostly the Slowly Changing Dimensions Type 1, Type 2 and Type 3 are considered and used.
It is a common practice to apply different SCD models to different dimension tables (or even columns in the same table) depending on the business reporting needs of a given type of data.
As a follow up you may want to have a look at the following example to see a sample implementation of SCD in Datastage:
Slowly Changing Dimensions - Datastage job designs and examples