Generate surrogate key

Goal

Populate a data warehouse dimension table with data that comes from different source systems, and assign a unique record identifier (surrogate key) to each record.

Scenario overview and details

To illustrate the approach, we will use two made-up sources that provide data for a customer dimension. Each extract contains customer records, and each record carries a business key (natural key) assigned by its source system.

In order to isolate the data warehouse from source systems, we will introduce a technical surrogate key instead of re-using the source system's natural (business) key.
A unique, common surrogate key is a one-field numeric key. Compared with a business key, it is shorter, easier to maintain and understand, and independent of changes in the source systems. Also, if the surrogate key generation process is implemented correctly, adding a new source system to the data warehouse processing will not require major effort.

The surrogate key generation mechanism may vary depending on the requirements; however, the inputs and outputs usually fit the design shown below:
Inputs:
- an input represented by an extract from the source system
- a data warehouse reference table for identifying the existing records
- maximum key lookup

Outputs:
- output table or file with newly assigned surrogate keys
- new maximum key
- updated reference table with new records
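
The inputs and outputs listed above can be sketched as a single function. This is a minimal illustration rather than a definitive implementation: the function name assign_surrogate_keys and the field names cust_id and surrogate_key are hypothetical, the reference table is modelled as a plain dictionary, and natural keys are assumed to be unique across the extracts being processed.

```python
from typing import Dict, List, Tuple


def assign_surrogate_keys(
    extract: List[dict],
    reference: Dict[str, int],
    max_key: int,
    natural_key_field: str = "cust_id",          # hypothetical field name
    surrogate_key_field: str = "surrogate_key",  # hypothetical field name
) -> Tuple[List[dict], Dict[str, int], int]:
    """Return (rows with surrogate keys, updated reference, new maximum key)."""
    updated_reference = dict(reference)  # leave the caller's mapping untouched
    keyed_rows = []
    for row in extract:
        natural_key = row[natural_key_field]
        if natural_key not in updated_reference:
            # New record: increment the maximum key and remember the mapping.
            max_key += 1
            updated_reference[natural_key] = max_key
        keyed_rows.append({**row, surrogate_key_field: updated_reference[natural_key]})
    return keyed_rows, updated_reference, max_key


# Usage with made-up data: A100 is already known, B200 is new.
extract = [{"cust_id": "A100", "name": "Alice"}, {"cust_id": "B200", "name": "Bob"}]
reference = {"A100": 1}
rows, new_reference, new_max = assign_surrogate_keys(extract, reference, max_key=1)
print(rows)
print(new_reference, new_max)
```

Returning the new reference and maximum key instead of mutating shared state keeps the three outputs explicit, so the caller decides when to persist them.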

Proposed solution

Assumptions:
- The surrogate key field for our made-up example is WH_CUST_NO.
- To keep the example clear, we will use SCD Type 1 (slowly changing dimension, type 1) to handle changing dimensions. This means that new records overwrite the existing data.

The ETL process implementation requires several inputs and outputs.
Input data:
- customers_extract.csv - first source system extract
- customers2.txt - second source system extract
- CUST_REF - a lookup table which contains mapping between natural keys and surrogate keys
- MAX_KEY - a sequence number which represents last key assignment

Output data:
- D_CUSTOMER - table with new records and correctly associated surrogate keys
- CUST_REF - new mappings added
- MAX_KEY sequence increased


The design of an ETL process for generating surrogate keys will be as follows:

  • The loading process will be executed twice - once for each of the input files
  • Check if the lookup reference data is correct and available:
       - CUST_REF table
       - MAX_KEY sequence
  • Read the extract and first check whether each record already exists. If it does, assign the existing surrogate key to it and update the descriptive data in the main dimension table.
  • If it is a new record, then:
       - populate a new surrogate key and assign it to the record. The new key will be populated by incrementing the old maximum key by 1.
       - insert a new record into the customer dimension table (D_CUSTOMER)
       - insert a new record into the mapping table (which stores business and surrogate keys mapping)
       - update the new maximum key
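
The steps above can be sketched with Python's built-in sqlite3 module. This is an illustration under stated assumptions, not the reference implementation: the table names D_CUSTOMER, CUST_REF and MAX_KEY and the key field WH_CUST_NO come from this article, while the columns CUST_NAME, NATURAL_KEY and LAST_KEY, the function names, and the sample rows are hypothetical. The two in-memory lists stand in for the two extract files, and natural keys are assumed not to collide between the source systems.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical column layout; only the table names and WH_CUST_NO come from the article.
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS D_CUSTOMER (
            WH_CUST_NO INTEGER PRIMARY KEY,
            CUST_NAME  TEXT
        );
        CREATE TABLE IF NOT EXISTS CUST_REF (
            NATURAL_KEY TEXT PRIMARY KEY,
            WH_CUST_NO  INTEGER NOT NULL UNIQUE
        );
        CREATE TABLE IF NOT EXISTS MAX_KEY (LAST_KEY INTEGER NOT NULL);
    """)
    if conn.execute("SELECT COUNT(*) FROM MAX_KEY").fetchone()[0] == 0:
        conn.execute("INSERT INTO MAX_KEY (LAST_KEY) VALUES (0)")
    conn.commit()


def load_extract(conn, records):
    """Load one extract given as an iterable of (natural_key, customer_name)."""
    with conn:  # one transaction per extract: commit on success, roll back on error
        # Check that the lookup data is available before processing.
        row = conn.execute("SELECT LAST_KEY FROM MAX_KEY").fetchone()
        if row is None:
            raise RuntimeError("MAX_KEY is not initialised")
        max_key = row[0]
        for natural_key, name in records:
            found = conn.execute(
                "SELECT WH_CUST_NO FROM CUST_REF WHERE NATURAL_KEY = ?",
                (natural_key,),
            ).fetchone()
            if found:
                # Existing record: reuse its key and overwrite the attributes (SCD Type 1).
                conn.execute(
                    "UPDATE D_CUSTOMER SET CUST_NAME = ? WHERE WH_CUST_NO = ?",
                    (name, found[0]),
                )
            else:
                # New record: increment the old maximum key by 1 and store the mapping.
                max_key += 1
                conn.execute(
                    "INSERT INTO D_CUSTOMER (WH_CUST_NO, CUST_NAME) VALUES (?, ?)",
                    (max_key, name),
                )
                conn.execute(
                    "INSERT INTO CUST_REF (NATURAL_KEY, WH_CUST_NO) VALUES (?, ?)",
                    (natural_key, max_key),
                )
        conn.execute("UPDATE MAX_KEY SET LAST_KEY = ?", (max_key,))


conn = sqlite3.connect(":memory:")
create_schema(conn)
first_source = [("C-001", "Alice"), ("C-002", "Bob")]        # stands in for customers_extract.csv
second_source = [("C-002", "Robert"), ("C-003", "Carol")]    # stands in for customers2.txt
load_extract(conn, first_source)   # the loading process runs once per input
load_extract(conn, second_source)
dimension = conn.execute(
    "SELECT WH_CUST_NO, CUST_NAME FROM D_CUSTOMER ORDER BY WH_CUST_NO"
).fetchall()
last_key = conn.execute("SELECT LAST_KEY FROM MAX_KEY").fetchone()[0]
print(dimension, last_key)
```

In this sketch the second run finds C-002 in CUST_REF, so it keeps surrogate key 2 and only its name is overwritten, while C-003 receives the next key.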

Sample Implementations

Surrogate key generation implemented in various ETL environments:
  • PDI surrogate key - surrogate key generation example implemented in Pentaho Data Integration

Comments

2010-04-18 01:07:16 by nate:
Thanks for posting this article.
I have a similar scenario, but in my case I have to execute the two processes in parallel, as opposed to sequentially as you have suggested.
Please suggest how to ensure that we don't end up generating duplicate keys when running in parallel.


