ETL process to load extract and assign surrogate keys

The following spoon Transformation loads the customers extract into a data warehouse and assigns a surrogate (technical) key to all records. As a final step the transform updates the maximum key dictionary table.

ETL process design

The process has been designed in Pentaho Data Integration - Kettle Spoon 3.02.

There are probably many other ways to achieve the same goal, however our intention is also to show how various Spoon components work. For instance, most of the objects might be replaced by a Java Script component to do the calculations and descrease the number of objects.

- CSV input textfile, an extract from the source system
- Datawarehouse table (WH_PARAMS) reference for lookup of the maximum surrogate key

- Output table with newly loaded extract and assigned surrogate keys (D_CUSTOMER)
- New maximum key which updates the reference table (WH_PARAMS)

Please refer to the screenshot below for the transform design:

Surrogate key generation transform design in Kettle Spoon:
Surrogate key generation transform design in Kettle Spoon

Step by step ETL process flow

Below the description of all ETL process steps used in the transformation with the specification of used Spoon objects:

  • CSV file input (CSV Input component) - reads the csv extract with customers data
  • WH surrogate key field (Add constant component) - assigns a constant value to the wh_key_field. The assigned value is wh_cust_no and will be used in the next lookup.
  • Lookup Max Key (Stream Value lookup and Table Input) - those two components get the maximum surrogate key value already stored in the database (wh_key_maxval field).
  • NewCustSeq (Add Sequence) - this step generates a sequence number which starts from 1 and is incremented by every record that passes through it (new_cust_cnt).
  • CalcSurrogateKey (Calculator) - the calculator adds new_cust_cnt to wh_key_maxval and creates a new surrogate key maximum (wh_cust_no)
  • Select Values and D_CUSTOMER (Select Rename Fields and Table Output) - those steps actually feed the customers dimension table in the data warehouse
  • Select Rename Fields and GetNewMaxKey (Select Rename Fields and Group By component) - the newly loaded records are grouped and we select the maximum surrogate key (wh_key_maxval).
  • UpdateKeyLookup (Update component) - updates the maximum surrogate key in the database

    Design of the most significant components in the transform

    LkpMaxKey - Maximum surrogate (technical) key lookup:
    LkpMaxKey - Maximum surrogate (technical) key lookup

    UpdateKeyLookup - update of the maximum surrogate key value in the data warehouse:
    UpdateKeyLookup - update of the maximum surrogate key value in the data warehouse


    2009-03-31 14:47:27 by Hanen:
    Can you provide only the step to create SK and auto_increment it?
    2009-12-25 04:12:55 by vikneswaran:
    can u explain scd in pentaho

  • Generate surrogate key example main page

    Back to the Data Warehousing tutorial home