ETL process to load extract and assign surrogate keys
The following spoon Transformation loads the customers extract into a data warehouse and assigns a surrogate (technical) key to all records. As a final step the transform updates the maximum key dictionary table.
ETL process design
The process has been designed in Pentaho Data Integration - Kettle Spoon 3.02.
There are probably many other ways to achieve the same goal, however our intention is also to show how various Spoon components work.
For instance, most of the objects might be replaced by a Java Script component to do the calculations and descrease the number of objects.
- CSV input textfile, an extract from the source system
- Datawarehouse table (WH_PARAMS) reference for lookup of the maximum surrogate key
- Output table with newly loaded extract and assigned surrogate keys (D_CUSTOMER)
- New maximum key which updates the reference table (WH_PARAMS)
Please refer to the screenshot below for the transform design:
Surrogate key generation transform design in Kettle Spoon:
Step by step ETL process flow
Below the description of all ETL process steps used in the transformation with the specification of used Spoon objects:
CSV file input (CSV Input component) - reads the csv extract with customers data
WH surrogate key field (Add constant component) - assigns a constant value to the wh_key_field. The assigned value is wh_cust_no and will be used in the next lookup.
Lookup Max Key (Stream Value lookup and Table Input) - those two components get the maximum surrogate key value already stored in the database (wh_key_maxval field).
NewCustSeq (Add Sequence) - this step generates a sequence number which starts from 1 and is incremented by every record that passes through it (new_cust_cnt).
CalcSurrogateKey (Calculator) - the calculator adds new_cust_cnt to wh_key_maxval and creates a new surrogate key maximum (wh_cust_no)
Select Values and D_CUSTOMER (Select Rename Fields and Table Output) - those steps actually feed the customers dimension table in the data warehouse
Select Rename Fields and GetNewMaxKey (Select Rename Fields and Group By component) - the newly loaded records are grouped and we select the maximum surrogate key (wh_key_maxval).
UpdateKeyLookup (Update component) - updates the maximum surrogate key in the database
Design of the most significant components in the transform
LkpMaxKey - Maximum surrogate (technical) key lookup:
UpdateKeyLookup - update of the maximum surrogate key value in the data warehouse:
Generate surrogate key example main page