Implementing ETL process in Datastage to load a Data Warehouse


ETL process

From an ETL definition the process involves the three tasks:

ETL process from a Datastage standpoint

In datastage the ETL execution flow is managed by controlling jobs, called Job Sequences. A master controlling job provides a single interface to pass parameter values down to controlled jobs and launch hundreds of jobs with desired parameters. Changing runtime options (like moving project from testing to production environment) is done in job sequences and does not require changing the 'child' jobs.
Controlled jobs can be run in parallel or in serial (when a second job is dependant on the first). In case of serial job execution it's very important to check if the preceding set of jobs was executed successfully.
A normal datastage ETL process can be broken up into the following segments (each of the segments can be realized by a set of datastage jobs):

Datawarehouse master load sequence

Usually the whole set of daily executed datastage jobs is run and monitored by one Sequence job. It's created graphically in datastage designer in a similiar way as a normal server job.
Very often the following job sequencer stages/activities are used to do a master controller:

Example of a master job sequence architecture

It's a good practice to follow one common naming convention of jobs. Job names proposed in the example are clear, easy to sort and to analyze what's the jobs hierarchy.

-Master job controller: SEQ_1000_MAS
--Job sequences accessing source: SEQ_1100_SRC
----loading customers: SEQ_1110_CUS
----loading products: SEQ_1120_PRD
----loading time scale: SEQ_1130_TM
----loading orders: SEQ_1140_ORD
----loading invoices: SEQ_1150_INV
--Job filling up lookup keys : SEQ_1200_LK
----loading lookups: SEQ_1210_LK
--Job sequences for transforming data: SEQ_1300_TRS
----transforming customers (dimension): SEQ_1310_CUS_D
----transforming products (dimension): SEQ_1320_PRD_D
----transforming time scale (dimension): SEQ_1330_TM_D
----transforming orders (fact): SEQ_1340_ORD_F
----transforming invoices (fact): SEQ_1350_INV_F
--Job sequence for loading the transformed data into the DW: SEQ_1400_LD

The master job controller (sequence job) for data warehouse load process SEQ_1000_MAS can be designed as depicted below. Please notice that it will not start until a trigger file is present (WaitFoRFile activity). The extract-transform-load job sequences (each of them may contain server jobs or job sequences) will be triggered in serial fashion (not in paralell) and an email notification will finish the process.

Mater job sequence for loading a DataWarehouse
Mater job sequence for loading a DataWarehouse



Back to the Datastage tutorial