Data quality ETL process

Goal

Implement a data warehouse ETL loading process which will perform data quality and consistency tests and generate reports with records that do not pass the validation rules. This topic describes how to perform basic data cleansing tasks using any ETL tool.

Data quality

The data quality process includes such terms as data cleansing, data validation, data manipulation, data quality tests, data refining, data filtering and tuning. It is a crucial area to maintain in order to keep the data warehouse trustworthy for the business users.
An appropriate definition of the data quality criteria will ensure that the quality of the data is measured, analyzed and subsequently improved. The data quality management in an organization requires great involvement from the business side and very often a lot of manual interventions.
ETL plays a major role in data cleansing and data quality process as it helps automate most of the tasks outlined above.

Scenario details

The data warehouse is fed daily with an orders extract which comes from a source OLTP system. Unfortunately, the data quality in that extract is poor as the source system does not perform much consistency checks and there are no data dictionaries.

The data quality problems that need to be addressed are identified using two types of Data Quality Tests: syntax and reference tests.
The syntax tests will report dirty data based on character patterns, invalid characters, incorrect lower or upper case order, etc.
The reference tests will check the integrity of the data according to the data model. So, for example a customer ID which does not exist in a data warehouse customers dictionary table will be reported.

Also, both types of tests report using two severity levels: errors and warnings.
When an error is encountered, the record is logged and not passed through to the output.
Warnings are logged, however still loaded into the data warehouse.

The data quality problems in this example are as follows:

  • an order clearance date is earlier than the date when the order was placed
  • invalid order number - containing invalid characters or incorrect number of characters
  • incorrect address (postal code and street name) which is not consistent with the dictionary
  • a phone number not matching a pattern - when for example it contains blanks or dashes and the format defined in the DW is supposed to be numbers only

    Solution

    The idea is to capture records containing invalid data in a rejects file for further data quality inspection and analysis.
    In practice, the ETL process design in any of the comercial ETL tools will be pretty straightforward and will include an input, a validation step and two outputs: validated records and dirty data.

    The two validation transformations (Character and reference DQ Tests) depicted below will do the following:

  • Compare order clearance date with order entry date - check whether the order entry date is equal or earlier than the clearance date. Reject records which do not meet this criteria. Severity: error
  • Validate order ID - check if it contains invalid characters. The order ID should be numeric. Severity: error
  • Address validation - the postal code is looked up from a datawarehouse dictionary based on a city name. Discrepancies and not matching records are reported. Severity: warning
  • Phone number correction - we store phone numbers in the following format: +prefix number. Lets consider a number: +34 112223333.
    The following variations will be reported as warnings: (0034) 112223333, 11 2223333, +34 11-22-3333

    ETL process for Data Quality and data cleansing (Pentaho Kettle Spoon):
    ETL process for Data Quality and data cleansing (Pentaho Kettle Spoon)


    Data quality web links

    • Data quality - shows the major challenges addressed by Data Quality tools in a data warehousing environment. It descibes such topics as data correctness, consistency, completeness and validity.
    • Data cleansing - the article provides definitions, loading techniques and source data selection considerations for the typical data cleansing processes.





      Back to the Data Warehousing tutorial home