Data masking

Implement a data warehouse data blinding mechanism which would support meaningful test data generation by scrambling, masking and sanitizing data while preserving referential integrity.
The objective of this tutorial lesson is also to show how to assure database privacy and prevent database security breach due to data leakage by internal employees, external consultants or third party service providers.
Furthermore, in many cases data masking is required to comply with data privacy laws and legal regulations.

Business scenario

  • The company stores confidential information in a data warehouse. Among others, the most confidential are credit card numbers, employees salary information, addresses, phone numbers.
  • The data available for testing and development needs to be scrambled, masked or anonymized when appropriate.
  • The data should be blinded/scrambled/masked in a way it still is usable for business testing purposes, preserving referential integrity and business logic
  • Because there are more than one sources for the data, the credit card numbers need to be scrambled, not randomized. This means that the process needs to ensure that the output is unique and consistent across multiple runs.

    Data masking approaches


    Data masking is also very often referred to as: data scrambling , data blinding, data anonymization , data sanitization , data hiding , data encoding.

    Some of the most common data scrambling methods used to anonymize data:
  • Randomization - random data generation, usually in a given range. A modified version of this approach is variance, where each value in a column is modified by random percentage of its original value.
  • Blocking (substitution) - the data in a column is replaced entirely or partially with artificial records, usually from a lookup dictionary table.
  • Masking (blinding) - consists of replacing certain fields with a Mask character;
    For instance the account number '9064 7891 5459 1190' appears as '**** **** **** 1190'
  • Scrambling / Hashing - the data type and size is preserved, however the value is completely different.
    For example account number 9064 7891 5459 1190 becomes 7234 1900 0000 4188
  • Shuffling - the substitution data is derived from the column itself. The data in a column is randomly moved between rows.

    Solutions and sample implementations

  • Data sanitization implementation in Pentaho Data Integration - credit card numbers masking and encoding algorithm example
  • Data scrambling example with the use of PHP and MySQL


    2009-08-12 16:16:12 by Joe Santangelo:
    Kettle tools are wonderful for data manipulations across multiple sources. If you have more than a small environment, you would want something to sit "on top" of Kettle to provide: a secure data inventory, standardization of algorithms across environments and dynamically adjusting for changes in sources data. In addition, discovery and audit processes are critical to address Privacy requirements.
    DMsuite provides the ability to sit on top of Kettle or other ETL tools and provide a single tool for masking your complete (mainframe and distributed) environment.

    Back to the Data Warehousing tutorial home