Data masking

Implement a data warehouse data blinding mechanism which would support meaningful test data generation by scrambling, masking and sanitizing data while preserving referential integrity.
The objective of this tutorial lesson is also to show how to assure database privacy and prevent database security breach due to data leakage by internal employees, external consultants or third party service providers.
Furthermore, in many cases data masking is required to comply with data privacy laws and legal regulations.

Business scenario

The company stores confidential information in a data warehouse. Among others, the most confidential are credit card numbers, employees salary information, addresses, phone numbers.

The data available for testing and development needs to be scrambled, masked or anonymized when appropriate.

The data should be blinded/scrambled/masked in a way it still is usable for business testing purposes, preserving referential integrity and business logic

Because there are more than one sources for the data, the credit card numbers need to be scrambled, not randomized. This means that the process needs to ensure that the output is unique and consistent across multiple runs.

Data masking approaches

Goal

Data masking is also very often referred to as: data scrambling , data blinding, data anonymization , data sanitization , data hiding , data encoding.

Some of the most common data scrambling methods used to anonymize data:

Randomization - random data generation, usually in a given range. A modified version of this approach is variance, where each value in a column is modified by random percentage of its original value.

Blocking (substitution) - the data in a column is replaced entirely or partially with artificial records, usually from a lookup dictionary table.

Masking (blinding) - consists of replacing certain fields with a Mask character;
For instance the account number '9064 7891 5459 1190' appears as '**** **** **** 1190'

Scrambling / Hashing - the data type and size is preserved, however the value is completely different.
For example account number 9064 7891 5459 1190 becomes 7234 1900 0000 4188

Shuffling - the substitution data is derived from the column itself. The data in a column is randomly moved between rows.

Solutions and sample implementations

Data sanitization implementation in Pentaho Data Integration - credit card numbers masking and encoding algorithm example

Data scrambling example with the use of PHP and MySQL

Back to the Data Warehousing tutorial home