Data scrambling using PHP and MySQL

Business case

The business background and algorithms used in this tutorial is the same as in the Pentaho Data Integration sanitization example on our pages.
Please also refer to the Data blinding considerations on our Data Warehousing Tutorial pages for more background information.

PHP and MySQL solution

Scrambling algorithm

The data scrambling algorithm for generating encoded numbers is as follows:

  • Calculate MD5 checksum for the input number (32 digit hex)
  • Remove all the digits from A to F from the hex number.
  • Subtract the first x left hand side digits from the modified hex. If needed, fill the gap with a digit '0'

    Environment setup

  • Install and configure Apache + PHP server. We used Apache 2and PHP 5 for this tutorial lesson.
  • Change the max_execution_time in php.ini
  • Install and configure the MySQL database. Once it is up and running, create a MySQL table for our example. We used the following SQL script to generate it:
    CREATE TABLE etltoolsinfo.scrambled_accounts (
    `
    acc_no` varchar(16) collate latin1_general_ci NOT NULL,
    `
    md5_hash` varchar(32) collate latin1_general_ci NOT NULL,
    `
    md5_dec` varchar(32) collate latin1_general_ci default NULL
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci;
  • Make sure the Apache + PHP + MySQL environment works (use the phpinfo() PHP function to see the settings)

    Implementation

    The PHP code to do the processing is attached below. It contains comments which will help understand how the program works:

    <?php

    /************ PHP MySQL data scrambling ************/

     

    //parameters

    $rec_cnt=1000000;

    $acc_len=16;

     

    //check the max execution time

    echo ini_get('max_execution_time')."<br>";

     

    //connect to the database and select the database

    $db=mysql_connect ("localhost", "etl-tools", ".info");

    mysql_select_db ("etltoolsinfo");

     

    //flush the table first

    mysql_query("delete from scrambled_accounts;")

                or die('Error, delete query failed');

     

     

    //main loop

    for ($i=0;$i<$rec_cnt;$i++) {

          // generate o 16-digit random account number. A sequence is used to avoid generating duplicates.

          //function definition attached below the code

          $acc=gen_accnumber(0+$i,$acc_len);

         

          //calculate a md5 sum

          $h_md5 = md5($acc);

     

               

          //replace A-F hex digits    

          $acc_m=str_replace('a','',$h_md5);

          $acc_m=str_replace('b','',$acc_m);

          $acc_m=str_replace('c','',$acc_m);

          $acc_m=str_replace('d','',$acc_m);

          $acc_m=str_replace('e','',$acc_m);

          $acc_m=str_replace('f','',$acc_m);

          $acc_m=substr($acc_m,0,$acc_len);

     

          // construct sql statement

          $sql="INSERT INTO

                scrambled_accounts(acc_no,md5_hash,md5_dec)

                VALUES ('".$acc."','".$h_md5."','".$acc_m."');";

     

          // run query or stop and raise an error

          mysql_query($sql) or die('Error, insert query failed');

     

    //loop through

    }

     

    // print output

    echo "Processing finished";

     

     

    //function definition

    function gen_accnumber($tmp,$il_chars){

          //generate random number

          $dokl=rand(0,999999);

         

          //number of chars to add to the main number

          $ile = $il_chars-strlen($tmp)-strlen($dokl);

          $tmpdokl="";

         

          //add zeros where appropriate

          for ($j=0;$j<$ile;$j++) {

                $tmpdokl="0".$tmpdokl;

          }

    //return result

    return $dokl.$tmpdokl.$tmp;

     

    }

     

    ?>

     

     

    Execution and testing

  • Run the php script in any web browser by typing in the correct url. In our case it is:
    http://127.0.0.1/scramble_accounts.htm
  • While the script is running you can check its status by running the following statement in MySQL command line:
    SELECT count(1) acc_cnt FROM etltoolsinfo.scrambled_accounts
  • Check the algorithm and the whole process by counting duplicates. If it shows 0 then it is ok



  • Data sanitization main page

    Back to the Data Warehousing tutorial home