Data Sanitization Pentaho Data Integration (PDI) example



Example case

The objective of this tutorial example is to make a case study analysis for the process of credit card numbers anonymization.

The numbers need to be scrambled in a unique and not randomized way. This means that a given credit card number will be masked in the same way each time.

Please refer to the Data masking considerations on our pages to get more background information.

The following assumptions have been made for this test scenario:

  • The credit card number is 16 digits long
  • Random sample of 1 million records containing credit card numbers.
  • The hardware environment for the study is a dual-core intel laptop with 3GB of RAM running Microsoft Windows XP Professional.
  • We recommend to adjust the maximum memory limit in spoon.bat (set OPT=-Xmx1024m ...) according to the available hardware

    Credit card numbers scrambling algorithm

      The algorithm for scrambling credit card numbers is as follows:
    1. Get the original credit card number (16 digits)
    2. Calculate MD5 checksum for this number (32 digit hex)
    3. Remove all the digits from A to F from the hex number. Just replace them by a null string (the length of that number will vary)
    4. Subtract the first 16 digits from the modified hex. If the modified hex is shorter than 16 digits then fill the gap with a digit '9'
    Kettle transformations have been illustrated and described below.

    Detail solution description

    The solutions shown in this tutorial are implemented in Pentaho Data Integration (Spoon).
    There are two transforms to handle the processing:

    1. gen_acc_no.ktr - the transform to generate and scramble credit card numbers
    2. find_dups.ktr - check for duplicates.
      The transform will be also modified to check if the algorithm also works for shorter numbers and to find out how big the population can be to remain unique using this approach.

    Generate and scramble credit card numbers - gen_acc_no transform

    The transform does the following processing:

  • Generate 1 million random and unique accounts.
  • Calculate MD5 checksum for each account number
  • Populate masked credit card numbers according to the algorithm described above
  • Save output to the CSV file

    Gen_acc_no overview. The transfrom generates and masks credit card numbers:
    Gen_acc_no overview. The transfrom generates and masks credit card numbers


    Generate random rows and add a sequence to the flow:
    Generate random rows and add a sequence to the flow


    Javascript component to generate card numbers:

    //set parameters
    var acc_length=str2num(getVariable("ACC_LEN",""));
    var
    fill_char=getVariable("FILL_BLANKS_CHAR","");

    //calculate random modifier
    var rnd_suffix = ""+Math.round(Math.random()*100000);


    //fill in missing characters
    var str_acc_seq=""+acc_seq.getInteger(); //convert to string
    var
    fill_cnt=acc_length-rnd_suffix.length-str_acc_seq.length;


    //repeat string
    var str_tmp="";

    for
    (var i=0;i<fill_cnt;i++) {
        str_tmp=str_tmp+fill_char;
    }


    //generate random and unique account number
    var acc_no = rnd_suffix+str_tmp+str_acc_seq;


    MD5 hash Javascript function in Kettle:

    var acc_no_md5 = new Packages.org.apache.commons.codec.digest.DigestUtils.md5Hex(acc_no.clone());


    Javascript component to modify MD5 hash to generate scrambled numeric credit card number:

    var str_tmp="";
    var fill_char=getVariable("FILL_BLANKS_CHAR",""); //char used to fill blanks

    var acc_no_nohex=acc_no_md5.clone();
    acc_no_nohex.
    replace("a", "").replace("b", "").replace("c", "").replace("d", "").replace("e", "").replace("f", "");

    var acc_no_tmp=""+acc_no_nohex;

    if (acc_no_tmp.length<acc_length.getNumber()) {
       
     //fill in missing characters
       
    var fill_cnt=acc_length.getNumber()-acc_no_tmp.length;

        //repeat string
       
    for (var i=0;i<fill_cnt;i++) {
            str_tmp=str_tmp+fill_char;
        }

        acc_no_tmp=str_tmp+acc_no_tmp;
    }

    var acc_no_scrambled = substr(acc_no_tmp,0,acc_length.getNumber());



    Execution log:
    Execution log


    Generated anonymized accounts file:
    Generated anonymized accounts file


    Find duplicates - find_dups transform

    The transform does the following processing:

  • Read credit cards data from a textfile
  • Sort and group records to find duplicates
  • If found any duplicates, spool them out to a textfile

    find_dups transform overview:
    find_dups transform overview


    Sort and group by to find duplicates:
    Sort and group by to find duplicates


    Execution log. No duplicates found proves that the algorith works fine for this business case:
    Execution log. No duplicates found proves that the algorith works fine for this business case




    Comments

    2010-09-02 11:59:31 by Joe Harris:
    This post ranks highly in Google for 'Kettle compare checksums' however the info is a little out of date. This is what I've found by trying to do this same thing today (Aug 2010 - PDI 4.0).

    First and most importantly you do not need to use the Javascript step to do the core work above in PDI 4.0.

    Create Checksums:
    1. There now a step called "Add a Checksum" in the Transform folder. This is a very simple step and allows you to add a checksum from any number of input fields using CRC, MD5, SHA1, etc.
    2. Less obviously the "Calculator" step, also in the Transform folder, lets you add checksums for a file and do other calculations in a single step. (I have not had occasion to make use of this checksum functionality myself)

    Compare Checksums: (not actually mentioned in this post but very useful)
    This stumped me for an hour or so but the best way to do it is with the "Filter Rows" step. This is actually an extremely useful step in many contexts (so get to know it!). To compare checksums you simply add them to each side of an equality check. E.g. it should read [new_checksum] [=] [old_checksum] and then you send the True results one way and the False results another.

    Hope this helps someone.
    Joe
    2011-09-10 17:53:09 by raj:
    Checksum results does not match when checksum is calculated on multiple columns (concatenated)


  • Data sanitization main page

    Back to the Data Warehousing tutorial home