Processing a header and trailer textfile in Pentaho Data Integration (PDI)



Example case

The input file in our scenario is a stream of records representing invoices. It is a text file in a header-trailer format and has the following structure:
- A header starts with a letter H which is followed by an invoice number, a customer number, a date and currency
- Every item starts with a letter I which is followed by product ID, product age, quantity and net value
- Trailer starts with a T and contains two values and may be used for validation: total number of invoice lines and total net value.

The goal is to fill in two relational tables with invoices data: INVC_HEADERS and INVC_LINES.

We will use Pentaho Data Integration (PDI, also known as Kettle) to design the ETL process.

Pentaho Spoon solution



The final header and trailer transform designed in Pentaho Data Integration is depicted below:
The final header and trailer transform designed in Pentaho Data Integration is depicted below



Detailed analysis of all components of the transform and the ETL process

  • In the first input we select types of records (letters H, I or T) and separate them from the whole data line

    Extract with the invoices:
    Extract with the invoices


  • Then we use a java script object to map the columns properly. The Assign Headers and Lines script is:

    var prev_row;

    var invc_header_id;

    var invc_line_id;

     

    if (prev_row == null) prev_row = row;

    var prevInvHead = prev_row.getString("data_line","-");

     

    if (kind.getString()=='H') {

    invc_header_id = substr(data_line.getString(),0,8);

     

    //Populate invoice header fields

    var cust_id = data_line.Clone().substr(8,14).trim();

    var dte = data_line.Clone().substr(14,22).trim();

    var crncy = data_line.Clone().substr(22,25).trim();

    }

    if (kind.getString()!='I') {

    invc_line_id=0;

    } else {

    invc_line_id++;

    //populate invoice lines fields

    var prod_id = data_line.Clone().substr(1,7).trim();

    var prod_age = data_line.Clone().substr(7,9).trim();

    var quantity = data_line.Clone().substr(9,16).trim();

    var net_value = data_line.Clone().substr(16,27).trim().str2num("0.##",".","");

    }

     



    PDI header and trailer mapping javascript object:
    PDI header and trailer mapping javascript object


  • Two filters to split data flows - one to pass through the headers and the other differenciates records between items and trailers.
    PDI header and trailer mapping javascript object:
    PDI header and trailer mapping javascript object


    ETL process execution log:
    ETL process execution log


    Contents of the invc_headers.csv and invc_items.csv output files
    invc_header_id;cust_id;dte;crncy
    INV01696;R006PL;20080316;EUR
    INV01698;W002UK;20080317;GBP
    
    invc_header_id;invc_line_id;prod_id;prod_age;quantity;net_value
    INV01696;1;CONU01;3;5;22,47
    INV01696;2;TRFO00;9;1;49,99
    INV01698;1;HOBE01;0;100;1199,99
    INV01698;2;RORE02;12;10;149,99
    INV01698;3;LOMA02;6;1;49,99
    INV01698;4;CONU01;3;50;199,49
    




  • Header and trailer example main page

    Back to the Data Warehousing tutorial home