Processing a header and trailer textfile in Pentaho Data Integration (PDI)
Example case
The input file in our scenario is a stream of records representing invoices. It is a text file in a header-trailer format and has the following structure:
- A header starts with a letter H which is followed by an invoice number, a customer number, a date and currency
- Every item starts with a letter I which is followed by product ID, product age, quantity and net value
- Trailer starts with a T and contains two values and may be used for validation: total number of invoice lines and total net value.
The goal is to fill in two relational tables with invoices data: INVC_HEADERS and INVC_LINES.
We will use Pentaho Data Integration (PDI, also known as Kettle) to design the ETL process.
Pentaho Spoon solution
Detailed analysis of all components of the transform and the ETL process
var prev_row;
var invc_header_id;
var invc_line_id;
if (prev_row == null) prev_row =
row;
var prevInvHead = prev_row.getString("data_line","-");
if (kind.getString()=='H') {
invc_header_id = substr(data_line.getString(),0,8);
//Populate invoice header fields
var cust_id = data_line.Clone().substr(8,14).trim();
var dte = data_line.Clone().substr(14,22).trim();
var crncy = data_line.Clone().substr(22,25).trim();
}
if (kind.getString()!='I') {
invc_line_id=0;
} else {
invc_line_id++;
//populate invoice lines fields
var prod_id = data_line.Clone().substr(1,7).trim();
var prod_age = data_line.Clone().substr(7,9).trim();
var quantity = data_line.Clone().substr(9,16).trim();
var net_value = data_line.Clone().substr(16,27).trim().str2num("0.##",".","");
}
Contents of the invc_headers.csv and invc_items.csv output files
invc_header_id;cust_id;dte;crncy INV01696;R006PL;20080316;EUR INV01698;W002UK;20080317;GBP invc_header_id;invc_line_id;prod_id;prod_age;quantity;net_value INV01696;1;CONU01;3;5;22,47 INV01696;2;TRFO00;9;1;49,99 INV01698;1;HOBE01;0;100;1199,99 INV01698;2;RORE02;12;10;149,99 INV01698;3;LOMA02;6;1;49,99 INV01698;4;CONU01;3;50;199,49