Data Allocation Pentaho Data Integration example
The objective of this tutorial example is to make a case study analysis of allocating a measure when loading the data warehouse. A production planning sample data will be used. The following dimensions and measures will be considered:
prod_machine_id - ID of the production machine. We assume that our company has 4 different machines able to produce the same parts.
product_id - product ID which is produced by one of the machines.
produced_parts - number of parts produced in a given time period in the past. This data will serve as a key for allocating a measure.
estimated_prod_per_machine - Estimated (planned) production per product
alloc_measure - allocated measure represents allocated data based on the estimated_prod_per_machine column, weighted by the produced parts field. The dynamic allocation is performed in a way that all elements for each group (prod_machine_id) sum up to the value in 'estimated production per machine'.
The goal is to assign the estimated production value to each product record.
Data allocation source file contents is listed below. The last column shows the result estimated at the end of the lab:
The following assumptions have been made for this business scenario:
Allocation is weighted which means that it will be based on another measure
All calculations and allocation is performed on integers (no decimal places). The sums must match, no values can be lost due to rounding up numbers
To show the point of the processing, there are only a few records in the provided example. In real life processing the amount of source data would be a lot greater.
The data is clean and consistent. We will not do any data quality issues checking
The hardware environment for the study is a dual-core intel laptop with 2GB of RAM running Microsoft Windows XP Professional.
We recommend to adjust the maximum memory limit in spoon.bat (set OPT=-Xmx1024m ...) according to the available hardware
Pentaho Data Integration transformation
The algorithm for allocating a measure by another measure is as follows:
- Get the source data (ms excel file in this case)
- Sort the data by prod_machine_id (the field will be used as a key for grouping)
- Group the data by prod_machine_id and make the following aggregates:
- First value of estimated_prod_per_machine (the number is identical for each prod_machine_id so avg or last as an aggregation type would also do)
- Summarized number of produced parts, used for calculating weights
- Number of elements in each group, will be used for recognizing the last element of each group
- Line number for each group is stored in the Grp_linenr field and the counter restarted for each group
- Calculate the allocated measure value using the following formula: (produced_parts/sum_produced_parts)*estimated_prod_per_machine
To make sure the data is consistent, the last element of each group is calculated in the following way: estimated_prod_per_machine-cumsum (cumulated sum of prevous elements for the group)
Kettle transformations have been illustrated and described below.
Detail solution description
The sample transformation shown in this lab is implemented in Pentaho Data Integration (Spoon).
Allocation PDI transformation overview. The transfrom gets data from an extract, allocates one measure by another and saves output into a textfile:
The data is sorted and grouped and variables needed for further processing are calculated:
The text file output shows an additional measure which is calculated using the allocation algorithm. :
Please refer to the Data allocation
considerations on the data warehousing tutorial main page to get more background information and techniques on allocating data.