Data Allocation Pentaho Data Integration example


Example case

The objective of this tutorial example is to make a case study analysis of allocating a measure when loading the data warehouse. A production planning sample data will be used. The following dimensions and measures will be considered:

  • prod_machine_id - ID of the production machine. We assume that our company has 4 different machines able to produce the same parts.
  • product_id - product ID which is produced by one of the machines.
  • produced_parts - number of parts produced in a given time period in the past. This data will serve as a key for allocating a measure.
  • estimated_prod_per_machine - Estimated (planned) production per product
  • alloc_measure - allocated measure represents allocated data based on the estimated_prod_per_machine column, weighted by the produced parts field. The dynamic allocation is performed in a way that all elements for each group (prod_machine_id) sum up to the value in 'estimated production per machine'.
  • The goal is to assign the estimated production value to each product record.



    Data allocation source file contents is listed below. The last column shows the result estimated at the end of the lab:
    Data allocation source file contents is listed below. The last column shows the result estimated at the end of the lab




    The following assumptions have been made for this business scenario:
  • Allocation is weighted which means that it will be based on another measure
  • All calculations and allocation is performed on integers (no decimal places). The sums must match, no values can be lost due to rounding up numbers
  • To show the point of the processing, there are only a few records in the provided example. In real life processing the amount of source data would be a lot greater.
  • The data is clean and consistent. We will not do any data quality issues checking
  • The hardware environment for the study is a dual-core intel laptop with 2GB of RAM running Microsoft Windows XP Professional.
  • We recommend to adjust the maximum memory limit in spoon.bat (set OPT=-Xmx1024m ...) according to the available hardware



  • Pentaho Data Integration transformation

      The algorithm for allocating a measure by another measure is as follows:
    1. Get the source data (ms excel file in this case)
    2. Sort the data by prod_machine_id (the field will be used as a key for grouping)
    3. Group the data by prod_machine_id and make the following aggregates:
      - First value of estimated_prod_per_machine (the number is identical for each prod_machine_id so avg or last as an aggregation type would also do)
      - Summarized number of produced parts, used for calculating weights
      - Number of elements in each group, will be used for recognizing the last element of each group
      - Line number for each group is stored in the Grp_linenr field and the counter restarted for each group
    4. Calculate the allocated measure value using the following formula: (produced_parts/sum_produced_parts)*estimated_prod_per_machine
      To make sure the data is consistent, the last element of each group is calculated in the following way: estimated_prod_per_machine-cumsum (cumulated sum of prevous elements for the group)
    Kettle transformations have been illustrated and described below.

    Detail solution description

    The sample transformation shown in this lab is implemented in Pentaho Data Integration (Spoon).


    Allocation PDI transformation overview. The transfrom gets data from an extract, allocates one measure by another and saves output into a textfile:
    Allocation PDI transformation overview. The transfrom gets data from an extract, allocates one measure by another and saves output into a textfile


    The data is sorted and grouped and variables needed for further processing are calculated:
    The data is sorted and grouped and variables needed for further processing are calculated


    The final data allocation assignment is performed in this simple javascript step.:
    The final data allocation assignment is performed in this simple javascript step.


    The text file output shows an additional measure which is calculated using the allocation algorithm. :
    The text file output shows an additional measure which is calculated using the allocation algorithm.


    Comments

    2010-04-25 22:37:14 by Sumit Pal:
    The output for the column E3 in the Excel spreadsheet shown above should be 37500 and not 27500 - that is a typo


    Please refer to the Data allocation considerations on the data warehousing tutorial main page to get more background information and techniques on allocating data.



    Back to the Data Warehousing tutorial home