Creating transformations in Spoon – a part of Pentaho Data Integration (Kettle)

The first lesson of our Kettle ETL tutorial will explain how to create a simple transformation using the Spoon application, which is a part of the Pentaho Data Integration suite.
The transformation in our example will read records from a table in an Oracle database, filter them and write the output to two separate text files. Records which pass the validation rule will be spooled into one text file, and those that do not will be redirected to the rejects link, which will place them in a different text file.

Assuming that the Spoon application is installed correctly, the first thing to do after running it is to configure a repository. Once the ‘Select a repository’ window appears, it’s necessary to create or choose one. A repository is a place where all Kettle objects will be stored – in this tutorial it will be an Oracle database.

To create a new repository, click the ‘New’ button and type the connection parameters into the ‘Connection information’ window.
There are two very useful options on this screen: ‘Test’, which allows users to test the new connection, and ‘Explore’, which lets users browse the database schema and explore the database objects.
After clicking ‘Create or Upgrade’, a new repository is created. By default, a user with administrator rights is created – its login name is admin and the password is also admin. It is highly recommended to change the password after the first login.

Database connection in Spoon - a part of Kettle ETL:

If a connection with the repository is established successfully, the Spoon main application window will appear.
To design a new transformation which will perform the tasks described above it’s necessary to take the following steps:

  • Click the ‘New transformation’ icon and enter its name (in our tutorial it will be trsfCountry)

  • Define a database connection. It is located in the left-hand menu, in the ‘Main tree’ area, in the Database connections field

  • Drag and drop the following elements from the ‘Core Objects’ menu to the transformation design area in the center of the screen: Table Input (menu Input), Filter Rows (menu Transform) and two Text File Output objects (menu Output).
  • Edit the Table Input – choose a source database and define an SQL query which will return records to the transformation flow. The ‘Preview’ option is usually very useful here, as it shows a preview of the records returned from the database.

    Oracle table input data in Spoon:
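The idea behind the Table Input step can be sketched in a few lines of plain Python, using an in-memory sqlite3 database in place of the Oracle source; the COUNTRIES table, its columns and the sample data here are hypothetical stand-ins, not the tutorial's actual schema:

```python
import sqlite3

# Hypothetical stand-in for the Oracle source used in the tutorial
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE COUNTRIES (COUNTRY_ID INTEGER, COUNTRY_TEXT TEXT)")
conn.executemany(
    "INSERT INTO COUNTRIES VALUES (?, ?)",
    [(1, "Poland"), (2, "Spain"), (3, "Sweden")],
)

# The kind of query one might type into the Table Input step; every row it
# returns enters the transformation flow
query = "SELECT COUNTRY_ID, COUNTRY_TEXT FROM COUNTRIES"
rows = conn.execute(query).fetchall()
```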

  • The next thing to do is to link the objects together. The links between elements are called Hops and they indicate the direction of the transformation data flow. Hop elements can be found, created and edited in the Main Tree section.
    The easiest way to create a Hop is to drag and drop a link between two objects while holding the left SHIFT key.

  • Once the hops are defined, it’s time to define the validation criteria in the ‘Filter Rows’ object. This is where we define the data flow and its direction based on a validation rule. In our example we want to filter out records with the COUNTRY_TEXT field beginning with the letter S.

    Filtering data in Spoon:
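The rule itself is simple; here is a minimal sketch in plain Python (not Kettle) of what the ‘Filter Rows’ step does with each record, assuming a COUNTRY_TEXT field as in the tutorial and made-up sample values:

```python
def split_rows(rows):
    """Route each record the way the Filter Rows step does: records whose
    COUNTRY_TEXT begins with 'S' match the condition, the rest are rejects."""
    matched, rejects = [], []
    for row in rows:
        if row["COUNTRY_TEXT"].startswith("S"):
            matched.append(row)
        else:
            rejects.append(row)
    return matched, rejects

# Hypothetical sample records
sample = [{"COUNTRY_TEXT": c} for c in ("Poland", "Spain", "Sweden", "Italy")]
matched, rejects = split_rows(sample)
# matched holds Spain and Sweden; rejects holds Poland and Italy
```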

  • The last thing to do is to change the text file output configuration. Enter the names of the files and their extension in the properties window and, if needed, adjust other text-file-specific options here.

  • Save and run the transformation (menu -> Transformation -> Run, or just press the F9 key). Please find below the execution log entries for a correctly configured and executed Spoon transformation. The record count in each of the flows indicates that the filter worked as expected (the table input has 249 rows; 31 were filtered out to the ‘Countries starting with S’ stream, and the remaining 218 went to the Rejects stream).
    Notice that there is an extra line in both text files. This is because the files are configured to write column headers in the first line.
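The extra line is easy to reproduce outside of Kettle; a short Python illustration of a text file output configured to write a header row (the column name and sample rows are made up, and an in-memory buffer stands in for the file):

```python
import csv
import io

rows = [("Spain",), ("Sweden",)]
buf = io.StringIO()  # stands in for the output text file
writer = csv.writer(buf)
writer.writerow(["COUNTRY_TEXT"])  # header written as the first line
writer.writerows(rows)             # then one line per record

line_count = buf.getvalue().count("\r\n")
# line_count is len(rows) + 1 because of the header line
```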

    Transform execution log: