Extending InfoSphere DataStage

One of the great features of the DataStage parallel engine is that it can be extended with custom components. This article describes the options for building a DataStage PX stage that handles special processing needs not covered by the native stages.

Why build a custom stage?
The main reason is to implement complex business logic that is not easily accomplished with the standard DataStage stages, or to reuse existing C, C++, Java or COBOL transformations.

There are three ways to build one:
Wrappers
Buildops
Custom stages (CustomOPS)

Both BuildOPS and CustomOPS are primarily C++ code.

Wrappers

Wrappers are a good choice when you cannot or do not want to modify the application and performance is not critical.
An OS-level legacy executable can be wrapped and turned into a DataStage PX stage, capable of parallel execution within the framework. Examples of what can be wrapped: a binary file, a Unix command (ls, grep, etc.) or a shell script.

There are a few conditions that the legacy executable needs to fulfill:

amenable to data-partition parallelism
no dependencies between rows
pipe-safe
can read rows sequentially
no random access to data

InfoSphere DataStage treats a wrapper as a black box: it has no knowledge of the contents and no means of managing anything that happens inside the wrapper. It only knows how to export data to the wrapper and import data from it. It is therefore the user's task to know, at design time, the intended behavior of the wrapper and its schema interface.

Example: wrapping the Unix ls command.
Running ls /dwdev/sourcedata yields a list of files and subdirectories. The wrapper therefore consists of the command and a parameter that holds a disk location.
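
The "command plus parameter" idea can be sketched in a few lines of C++. This is not how DataStage implements wrappers internally. It is a hypothetical helper, run_wrapped, that starts the command through the POSIX popen call and treats every output line as one row. It assumes a POSIX system and passes the argument to the shell unquoted, which is acceptable only for illustration.

```cpp
#include <stdio.h>    // popen, pclose (POSIX), fgets
#include <stdexcept>
#include <string>
#include <vector>

// Run "command argument" and return each line of its standard output as one row.
// The command is a black box: only its output stream is visible here.
std::vector<std::string> run_wrapped(const std::string& command,
                                     const std::string& argument) {
    const std::string cmdline = command + " " + argument;
    FILE* pipe = popen(cmdline.c_str(), "r");
    if (!pipe) throw std::runtime_error("popen failed for: " + cmdline);

    std::vector<std::string> rows;
    std::string current;
    char buffer[256];
    while (fgets(buffer, sizeof buffer, pipe)) {
        current += buffer;  // a long line may arrive in several chunks
        if (!current.empty() && current.back() == '\n') {
            current.pop_back();
            rows.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) rows.push_back(current);  // last line without a newline
    pclose(pipe);
    return rows;
}
```

With this helper, run_wrapped("ls", "/dwdev/sourcedata") would return one row per file or subdirectory, which is the single-column output schema the designer has to declare for such a wrapper.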

Buildops

Buildops are a good choice when you need custom coding but do not need dynamic (runtime-based) input and output interfaces.
A buildop provides a simple means of extending the functionality provided by PX but, unlike a wrapper, does not use an existing executable.

Reasons to use a buildop:
Speed and performance
Complex business logic that cannot easily be represented with existing stages, such as advanced lookups across a range of values, custom surrogate key generation or calculating rolling aggregates
Build once and reuse everywhere within the project; no shared container is necessary
Can combine functionality from different stages into one
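
As an illustration of the kind of logic meant by "lookups across a range of values", the following sketch finds the range that contains a given value. It is plain C++ with made-up names (Range, lookup_range). In an actual buildop the reference data and the column names would come from the stage definition, which is not shown here.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical reference row: values in [low, high] map to `label`.
struct Range {
    int low;
    int high;
    std::string label;
};

// Find the range containing `value`.
// `ranges` must be sorted by `low` and must not overlap.
// Returns nullptr when no range matches.
const Range* lookup_range(const std::vector<Range>& ranges, int value) {
    // Locate the first range whose lower bound is greater than `value`.
    auto it = std::upper_bound(ranges.begin(), ranges.end(), value,
                               [](int v, const Range& r) { return v < r.low; });
    if (it == ranges.begin()) return nullptr;  // value is below every range
    --it;  // the only possible match is the range just before that one
    return value <= it->high ? &*it : nullptr;
}
```

The binary search keeps the per-row cost logarithmic in the number of ranges, which matters when the function is called once for every input record.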

The DataStage interface called buildop automatically performs the tedious, error-prone tasks required to compile the program, such as including the needed header files and building the necessary 'plumbing' for correct and efficient parallel execution.
The code needs to be ANSI C/C++ compliant: if it does not compile outside of DataStage, it will not compile within DataStage PX either. Conveniently, BuildOPS are compiled by DataStage itself.
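
One practical consequence is that the core logic can be written and checked as ordinary C++ before it is pasted into a buildop. The sketch below shows a rolling sum, one of the aggregates mentioned earlier, as a standalone class. The class name RollingSum and the fixed window size are assumptions for this example, and the class itself has no connection to the DataStage framework.

```cpp
#include <cstddef>
#include <deque>

// Rolling sum over the most recent `window` values.
class RollingSum {
public:
    explicit RollingSum(std::size_t window) : window_(window), sum_(0.0) {}

    // Add one value and return the sum of the last `window` values seen so far.
    double add(double value) {
        values_.push_back(value);
        sum_ += value;
        if (values_.size() > window_) {
            sum_ -= values_.front();  // drop the value that left the window
            values_.pop_front();
        }
        return sum_;
    }

private:
    std::size_t window_;
    std::deque<double> values_;
    double sum_;
};
```

Unlike the row-by-row filters suited to wrappers, this logic keeps state between rows, so in a parallel job the result depends on how the input is partitioned and sorted.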

Custom stage

A custom stage (C++ coding against the framework API) is used when there is a need for both custom coding and dynamic input and output interfaces.

Add PX operator that is not already in DataStage
Build an Operator and add to DataStage palette
Use Datastage API
Use Custom Stage to add new operator to PX canvas
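
What a "dynamic interface" means can be shown without the framework. The sketch below is not the DataStage API. Record and TrimColumnOperator are hypothetical stand-ins, used only to show an operator that learns at run time which column to act on and passes every other column through, whatever the schema happens to be.

```cpp
#include <map>
#include <string>
#include <utility>

// A record whose columns are only known at run time
// (hypothetical stand-in for a framework record).
using Record = std::map<std::string, std::string>;

// Operator configured at run time with the name of the column to trim.
// All other columns are passed through untouched.
class TrimColumnOperator {
public:
    explicit TrimColumnOperator(std::string column) : column_(std::move(column)) {}

    Record process(Record rec) const {
        auto it = rec.find(column_);
        if (it != rec.end()) it->second = trim(it->second);
        return rec;
    }

private:
    // Remove leading and trailing spaces and tabs.
    static std::string trim(const std::string& s) {
        const char* ws = " \t";
        const std::string::size_type first = s.find_first_not_of(ws);
        if (first == std::string::npos) return "";
        const std::string::size_type last = s.find_last_not_of(ws);
        return s.substr(first, last - first + 1);
    }

    std::string column_;
};
```

The same operator works on records with two columns or twenty, which is the flexibility a fixed table definition cannot offer.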

CustomOPS vs BuildOPS

While DataStage compiles BuildOPS, the DataStage developer is responsible for compiling CustomOPS.
CustomOPS can be very flexible, whereas BuildOPS are generally written for a specific purpose. CustomOPS are generally used like any normal stage.
BuildOPS are written purely in C++, while Custom Operators can use anything as long as the landing functions are defined in C++ (thus languages like Pro*C, Fortran, Pascal, etc. can be used).
BuildOPS are limited to a specific table definition for input and output. Custom Operators offer much greater flexibility.
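
The idea of a C++ landing function in front of code written in another language can be sketched as follows. This assumes that "landing function" means the C++ entry point that forwards to the foreign routine. The names legacy_apply_discount and discounted_amount are invented, and the foreign routine is defined right here as a stand-in so that the example is self-contained.

```cpp
#include <string>

// Stand-in for a routine written in another language and exported with C linkage.
// In a real build it would only be declared here and linked in from an object file.
extern "C" double legacy_apply_discount(double amount, double percent) {
    return amount - amount * percent / 100.0;
}

// C++ landing function: converts the incoming text value to the type the
// legacy routine expects, then delegates the calculation to it.
double discounted_amount(const std::string& amount_text, double percent) {
    const double amount = std::stod(amount_text);  // throws std::invalid_argument on bad input
    return legacy_apply_discount(amount, percent);
}
```

The extern "C" declaration switches off C++ name mangling for that one symbol, which is what lets the linker match it against a routine produced by a different compiler.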
