Extending InfoSphere DataStage
One of the great features of the DataStage parallel engine is that it can be extended with custom components.
This article describes the options for building a DataStage PX stage that handles special processing needs not covered by the native stages.
Why build a custom stage?
The main reasons are to implement complex business logic that is not easily accomplished with the standard DataStage stages, or to reuse existing C, C++, Java or COBOL transformations.
DataStage jobs can be extended using the following options:
- Wrappers
- Buildops
- Custom Stages (CustomOPS)
Both BuildOPS and CustomOPS are primarily C++ code.
Wrappers
Wrappers are good if you cannot or do not want to modify the application and performance is not critical.
An OS-level legacy executable can be wrapped and turned into a DataStage PX stage that is capable of parallel execution within the framework. Examples of what can be wrapped: a binary file, a Unix command (ls, grep, etc.), a shell script.
There are a few conditions that the legacy executable needs to fulfill:
- amenable to data-partition parallelism
- no dependencies between rows
- pipe-safe
- can read rows sequentially
- no random access to data
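The conditions above can be illustrated with a minimal sketch of a filter that would be a reasonable candidate for wrapping. It is not taken from any DataStage sample; the names `filter_rows` and `filter_text` and the newline-delimited row format are assumptions made for illustration only.

```cpp
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical row-at-a-time filter (illustration only, not a DataStage API).
// Reads newline-delimited rows from `in`, keeps the rows that contain
// `needle`, writes them to `out`, and returns the number of rows kept.
// Rows are read strictly in order and no state is carried from one row to
// the next, so one copy of the program can run per partition on pipes.
std::size_t filter_rows(std::istream& in, std::ostream& out,
                        const std::string& needle) {
    std::size_t kept = 0;
    std::string row;
    while (std::getline(in, row)) {              // sequential read, no seeking
        if (row.find(needle) != std::string::npos) {
            out << row << '\n';                  // decision depends on this row only
            ++kept;
        }
    }
    return kept;
}

// Convenience helper for trying the filter on in-memory text.
std::string filter_text(const std::string& text, const std::string& needle) {
    std::istringstream in(text);
    std::ostringstream out;
    filter_rows(in, out, needle);
    return out.str();
}
```

In a standalone executable, `main` would simply call `filter_rows` with `std::cin` and `std::cout`, which is what makes the program pipe-safe: it never reopens or seeks its input.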
InfoSphere DataStage treats a wrapper as a black box: it has no knowledge of the wrapper's contents and no means of managing anything that occurs inside it. It only knows how to export data to the wrapper and import data from it. It is therefore the user's task to know, at design time, the intended behavior of the wrapper and its schema interface.
Example: wrapping the Unix ls command:
ls /dwdev/sourcedata yields a list of files and subdirectories. The wrapper thus consists of the command and a parameter that contains a disk location.
Buildops
Buildops are good if users need custom coding but do not need dynamic (runtime-based) input and output interfaces.
A Buildop provides a simple means of extending the functionality provided by PX but, unlike a wrapper, it does not use an existing executable.
Reasons to use a Buildop include:
- Speed / Performance
- Complex business logic that cannot be easily represented using existing stages, like advanced lookups across a range of values, custom surrogate key generation, calculating rolling aggregates
- Built once and reusable everywhere within the project; no shared container necessary
- Can combine functionality from different stages into one
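Two of the examples mentioned above, a lookup across a range of values and a rolling aggregate, can be sketched in plain C++ as follows. This shows only the kind of per-row logic one might place in a Buildop; it does not use the Buildop macros or any DataStage header, and the class names `RangeLookup` and `RollingSum` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <iterator>
#include <map>
#include <string>

// Range lookup: maps a value to the label of the band whose lower bound is
// the largest one that does not exceed the value.
class RangeLookup {
public:
    void add_band(double lower_bound, const std::string& label) {
        bands_[lower_bound] = label;
    }

    // Returns the matching label, or `fallback` if the value lies below
    // every band.
    std::string find(double value, const std::string& fallback = "") const {
        auto it = bands_.upper_bound(value);     // first band starting above value
        if (it == bands_.begin()) return fallback;
        return std::prev(it)->second;
    }

private:
    std::map<double, std::string> bands_;        // keyed by lower bound
};

// Rolling sum over the last `window` values pushed.
class RollingSum {
public:
    explicit RollingSum(std::size_t window) : window_(window) {
        assert(window > 0);
    }

    // Adds a value and returns the sum of the most recent `window` values.
    double push(double value) {
        values_.push_back(value);
        sum_ += value;
        if (values_.size() > window_) {          // drop the oldest value
            sum_ -= values_.front();
            values_.pop_front();
        }
        return sum_;
    }

private:
    std::size_t window_;
    std::deque<double> values_;
    double sum_ = 0.0;
};
```

Note that a rolling aggregate keeps state between rows. In a parallel job each instance would hold its own state, so the result depends on how the data is partitioned and sorted before it reaches the stage.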
The DataStage interface called buildop automatically performs the tedious, error-prone tasks required to compile the program, such as including the needed header files and building the necessary 'plumbing' for correct and efficient parallel execution.
The code needs to be ANSI C/C++ compliant. If the code does not compile outside of DataStage, it will not compile within DataStage PX either.
The good thing is that BuildOPS are compiled by DataStage itself.
Custom stage
A custom stage (C++ coding against the framework API) is used when there is a need for custom coding and for dynamic input and output interfaces.
The main reasons for building a custom stage:
- Add a PX operator that is not already in DataStage
- Build an operator and add it to the DataStage palette
- Use the DataStage API
- Use a Custom Stage to add a new operator to the PX canvas
CustomOPS vs BuildOPS
- While DataStage compiles BuildOPS, the DataStage developer is responsible for compiling CustomOPS.
- CustomOPS can be very flexible, whereas BuildOPS are generally used for a specific purpose. CustomOPS are generally used like any normal stage.
- BuildOPS are written purely in C++, while Custom Operators can use any language as long as the landing functions are defined in C++ (thus languages like Pro*C, Fortran, Pascal, etc. can be used).
- BuildOPS are limited to a specific table definition for input and output. Custom Operators have much greater flexibility.
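The difference between a fixed table definition and a dynamic interface can be shown with a small plain C++ analogy. This is not the DataStage operator API; `OrderRow`, `DynamicRow`, `total_fixed` and `total_dynamic` are hypothetical names, and the all-numeric columns are an assumption made to keep the sketch short.

```cpp
#include <map>
#include <string>
#include <vector>

// Fixed interface (Buildop-like): the column set is known at design time
// and baked into the code.
struct OrderRow {
    int order_id;
    double amount;
};

double total_fixed(const std::vector<OrderRow>& rows) {
    double total = 0.0;
    for (const OrderRow& row : rows) total += row.amount;
    return total;
}

// Dynamic interface (custom-operator-like): the columns are discovered at
// run time, so a row is modeled here as a name-to-value map.
using DynamicRow = std::map<std::string, double>;

// Sums whichever column the caller names; rows lacking that column are skipped.
double total_dynamic(const std::vector<DynamicRow>& rows,
                     const std::string& column) {
    double total = 0.0;
    for (const DynamicRow& row : rows) {
        auto it = row.find(column);
        if (it != row.end()) total += it->second;
    }
    return total;
}
```

The fixed version only works for one record layout, while the dynamic version accepts any layout and decides at run time which column to process. That added flexibility is what the extra effort of a custom operator buys.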