XML ETL Processing
XML
This section introduces XML (Extensible Markup Language) in data processing and explains basic XML concepts which might help understand the use of various ETL tools to process the XML data.
The sample business case shows how to implement ETL process to process the XML data from a web-based front-end to an OLTP system and to process the operational data periodically.
XML data processing concepts
The success and broad use of XML mainly derives from it being platform independent, text based, straightforward, easy to understand, human readable and user defined.
By far the most widely known and encountered form of XML is HTML, a markup language for web pages.
In data integration, the XML standard is very often used to communicate between applications.
Most commercial ETL tools have dedicated components to handle the XML processing.
Each ETL process for the XML processing task comes with its own challenges and requires its own techniques.
Most ETL tools usually provide various approaches to process and transform XML data.
However, most XML ETL processing tasks tend to yield to one of the two ways the data is interpreted and represented:
The Full Object model representation is more sophisticated, the most powerful and useful choice when working with complex XML schemas.
Business Scenario
The hypothetical manufacturing company uses two operational systems:
These purchase orders are stored in a designated directory on the web server and the whole collection is processed on a daily basis (overnight).
The aim is to load the purchase orders data from the multiple XML files into an operational table with company's orders.
Proposed solution
In rough outline, the ETL process involves the following steps:
a. Parse XML file using the event XML approach
b. Pass through additional data to the processing flow, like the file name which contains dates and a timestamp
c. Validate records (check for example if a customer exists in the database and if the data is correct)
d. Assign order number to each purchase order
e. Feed the OLTP table with open orders
f. Feed the OLTP table with loading history with a corresponding status