XML ETL Processing


This section introduces XML (Extensible Markup Language) in the context of data processing and explains the basic XML concepts needed to understand how various ETL tools handle XML data.
The sample business case shows how to implement an ETL process that loads XML data from a web-based front end into an OLTP system and processes the operational data periodically.

XML data processing concepts

The success and broad adoption of XML derives mainly from it being platform independent, text based, straightforward, easy to understand, human readable and user defined.
By far the most widely known markup language is HTML, used for web pages; XML uses a similar tag-based syntax but lets users define their own document structures. In data integration, the XML standard is very often used for communication between applications.

Most commercial ETL tools have dedicated components to handle XML processing.

Each XML processing task comes with its own challenges and requires its own techniques, and most ETL tools provide several approaches to parsing and transforming XML data.

However, most XML ETL processing tasks rely on one of two ways of interpreting and representing the data:

  • Event XML model - XML is interpreted as a series of events: each incoming XML data string is treated as a separate entity, an event, and each event is converted into a record in the data flow, with every tag and attribute value stored in a separate field. This model can be set up quickly with little effort, works with any kind of XML document, and the elements can be adjusted as needed. However, it does not provide the same level of support for XML Schema, and in many cases it ignores parts of the document, which may result in incomplete data.
  • Full XML parse processing model - when reading the XML document, the ETL component knows the exact document structure and parses it accordingly. This approach requires additional effort in the preliminary setup phase, but it increases performance and ensures the proper mapping of all values and nullable fields. Another payoff of that investment is that the process is far more error proof, which reduces the work needed in downstream components.
    The full object model representation is more sophisticated and is the most powerful and useful choice when working with complex XML schemas. A minimal sketch contrasting the two models is shown below.
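The following sketch illustrates the two models using only Python's standard library. The sample order document, its tags and the field names are hypothetical examples introduced here for illustration; a real ETL tool would expose these models through its own XML components.

import io
import xml.etree.ElementTree as ET

# Hypothetical purchase-order fragment; the tags and attributes are illustrative only.
SAMPLE = """<order id="42">
  <customer>ACME Corp</customer>
  <item sku="P-100" qty="3"/>
  <item sku="P-200" qty="1"/>
</order>"""

# Event model: treat every parsed element as an event and emit one flat record
# per element, with the tag, text and attribute values stored in separate fields.
def event_records(xml_text):
    for _, elem in ET.iterparse(io.BytesIO(xml_text.encode("utf-8")), events=("end",)):
        yield {"tag": elem.tag, "text": (elem.text or "").strip(), **elem.attrib}

# Full parse model: read the whole document into an object tree whose structure
# is known up front, so every value (including nullable fields) is mapped explicitly.
def full_parse(xml_text):
    root = ET.fromstring(xml_text)
    return {
        "order_id": root.get("id"),
        "customer": root.findtext("customer"),
        "items": [(i.get("sku"), int(i.get("qty"))) for i in root.findall("item")],
    }

print(list(event_records(SAMPLE)))   # generic records, one per element
print(full_parse(SAMPLE))            # one structured record per document

The event variant needs no knowledge of the schema and produces generic records, while the full parse variant maps the known structure into explicitly named fields.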

Business Scenario

A hypothetical manufacturing company uses two operational systems:

  • an OLTP system which allows management of the operational data, including invoices, orders, customers, products, etc. The data is entered manually by customer service representatives and other employees.
  • a web-based application where customers can place purchase orders online. There is no direct integration between the two systems: when an internet user places a new order, the order details are stored in an XML file on the web hosting server. A separate XML file is created for each new order, containing the data from the internet form.

These purchase orders are stored in a designated directory on the web server and the whole collection is processed on a daily basis (overnight).

The aim is to load the purchase order data from the multiple XML files into an operational table holding the company's orders.

Proposed solution

In rough outline, the ETL process involves the following steps (a sketch of the whole flow follows the list):

  1. Move the XML source files from the web server to the ETL server (to the unprocessed_files folder).
  2. Process each of the XML files individually:
          a. Parse the XML file using the event XML approach
          b. Pass additional data through to the processing flow, such as the file name, which contains a date and timestamp
          c. Validate the records (for example, check that the customer exists in the database and that the data is correct)
          d. Assign an order number to each purchase order
          e. Feed the OLTP open orders table
          f. Feed the OLTP loading history table with a corresponding status
  3. Once an XML file is processed correctly, move it to the processed_files folder.
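Below is a minimal Python sketch of this flow, assuming files have already been moved into the unprocessed_files folder (step 1). The XML tag names (e.g. customer_id), the database helpers (customer_exists, next_order_number, insert_open_order, insert_load_history) and the status values are hypothetical placeholders; a real ETL job would replace them with calls to the OLTP system.

import shutil
import xml.etree.ElementTree as ET
from pathlib import Path

UNPROCESSED = Path("unprocessed_files")   # files landed here from the web server (step 1)
PROCESSED = Path("processed_files")

# Hypothetical database helpers - stand-ins for real OLTP access.
def customer_exists(customer_id): return customer_id is not None
def next_order_number(): return 1                                      # e.g. a database sequence
def insert_open_order(order): print("open order:", order)              # step 2e
def insert_load_history(file_name, status): print(file_name, status)   # step 2f

def parse_order(path):
    """Steps 2a-2b: parse one purchase-order file with the event approach and
    pass the file name (which carries the date and timestamp) through as a field."""
    record = {"source_file": path.name}
    for _, elem in ET.iterparse(path, events=("end",)):
        record[elem.tag] = (elem.text or "").strip()
    return record

def is_valid(order):
    """Step 2c: validate the record, e.g. check that the customer exists."""
    return customer_exists(order.get("customer_id"))

def run():
    PROCESSED.mkdir(exist_ok=True)
    for xml_file in sorted(UNPROCESSED.glob("*.xml")):
        order = parse_order(xml_file)
        if not is_valid(order):
            insert_load_history(xml_file.name, "REJECTED")
            continue
        order["order_number"] = next_order_number()    # step 2d
        insert_open_order(order)
        insert_load_history(xml_file.name, "LOADED")
        # Step 3: the file was processed correctly, so move it out of the input folder.
        shutil.move(str(xml_file), str(PROCESSED / xml_file.name))

if __name__ == "__main__":
    run()

Rejected files stay in unprocessed_files with a REJECTED history record, so they can be inspected and reprocessed without blocking the rest of the nightly batch.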


