Reading a Kafka queue with ETL Tools

The following article describes a real-life use of Kafka streaming and how it can be integrated with ETL tools without the need to write code.


Modern real-time ETL with Kafka - Architecture

The data is delivered from the source system directly to Kafka, processed in a real-time fashion and consumed (loaded into the data warehouse) by an ETL tool.
The important difference between the streaming approach and a traditional ETL process is that all the components are constantly running (active), meaning that the flow is not triggered from a schedule.

The source systems are databases, CSV files, logs and CDC feeds which produce Kafka messages (so they are active and push data, rather than just having data available for fetching).
The data is processed with real-time ETL, so the delay between the time a row appears in the source and the time it is loaded into the data warehouse must be minimal.
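To illustrate an active source, here is a minimal sketch of a producer built with the standard Apache Kafka Java client that pushes a JSON change record into a topic as soon as the change is captured. The broker address, topic name and message payload are assumptions made for the example.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ChangeRecordProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        // Hypothetical message: a header block followed by the actual data
        String json = "{\"header\":{\"source\":\"crm\",\"event_time\":\"2021-06-01T10:00:00Z\"},"
                    + "\"data\":{\"customer_id\":42,\"status\":\"NEW\"}}";

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The source pushes the record the moment it is captured,
            // instead of waiting for a scheduled extract to pick it up.
            producer.send(new ProducerRecord<>("customer_changes", "42", json));
            producer.flush();
        }
    }
}
```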

The data is loaded into the data warehouse in an incremental way: only delta records are captured, the history doesn't change, and inserts or upserts are performed.
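As a sketch of the incremental load step, the JDBC snippet below upserts a single delta record using PostgreSQL-style INSERT ... ON CONFLICT syntax. The connection string, table and column names are assumptions; a columnar target such as Redshift would typically use a staging table and a merge instead.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class UpsertDeltaRecord {
    public static void main(String[] args) throws Exception {
        // Assumed warehouse connection and target table
        String url = "jdbc:postgresql://dwhost:5432/dw";
        String sql = "INSERT INTO customer_dim (customer_id, status, load_ts) "
                   + "VALUES (?, ?, now()) "
                   + "ON CONFLICT (customer_id) "
                   + "DO UPDATE SET status = EXCLUDED.status, load_ts = now()";

        try (Connection conn = DriverManager.getConnection(url, "etl_user", "etl_password");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, 42L);       // delta record captured from the stream
            ps.setString(2, "NEW");
            ps.executeUpdate();       // insert if the key is new, update otherwise
        }
    }
}
```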

Calculations (aggregations, groupings) are done before writing the data to a database. The data is stored in a columnar database or a NoSQL database in an ordered manner (Redshift, Cassandra or Couchbase, for example). Hadoop HDFS is an alternative target.
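A minimal sketch of such pre-aggregation, assuming each consumed record carries a customer key: counts per key are accumulated in memory for one micro-batch, and only the aggregated rows are then written to the ordered target.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PreAggregateBatch {
    // Accumulate a count per key so the target stores aggregated rows
    // rather than raw events.
    static Map<String, Long> countByKey(List<String> keysInBatch) {
        Map<String, Long> counts = new HashMap<>();
        for (String key : keysInBatch) {
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical keys collected from one poll of the stream
        List<String> batch = List.of("42", "42", "7", "42", "7");
        System.out.println(countByKey(batch));   // e.g. {42=3, 7=2}
    }
}
```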

The messages are delivered in JSON format (the JSON structure differs across topics, but each message contains a header followed by the actual data). In many cases a JSON message contains hierarchical information, so it needs to be flattened in order to be stored in a relational database.
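A minimal flattening sketch using the Jackson library: nested object fields are collapsed into underscore-separated column names so the message can be mapped onto a relational row. The message shape, the library choice and the naming convention are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonFlattener {
    // Recursively collapse nested objects into "parent_child" column names
    static Map<String, String> flatten(JsonNode node, String prefix) {
        Map<String, String> columns = new LinkedHashMap<>();
        node.fields().forEachRemaining(field -> {
            String name = prefix.isEmpty() ? field.getKey() : prefix + "_" + field.getKey();
            if (field.getValue().isObject()) {
                columns.putAll(flatten(field.getValue(), name));
            } else {
                columns.put(name, field.getValue().asText());
            }
        });
        return columns;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical message with a header and hierarchical data
        String message = "{\"header\":{\"source\":\"crm\"},"
                       + "\"data\":{\"customer\":{\"id\":42,\"name\":\"Acme\"}}}";
        JsonNode root = new ObjectMapper().readTree(message);
        System.out.println(flatten(root, ""));
        // {header_source=crm, data_customer_id=42, data_customer_name=Acme}
    }
}
```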



Sample Kafka ETL Data Warehouse architecture (diagram)

ETL tools capable of reading Kafka

Most ETL software does not offer an option to read from or write to a Kafka stream in an easy, reliable and solid way, with a few exceptions, especially where open source tools are concerned:

  • Talend Open Studio (open source) - tKafkaConnection, tKafkaCreateTopic, tKafkaInput, tKafkaOutput, tKafkaCommit components
  • Pentaho Data Integration (open source) - using Apache Kafka Producer and Apache Kafka Consumer
  • IBM Information Server (Datastage) Kafka connector
  • Informatica BDM (Big Data Management) - Intelligent Streaming option, runs in Informatica Developer tool (not Designer)

For most traditional tools which don't have a native Kafka stream connector, custom Java coding is also an option.
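As a rough illustration of that option, the sketch below uses the standard Apache Kafka Java consumer in a continuously running poll loop, which is also what makes the streaming approach always on rather than schedule-driven. The broker address, consumer group and topic name are assumptions, and the print statement stands in for the actual warehouse load.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class StreamToWarehouseConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");    // assumed broker address
        props.put("group.id", "etl-warehouse-loader");        // assumed consumer group
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("customer_changes"));
            // Constantly running loop: the consumer polls for new messages
            // instead of being triggered from a schedule.
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Flatten, transform and load into the warehouse here
                    System.out.printf("offset=%d key=%s value=%s%n",
                                      record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```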

Solutions and sample implementations

  • Kafka set up in 15 minutes - how to install and configure Kafka in 5 steps from scratch in 15 minutes
  • Kafka Pentaho ETL Example - how to use Pentaho Data Integration ETL components to read Kafka stream
