Kafka - what is it?

Kafka is a streaming (queue-like) platform which is great for building real-time streaming data pipelines and applications. In contrast to traditional ETL tools, it lets you process records as they occur; each record contains a key, a value and a timestamp. It is easily scalable, fault tolerant and designed to process huge amounts of data.

Basic terms and concepts related to Kafka

More detailed information about Kafka and its architecture concepts can be found on www.confluent.io.

Kafka POC setup tutorial

This tutorial covers a step-by-step guide on how to set up and start using Kafka for a test POC scenario in five steps. There is good documentation on the Apache Kafka website and thousands of online sites elaborating on the details; the focus of this tutorial is to keep it as simple as possible and get it running.

Kafka POC setup - step by step tutorial

1. Download Kafka

The easiest way to obtain Kafka is to download it from kafka.apache.org/downloads. Select the latest stable binary release.

[Image: Download Kafka from the Apache website]

2. Install

Install Kafka by unpacking the tgz file to a local folder (c:\kafka for example).
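On Windows 10 and later, the built-in tar command can unpack the archive from a command prompt. A minimal sketch - the archive name below is an example, substitute the version you actually downloaded:

```shell
rem Unpack the Kafka binary release (archive name is an example)
rem This creates a versioned subfolder which serves as the Kafka home
mkdir c:\kafka
tar -xzf kafka_2.13-2.8.0.tgz -C c:\kafka
```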

3. Kafka configuration

Perform basic Kafka configuration by editing the files in the config subfolder.
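For the two-broker setup used in the next step, a common approach is to copy config\server.properties to config\server1.properties and config\server2.properties and give each broker a unique id, port and log directory. A minimal sketch - the ports and paths below are example values:

```properties
# config\server1.properties (example values)
broker.id=1
listeners=PLAINTEXT://localhost:9092
log.dirs=c:/kafka/logs/broker1

# config\server2.properties (example values)
broker.id=2
listeners=PLAINTEXT://localhost:9093
log.dirs=c:/kafka/logs/broker2
```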

4. Run Zookeeper and Kafka cluster

  • Start zookeeper:
    start bin\windows\zookeeper-server-start.bat config\zookeeper.properties

Note that if the following Java error pops up (Error: missing `server' JVM at ...\bin\server\jvm.dll), you need to create a server folder in jre\bin and copy all files from the client folder into it.

    [Image: Kafka Java error - missing server JVM]

  • Start two Kafka servers (in order to avoid runtime errors, it's good to wait a few seconds until the first server has started and then start the second one):
    start bin\windows\kafka-server-start.bat config\server1.properties
    start bin\windows\kafka-server-start.bat config\server2.properties

  • Verify that Zookeeper and Kafka servers are running. There should be three shell windows open in this example.
    [Image: Kafka and Zookeeper running]
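As an additional check (a sketch, assuming Zookeeper listens on the default port 2181), the zookeeper-shell tool shipped with Kafka can list the broker ids registered in Zookeeper:

```shell
rem List the ids of brokers currently registered in Zookeeper;
rem with both servers up this should show the two broker.id values
rem from the server properties files
bin\windows\zookeeper-shell.bat localhost:2181 ls /brokers/ids
```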

5. Test

At this point Zookeeper and Kafka are running and we should be able to perform some tests.

First let's create sample topics using the kafka-topics.bat script (for example customer, product, order), each with two partitions and replication factor = 2 (if one server fails, we won't lose any messages):

bin\windows\kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 2 --partitions 2 --topic customer_topic
bin\windows\kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 2 --partitions 2 --topic product_topic

Creation of the topics should be visible in the console output, and it's also interesting to check the log.dirs folder configured earlier.
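To double-check from the command line (assuming the same Zookeeper address as above), the topics can also be listed:

```shell
rem List all topics registered in Zookeeper;
rem the newly created topics should appear in the output
bin\windows\kafka-topics.bat --list --zookeeper localhost:2181
```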

Let's start the console producer and consumer:

bin\windows\kafka-console-producer.bat --topic customer_topic --broker-list localhost:9092,localhost:9093
bin\windows\kafka-console-consumer.bat --bootstrap-server localhost:9092 --from-beginning --topic customer_topic

Try to write some text in the producer window; if it shows up in the consumer shell window, the setup was successful. Now it's a good time to do some testing and play around with the environment (e.g. shut down one of the Kafka servers and see what happens, run a consumer without the --from-beginning parameter, etc.).
[Image: Kafka console producer and consumer sample output]
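Since each Kafka record carries a key and a value, the console tools can also demonstrate keyed messages. A sketch using the parse.key/print.key properties of the console producer and consumer (the `:` separator is an arbitrary choice):

```shell
rem Producer that parses each "key:value" input line into a keyed record
bin\windows\kafka-console-producer.bat --topic customer_topic --broker-list localhost:9092,localhost:9093 --property parse.key=true --property key.separator=:

rem Consumer that prints the key along with the value
bin\windows\kafka-console-consumer.bat --bootstrap-server localhost:9092 --from-beginning --topic customer_topic --property print.key=true
```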

You can also use the following command to describe the Kafka topic status:
bin\windows\kafka-topics.bat --describe --zookeeper localhost:2181 --topic customer_topic

Back to the Data Warehousing tutorial home