Kafka - what it is?

Kafka is a streaming (queue-like) platform which is great for building real-time streaming data pipelines and applications. In contrary to traditional ETL Tools it lets process records as they occur and each record contains a key, a value and a timestamp. It is easily scalable, fault tolerant and designed to process huge amounts of data.

Basic terms and concepts related to Kafka

Kafka Broker = kafka server, kafka instance
Partition is a registry, an ordered sequence of messages. The historical data can not change, new messages are added to a partition and old messages are deleted. One partition must fit into a server, so usually they are split into smaller pieces. Kafka guarantees good performance and stability until up to 10000 partitions.
Commit log = ordered sequence of records,
Message = producer sends messages to kafka and producer reads messages in the streaming mode
Topic = messages are grouped into topics. Business examples of topics might be account, customer, product, order, sale, etc.
Producer - there might be many message producers for each topic
Consumer - all messages are read by each consumer which is listening to a topic

More detailed information about Kafka and architecture concepts can be found on www.confluent.io.

Kafka POC setup tutorial

This tutorial covers a step by step guide on how to set up and start using Kafka for a test POC case scenario in five steps. There's a good documentation on apache kafka website and thousands of online sites elaborating on the details, the focus for this tutorial is to keep it as simple as possible and get it running.

Windows machine (a laptop)
We'll set up a kafka 10.2.1 cluster with two brokers (servers)
Zookeeper service will be set up to coordinate the brokers and to make sure the state of all the kafka is equal

Kafka POC setup - step by step tutorial

1. Download Kafka

The easiest way to obtain Kafka is to download it from kafka.apache.org/downloads. Select the latest stable binary release.

Download Kafka from apache website:

2. Install

Install kafka by unpacking the tgz file to the local folder (c:\kafka for example)

3. Kafka configuration

Perform Basic Kafka configuration by editing files in config subfolder

Make two copies of config\server.properties file (server1.properties and server2.properties)
Edit the following lines in server1.properties and server2.properties:

#set 1 and 2 for each server respectively broker.id=1 # uncomment and set 9092 for server 1 and 9093 for server2 listeners=PLAINTEXT://:9092 # use separate folders for each servers, so subfolder 1 and 2 is defined log.dirs=C:/tmp/kafka-logs/1 #number of partitions per topic, let's have two num.partitions=2
Edit zookeeper.properties
#set correct dataDir folder dataDir=C:/data/zookeeper

4. Run Zookeeper and Kafka cluster

Start zookeeper:

start bin\windows\zookeeper-server-start.bat config\zookeeper.properties

Note that if the following Java error pops up (Error: missing `server' JVM at ...\bin\server\jvm.dll) you need to create a server folder in jre\bin and copy all files from the client folder.

Kafka Java Error Missing JVM:

Start two kafka servers (in order to avoid runtime errors it's good to wait a few seconds until first server is started and then start the second one.

start bin\windows\kafka-server-start.bat config\server1.properties start bin\windows\kafka-server-start.bat config\server2.properties

Verify that Zookeeper and Kafka servers are running. There should be three shell windows open in this example.

Kafka and Zookeeper running:

5. Test

At this point zookeeper and kafka is running and we should be able to perform some tests.

First let's create sample topics using kafka-topics.bat script (for example customer, product, order) on two partitions and with replication factor = 2 (if one server fails, we'll not lose any messages):

bin\windows\kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 2 --partitions 2 --topic customer_topic bin\windows\kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 2 --partitions 2 --topic product_topic

Creation of topics should be visible in the console output and also it's interesting to check the log.dirs folder configured earlier.

Let's start the console producer and consumer:

bin\windows\kafka-console-producer.bat --topic customer_topic --broker-list localhost:9092,localhost:9093 bin\windows\kafka-console-consumer.bat --bootstrap-server localhost:9092 --from-beginning --topic customer_topic

Try to write some text in the producer window, if it shows up visible in the consumer shell window, the set up was successful. Now it's good time to do some testing and play around with the environment (f.ex. shutdown one of the kafka servers, see that happens, run a consumer without --from-beginning parameter etc. )

Kafka Console Producer and Consumer sample output:

You can also use the following command to describe the kafka topic status:

kafka-topics.bat --describe --zookeeper localhost:2181 --topic customer_topic

Back to the Data Warehousing tutorial home