Kafka Introduction and Architecture

Kafka is an open-source distributed stream-processing and message-broker platform, written in Java and Scala, developed by the Apache Software Foundation.

Kafka is used massively in enterprise infrastructure to process streaming data and transaction logs in real time. It provides a unified, fault-tolerant, high-throughput, low-latency platform for handling real-time data feeds.

Important Points about Kafka:

  • Publish/subscribe messaging system.
  • Robust queue able to handle high volumes of data.
  • Works for both online and offline message consumption.
  • In a Kafka cluster, each server/node acts as a broker; each broker manages published records and may hold zero or more partitions per topic.
  • Each record in a partition consists of a key, a value, and a timestamp (see the sketch after this list).
  • Kafka uses the TCP protocol for communication between clients and servers.
  • Kafka provides Producer, Consumer, Streams, and Connector Java APIs to publish to and consume from topics.
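For illustration, here is a minimal sketch of a record's key, value, and timestamp using the Java Producer API (the topic name T1 and the sample key/value are placeholders, not from any real application):

  import org.apache.kafka.clients.producer.ProducerRecord;

  public class RecordExample {
      public static void main(String[] args) {
          // A record carries: topic, optional partition, optional timestamp, key, and value.
          ProducerRecord<String, String> record =
                  new ProducerRecord<>("T1", null, System.currentTimeMillis(), "user-42", "page-view");
          System.out.println(record.topic() + " | " + record.key() + " | "
                  + record.value() + " | " + record.timestamp());
      }
  }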

Initial Release: January 2011

Current Release: 0.10.2.0

Kafka Cluster Architecture

Before discussing the Kafka cluster architecture, let's introduce the Kafka terminology that will make the architecture and flow easier to understand.

Broker

A broker is a stateless instance of a Kafka server in the cluster. Each broker is identified by a unique id. A Kafka cluster can have multiple broker instances, and each broker can handle hundreds of thousands of read and write requests, i.e. terabytes of messages per second, without any performance impact.

Zookeeper

A Kafka cluster uses ZooKeeper to manage and coordinate brokers. Producers and consumers are notified when a new broker is added to the cluster or an existing one fails, so they can decide which available broker to point to.

Topic

A topic is a category that keeps the stream of records published to it. A topic can have zero, one, or many consumers reading its data. We can create topics from an application or manually (a minimal programmatic sketch follows below).

A topic's data is stored in partitions and distributed over servers based on the number of partitions configured per topic and the brokers available.
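Manually, topics are usually created with the kafka-topics.sh script that ships with Kafka. Programmatically, the AdminClient API (added in Kafka 0.11, slightly newer than the release noted above) can be used; a minimal sketch, assuming a broker at localhost:9092:

  import java.util.Collections;
  import java.util.Properties;
  import org.apache.kafka.clients.admin.AdminClient;
  import org.apache.kafka.clients.admin.NewTopic;

  public class CreateTopicExample {
      public static void main(String[] args) throws Exception {
          Properties props = new Properties();
          props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
          try (AdminClient admin = AdminClient.create(props)) {
              // Topic "T1" with 3 partitions and a replication factor of 1.
              NewTopic topic = new NewTopic("T1", 3, (short) 1);
              admin.createTopics(Collections.singletonList(topic)).all().get();
          }
      }
  }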

Partition

A partition stores records in sequential order and is continually appended to. Each record in a partition has a sequential id number called the offset. An individual log partition allows records to scale up to the capacity of a single server.

How is a topic partitioned across brokers/servers/nodes?

Suppose we need to create a topic with N partitions on a Kafka cluster that has M brokers.

If (M == N): each broker will have one partition.

If (M > N): the first N available brokers will take one partition each.

If (M < N): some brokers will have more than one partition. A simplified round-robin spread is sketched below.
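The following is an illustration only of this round-robin spreading; Kafka's real assignment also considers replica placement and rack awareness:

  public class PartitionSpread {
      public static void main(String[] args) {
          int m = 3; // brokers
          int n = 5; // partitions (M < N, so some brokers get more than one)
          for (int p = 0; p < n; p++) {
              // Round-robin: partition p goes to broker (p mod M) + 1
              System.out.printf("partition %d -> broker %d%n", p, (p % m) + 1);
          }
      }
  }

Running it assigns partitions 0-4 to brokers 1, 2, 3, 1, 2, so brokers 1 and 2 each end up holding two partitions.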

The Kafka cluster retains these partitions according to the retention policy configured in the server.properties file, whether they have been consumed or not. By default the retention period is seven days, and it can be adjusted to match our storage capacity. Kafka's performance is not affected by data size, because it reads and writes data based on offset values.
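For reference, the relevant retention settings in server.properties look like this (the values shown are the usual defaults; adjust them to your storage capacity):

  # Delete log segments older than 7 days (168 hours)
  log.retention.hours=168
  # Optionally cap retention by size per partition (-1 = unlimited)
  log.retention.bytes=-1
  # How often segments are checked for deletion (5 minutes)
  log.retention.check.interval.ms=300000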

Kafka Cluster Architecture with Multiple/Distributed Servers

Details of the above Kafka cluster with multiple/distributed servers:

Kafka Cluster: has three servers, and each server runs a corresponding broker with id 1, 2, or 3.

Zookeeper: runs alongside the Kafka cluster, keeps track of broker availability, and updates producers and consumers.

Brokers: brokers 1, 2, and 3 hold the topics T1, T2, and T3, stored in partitions.

Topics: topics T1 and T2 have three partitions each, distributed over servers 1, 2, and 3, while topic T3 has a single partition stored on server 3 only.

Partition: each partition of a topic holds a different number of records, from offset 0 up to some value, where offset 0 is the oldest record.

Producers: APP1, APP2, and APP3 write to the topics T1, T2, and T3, which were created by applications or manually.

Consumers: topic T3 is consumed by applications APP5 and APP6, topic T1 is consumed by APP4, and T2 is consumed by APP5 only. One topic can be consumed by multiple apps.

How does the Kafka cluster flow work for producers and consumers?

I will divide the above architecture into two parts, “Producer to Kafka Cluster” and “Kafka Cluster to Consumer”, because producers and consumers run in parallel, independently of each other.

Producer to Kafka Cluster

  • Create the topic manually or from an application, with a configuration for partitions and replicas.
  • The producer connects to the Kafka cluster with a topic name. The Kafka cluster checks ZooKeeper for an available broker and sends the broker id back to the producer.
  • The producer publishes messages to the available broker, which stores them in sequential order within a partition. If anything changes in the Kafka cluster, such as a server being added or failing, ZooKeeper notifies the producer.
  • If replication is configured for the topic, a copy of each partition is kept on another server for fault tolerance. (A minimal producer sketch follows this list.)
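A minimal producer sketch with the Java API; the broker address, topic name, and message contents here are placeholders:

  import java.util.Properties;
  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.Producer;
  import org.apache.kafka.clients.producer.ProducerRecord;

  public class SimpleProducer {
      public static void main(String[] args) {
          Properties props = new Properties();
          props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
          props.put("acks", "all"); // wait for replicas to acknowledge, for durability
          props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
          props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

          try (Producer<String, String> producer = new KafkaProducer<>(props)) {
              // Records with the same key always land in the same partition.
              producer.send(new ProducerRecord<>("T1", "key-1", "hello kafka"));
          }
      }
  }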

Kafka Cluster to Consumer

  • The consumer points to a topic on the Kafka cluster, as required by the application.
  • The consumer subscribes to records from the topic based on the required offset position (e.g. the beginning, now, or the last committed offset).
  • If the consumer wants records from now on, ZooKeeper sends the offset value to the consumer so it can start reading records from the broker's partitions.
  • If the required offset does not exist in the broker partition the consumer is reading from, ZooKeeper returns an available broker id with partition details to the consumer.
  • If a broker goes down while the consumer is reading records from it, ZooKeeper sends an available broker id with partition details to the consumer. (A minimal consumer sketch follows this list.)
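A minimal consumer sketch; auto.offset.reset controls where a new consumer group starts when no committed offset exists ("earliest" = from the beginning, "latest" = from now). The broker address, group id, and topic name are placeholders:

  import java.util.Collections;
  import java.util.Properties;
  import org.apache.kafka.clients.consumer.ConsumerRecord;
  import org.apache.kafka.clients.consumer.ConsumerRecords;
  import org.apache.kafka.clients.consumer.KafkaConsumer;

  public class SimpleConsumer {
      public static void main(String[] args) {
          Properties props = new Properties();
          props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
          props.put("group.id", "app4-group");              // assumed consumer group
          props.put("auto.offset.reset", "earliest");       // use "latest" to read from now on
          props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
          props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

          try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
              consumer.subscribe(Collections.singletonList("T1"));
              while (true) {
                  ConsumerRecords<String, String> records = consumer.poll(100);
                  for (ConsumerRecord<String, String> record : records) {
                      System.out.printf("offset=%d key=%s value=%s%n",
                              record.offset(), record.key(), record.value());
                  }
              }
          }
      }
  }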

Kafka Cluster with Single Server: all the partitions configured for a topic are created on that single server.

Kafka Cluster with Multiple/Distributed Servers: topic partition logs are distributed over all the servers in the Kafka cluster, and each server handles data and requests for its share of partitions. If replication is configured, the servers keep the configured number of copies of each partition log, distributed across servers for fault tolerance.

How does a Kafka cluster load-balance over multiple/distributed servers?

Each topic partition log has one server/broker acting as the “leader”, while the others are followers (in a multi-server/distributed setup). The leader handles all read and write requests from producers and consumers, while the followers replicate the leader's partition. If the leader fails or its machine goes down, one of the followers becomes the new leader and the remaining servers stay followers. For more detail, go to Kafka Cluster with multi server on same machine.
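To see which broker leads each partition, the AdminClient API (Kafka 0.11+) can describe a topic; a sketch, with T1 and the broker address as placeholders:

  import java.util.Collections;
  import java.util.Properties;
  import org.apache.kafka.clients.admin.AdminClient;
  import org.apache.kafka.clients.admin.TopicDescription;
  import org.apache.kafka.common.TopicPartitionInfo;

  public class DescribeTopicExample {
      public static void main(String[] args) throws Exception {
          Properties props = new Properties();
          props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
          try (AdminClient admin = AdminClient.create(props)) {
              TopicDescription desc = admin.describeTopics(Collections.singletonList("T1"))
                      .all().get().get("T1");
              for (TopicPartitionInfo p : desc.partitions()) {
                  // leader serves reads/writes; isr = in-sync replicas (followers that are up to date)
                  System.out.printf("partition %d: leader=%s replicas=%s isr=%s%n",
                          p.partition(), p.leader(), p.replicas(), p.isr());
              }
          }
      }
  }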

Read More on Kafka

Integration

Integrate Filebeat, Kafka, Logstash, Elasticsearch and Kibana