A guide to event streaming with Apache Kafka

Mar 7, 2024
Dmitry Chuyko

Cloud-native development unlocks immense opportunities for application scalability and data processing. Modern technologies can process vast amounts of varied data in real time. At the same time, millions of users can interact with a network of services, connected to other services and databases on their end, and receive instant feedback. But the greater the prospects, the greater the challenges. Companies today can’t do without software that helps coordinate and manage massive data flows. One of the most common solutions for this purpose is Apache Kafka.

In this article, we discuss the aspects of data processing in a modern IT environment and explore the Apache Kafka architecture. In the next article, we’ll see how to install Kafka on a local machine.

What is Apache Kafka

Event handling in modern applications

An event is an action or occurrence, usually happening asynchronously, that is recognized and can be handled by software. Events come from the user (mouse, keyboard, or touchscreen input, etc.) or from the system (program or system errors, motion detected by a sensor, messages sent by other programs, etc.). An application is called event-driven if it changes its behavior in response to events, as opposed to a data-driven program, whose aim is to process the incoming data and output the result.

Events in computing are always associated with data, so such concepts as event processing, event streaming, event messaging, etc., describe how the system handles and reacts to the incoming data.

Events are an essential aspect of reactive programming, serverless computing, and microservices. Microservices are a preferable architectural approach to cloud-native development as they are more resource-efficient, resilient, and scalable than monoliths. These services are loosely coupled and communicate with each other when events occur. Therefore, understanding event-handling approaches is crucial for developers working with cloud-native programs.

Event messaging

Event streaming and event messaging are two concepts sometimes used interchangeably, although they have several significant differences.

In the event messaging (or publisher-subscriber, pub-sub) model, an application responds to each message it receives. Each message is discrete and can be read and understood in isolation. In addition, the sender (publisher) and the receiver (subscriber) are known to each other, and messages are sent to a specific address and deleted upon consumption. Because messages get deleted, new subscribers cannot access the message history.

A commonly used message-handling technique in this approach is message queuing. Messages sent by a publisher are stored in a queue following the FIFO (first in, first out) principle. Consumers listen to the queue for messages. The first consumer to pick up a message deals with it, and after that, the message is deleted. Each message is processed only once, by a single consumer.
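
To make these queue semantics concrete, here is a minimal sketch in plain Java (using the standard java.util.concurrent classes rather than any particular broker, so everything here is illustrative only): messages are handled in FIFO order, each message is picked up by exactly one of the competing consumers, and it disappears from the queue once consumed.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class MessageQueueDemo {
    public static void main(String[] args) throws InterruptedException {
        // FIFO queue: messages are read in the order they were published.
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        // The publisher sends three discrete messages.
        queue.put("order-created");
        queue.put("order-paid");
        queue.put("order-shipped");

        // Two competing consumers: whichever polls a message first handles it,
        // and the message is removed from the queue, so it is processed exactly once.
        Runnable consumer = () -> {
            String msg;
            try {
                while ((msg = queue.poll(1, TimeUnit.SECONDS)) != null) {
                    System.out.println(Thread.currentThread().getName() + " handled " + msg);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        Thread c1 = new Thread(consumer, "consumer-1");
        Thread c2 = new Thread(consumer, "consumer-2");
        c1.start();
        c2.start();
        c1.join();
        c2.join();
    }
}
```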

Event streaming

In event streaming, individual messages are unimportant. They are aggregated and analyzed to discover a pattern that can be acted upon. The receiver is also unknown to the sender. The sender streams messages, and interested receivers can subscribe to them. Furthermore, receivers don’t delete the messages, and new receivers have access to a history of messages and can synchronize with the current state by consuming the whole stream from the beginning. The technology that helps to analyze event streams is called “Complex Event Processing” (CEP), a set of techniques for capturing and analyzing data streams as they arrive to identify opportunities or threats in real-time. CEP enables systems and applications to respond to events, trends, and patterns in the data as they happen.

Event streaming vs event messaging

To sum up, each of the two approaches is suited to specific purposes, and they can be used in a complementary fashion depending on the result you want to achieve.

About Apache Kafka

Apache Kafka is an open-source event streaming platform initially created by LinkedIn and currently developed under the auspices of the Apache Software Foundation. It is:

  • Highly scalable (tailored to horizontal scaling),
  • Fault-tolerant thanks to efficient replication mechanisms, and
  • Deployable on bare metal, VMs, and containers, on premises or in the cloud.

Kafka’s key difference from similar solutions such as RabbitMQ is that events are not deleted right after being read by consumers. Instead, they can be stored as long as required (days, months, years, or forever) and processed multiple times, which is useful for failure recovery or for verifying the code of new consumers.

In addition, Kafka is a powerful solution for real-time data processing and establishing communication between microservices.

Apache Kafka architecture

Let’s briefly discuss the structure of Kafka’s cluster and how events are managed within the system.

The key components of the cluster are producers, consumers, and brokers.

  • A producer is a client application that publishes (i.e., sends) events to a broker. An event consists of a key-value pair, a timestamp, and additional metadata.
  • A broker (also known as a Kafka server or a Kafka node) is responsible for storing the data and acts as a bridge between producers that publish events and consumers that read them. Upon receiving an event from a producer, a broker stores it in a dedicated topic. Each Kafka broker can host several topics, and producers and consumers can write to and read from several brokers.
  • A consumer is a client application that reads events from the topics it is subscribed to using the pull model, meaning that it sends requests to the broker to receive new batches of events. A minimal producer example follows this list.
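
To illustrate the producer side, here is a minimal sketch based on the kafka-clients Java API. The broker address (localhost:9092) and the orders topic are assumptions made for the example; the value is a plain string, whereas in practice you would typically plug in JSON or Avro serializers.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Address of one (or more) brokers used to bootstrap the connection to the cluster.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // An event is essentially a key-value pair; Kafka adds the timestamp and metadata.
            ProducerRecord<String, String> event =
                    new ProducerRecord<>("orders", "order-42", "{\"status\":\"created\"}");
            producer.send(event);
            producer.flush();
        }
    }
}
```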

Apache Kafka cluster architecture

But what if a broker fails? Or a consumer? How exactly is Kafka’s fault tolerance guaranteed? And besides, if events can be stored for a prolonged period of time, how do we know that a consumer doesn’t read the same events over and over again?

To answer these questions, we have to dive deeper into the system structure. 

Kafka internals: topics, partitions, and consumer groups 

The events in Kafka are not stored in topics randomly. Each topic is divided into partitions, which are replicated logs stored on disk. Events are appended to a partition according to their keys, which guarantees the read/write order within the partition. Besides, each event receives a unique offset number, so in case of a restart, a consumer can continue from the last offset it registered.
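
As a sketch of how offsets can be used, the hypothetical consumer below attaches itself to a single partition of the assumed orders topic and resumes reading from an offset it is presumed to have registered earlier (100 is an arbitrary example value).

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OffsetReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "offset-demo");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Attach to partition 0 of the hypothetical "orders" topic and resume
            // from offset 100, e.g. the last offset registered before a restart.
            TopicPartition partition = new TopicPartition("orders", 0);
            consumer.assign(List.of(partition));
            consumer.seek(partition, 100L);

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}
```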

Partitions are replicated across multiple brokers. The replication factor can be adjusted: for instance, if it equals 3, there are three copies of each partition on different brokers. Therefore, even if a broker fails, consumers can continue reading events from the relevant partitions on other brokers.
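
For illustration, a topic with a replication factor of 3 could be created programmatically with the Kafka AdminClient, as in the hypothetical sketch below. The topic name and broker address are assumptions, and a replication factor of 3 requires a cluster of at least three brokers.

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // A hypothetical "orders" topic with 3 partitions, each kept in 3 copies
            // spread across different brokers.
            NewTopic topic = new NewTopic("orders", 3, (short) 3);
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```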

Consumers do not stand alone: they belong to consumer groups. A consumer group can have one consumer or several. Ideally, the number of consumers equals the number of partitions in a topic so that the load is evenly distributed among them. If one consumer fails, the load is automatically redistributed among the remaining consumers. On the other hand, if the consumers can’t keep up with the load, you can add another consumer to the group, but you also need to add one more partition to the topic, because a new consumer can only take part in event processing if there is a partition to assign to it.
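
A consumer joins a group simply by setting the group.id property. The sketch below is a minimal, hypothetical example: every instance started with the same group ID shares the partitions of the assumed orders topic, and if one instance stops, its partitions are rebalanced to the remaining ones.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderGroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // All instances started with the same group.id share the topic's partitions;
        // if one instance dies, its partitions are reassigned to the survivors.
        props.put("group.id", "order-processors");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                // Pull model: the consumer repeatedly asks the broker for new batches of events.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```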

Apache Kafka partitions and consumer groups

It is important to note that right now, Kafka relies on ZooKeeper, an open-source coordination server by the Apache Software Foundation, to store metadata about partitions and brokers. ZooKeeper is set up independently of the Kafka nodes and represents a distributed data storage system that has to be managed separately, adding complexity to platform management. What is worse, if ZooKeeper fails, the data on current processes within the cluster will be lost, making it extremely challenging to restore the workflow.

This is one of the reasons why the Kafka team decided to replace ZooKeeper with a more robust, modern solution called KRaft. In short, KRaft stores metadata as topics within the Kafka cluster itself, making metadata processing faster and more scalable. In addition, metadata topics are replicated among brokers, which increases fault tolerance. You can read more about KRaft in the dedicated Kafka Improvement Proposal, KIP-500.

ZooKeeper was officially deprecated in Kafka 3.5 and is set for removal in version 4.0, but migration from ZooKeeper to KRaft is still in its early stages, as the migration feature is not yet production-ready.

Conclusion

Apache Kafka is a complex system, and understanding its key concepts is essential for setting up effective communication between the services. In the next article, we’ll dive into setting up Kafka on a local machine, and even more tutorials are on the way, so subscribe to our newsletter and don’t miss them!
