How does Kafka store data on disk?
It is worth understanding how Kafka stores data to better appreciate how the brokers achieve such high throughput. In fact, the way that Kafka stores data is quite simple to understand: Kafka has a data directory on disk where it stores its log and index files. On the broker, Kafka stores the data in /var/lib/kafka/data (this is sometimes referred to as the Kafka data directory). Shown below is one topic:
root@e77c276dd7af:/var/lib/kafka/data# ls -rtld com.kafka*
drwxr-xr-x 2 root root 4096 Jan 4 02:17 com.kafka.producers.purchases-4
drwxr-xr-x 2 root root 4096 Jan 4 02:17 com.kafka.producers.purchases-3
drwxr-xr-x 2 root root 4096 Jan 4 02:17 com.kafka.producers.purchases-2
drwxr-xr-x 2 root root 4096 Jan 4 02:17 com.kafka.producers.purchases-1
drwxr-xr-x 2 root root 4096 Jan 4 02:17 com.kafka.producers.purchases-0
root@e77c276dd7af:/var/lib/kafka/data#
Each one of these directories is actually a partition. For this topic, we specified num_partitions=5, which is why we see five directories.
We can almost think of the topic as a simple abstraction over the actual partitions on disk.
If we look under com.kafka.producers.purchases-0, we see the following files:
root@e77c276dd7af:/var/lib/kafka/data/com.kafka.producers.purchases-0# ls -rtl
total 12
-rw-r--r-- 1 root root       11 Jan 4 02:16 leader-epoch-checkpoint
-rw-r--r-- 1 root root 10485756 Jan 4 02:16 00000000000000009873.timeindex
-rw-r--r-- 1 root root       10 Jan 4 02:16 00000000000000009873.snapshot
-rw-r--r-- 1 root root 10485760 Jan 4 02:16 00000000000000009873.index
-rw-r--r-- 1 root root     2099 Jan 4 13:24 00000000000000009873.log
Each partition receives its own sub-directory, named after the topic plus the partition number. Inside each partition directory, Kafka may store more than one log segment file. Why is that? If Kafka were to use just one file to store all the data for a given topic, it would be limited to the speed of the disk on which that broker is running. To help alleviate that, Kafka splits a topic's data into many partitions, and within each partition into many segment files. You control the number of partitions when you create your topic.
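The offset-based file names in the listing follow a simple convention: each segment is named after the offset of the first record it contains, zero-padded to 20 digits, and the accompanying index files share the same stem. A small Python sketch of that naming scheme (the function name is ours, not Kafka's):

```python
def segment_filenames(base_offset: int) -> list[str]:
    # Kafka names each log segment after the offset of the first
    # record it contains, zero-padded to 20 digits. The .index and
    # .timeindex files for that segment share the same stem.
    stem = f"{base_offset:020d}"
    return [f"{stem}.log", f"{stem}.index", f"{stem}.timeindex"]

# Matches the listing above: the segment starting at offset 9873.
print(segment_filenames(9873))
```

This is why you can tell at a glance which range of offsets a segment holds just from its file name.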
Data Partitions
To achieve high throughput and scalability on topics, Kafka supports partitioning. When a Kafka topic is partitioned, the topic log is split, or partitioned, into multiple files; each of these files represents a partition. If there are multiple Kafka brokers in the cluster, the partitions will typically be distributed evenly amongst the brokers.
Every partition has a single leader broker, elected via ZooKeeper.
So you can see here we have a couple of leader brokers: broker_0 is the leader broker for partition_a, and broker_2 is the leader broker for partition_b. Also, notice that broker_1 is not the leader for any partition; it is simply a replica of partition_a and partition_b. The same goes for broker_2 with respect to partition_a: it is not the leader of partition_a, so it is said to be a replica for partition_a.
When Kafka producers produce messages to a topic, by default each message's key is hashed to identify which partition should receive it. So, suppose we have a producer: when it produces a message, it hashes the key. The first two messages might get hashed to partition_a, and the next incoming message could be hashed to partition_b. Using this hashing approach, messages are distributed evenly amongst the partitions.
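The idea can be sketched in a few lines of Python. Note that Kafka's default partitioner actually applies a murmur2 hash to the key bytes; we substitute CRC32 here purely to keep the sketch dependency-free:

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    # Stand-in for Kafka's default partitioner: hash the message key,
    # then take it modulo the partition count. (Kafka uses murmur2;
    # CRC32 is used here only for illustration.)
    return zlib.crc32(key) % num_partitions

# The same key always lands on the same partition, which is what
# preserves per-key ordering within a partition.
assert choose_partition(b"customer-42", 5) == choose_partition(b"customer-42", 5)
assert 0 <= choose_partition(b"customer-7", 5) < 5
```

Because the mapping is deterministic, all messages with the same key end up in the same partition, in order.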
Why does this matter? That is, why would you want to partition a topic? Partitions are how consumer and producer code achieve parallelism with Kafka. Kafka producers can asynchronously produce messages to many partitions at once from within the same application. Additionally, if the cluster contains more than one broker, more than one broker can receive the data as well, further increasing the speed at which data is ingested. Similarly, consumers of the topic data can take a parallel approach, reading from each partition independently. This increases the speed at which consumers can pull data off the topic.
- Partitions are Kafka’s unit of parallelism.
- Consumers may read from each partition in parallel.
- Producers may write to each partition in parallel.
- Partitions reduce bottlenecks by involving multiple brokers.
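As a sketch of the consumer-side parallelism above, here is a toy round-robin assignment of partitions to the members of a consumer group (Kafka ships several real assignment strategies; this function is illustrative only and not Kafka's API):

```python
def assign_partitions(partitions: list[int],
                      consumers: list[str]) -> dict[str, list[int]]:
    # Round-robin the partitions across the group members so that
    # each consumer reads its own subset of partitions in parallel.
    assignment: dict[str, list[int]] = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment

# Five partitions shared by two consumers in the same group.
print(assign_partitions([0, 1, 2, 3, 4], ["consumer_0", "consumer_1"]))
```

Each partition is consumed by exactly one member of the group, which is what lets the group scale out without processing any message twice.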
Data Replication
Most Kafka users will be interested in making sure their data is resilient to Kafka broker failures. As you’ve likely experienced, machines fail, and often we cannot predict when it’s going to happen, or prevent it for that matter. Thankfully, the Kafka design team had the foresight to include replication as a core feature of Kafka. In Kafka, replication means that the data is written not just to one broker but to many.
So, looking at the above image, we wrote message_a to partition_a on broker_0, but we also write it to broker_1, because broker_1 has a replica of partition_a, and to broker_2, because it also has a partition_a replica.
So, what does that mean? If broker_0 were to fail, broker_1 and broker_2 would still have the data. As I mentioned previously, the broker responsible for sending and receiving data to clients is known as the leader broker for a given topic partition.
Any brokers that store the replicated data are referred to as replicas. So, broker_0 is a replica for partition_b, broker_2 is a replica for partition_a, and broker_1 is a replica for both partition_a and partition_b. If the leader broker were to fail, then one of these replicas would be elected the new leader for that topic partition via a ZooKeeper election.
So, for example, if broker_0 were to fail, we would no longer have a leader for partition_a. In this case, broker_1 and broker_2 would hold an election to decide which of them becomes the leader for partition_a now that broker_0 has failed.
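That failover can be sketched as a toy election. Real Kafka restricts candidates to the in-sync replica set (ISR) and delegates the choice to its controller; the broker names below simply mirror the running example:

```python
def elect_new_leader(failed_leader: str, replicas: list[str]) -> str:
    # Toy election: promote the first surviving replica. Real Kafka
    # only promotes replicas from the in-sync replica set (ISR).
    candidates = [b for b in replicas if b != failed_leader]
    if not candidates:
        raise RuntimeError("no surviving replica; partition is offline")
    return candidates[0]

# broker_0 led partition_a; broker_1 and broker_2 hold replicas.
assert elect_new_leader("broker_0",
                        ["broker_0", "broker_1", "broker_2"]) == "broker_1"
```

The key point is that as long as at least one replica survives, the partition keeps a leader and clients keep working.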
Implications of number of replicas:
The exact number of replicas can be configured globally as a Kafka server configuration item, as well as set individually on every topic that you create. But there are a few things you need to keep in mind:
- First, you can’t have more replicas than you have brokers. For this reason, we have used a replication_factor of 1, since we are running locally with just one broker.
- Secondly, replication has overhead. The more nodes that need to replicate the data, the longer it takes to produce data to the cluster. That being said, it is always a good idea to enable replication if you can to prevent data loss.
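The first constraint can be expressed as a simple check. The helper below is hypothetical, not a Kafka API; the broker itself rejects such topics at creation time:

```python
def validate_replication_factor(replication_factor: int,
                                num_brokers: int) -> None:
    # Each replica of a partition must live on a distinct broker, so
    # the replication factor can never exceed the broker count.
    if replication_factor > num_brokers:
        raise ValueError(
            f"replication factor {replication_factor} exceeds "
            f"available brokers ({num_brokers})"
        )

validate_replication_factor(1, 1)  # fine: our single-broker local setup
try:
    validate_replication_factor(3, 1)
except ValueError:
    pass  # rejected, as expected
```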