Can you lose data in Kafka? 4 scenarios you should know about
Blog
October 15, 2024

Kafka clusters are designed to be resilient: when certain failures occur, they should be able to recover on their own. However, there are still risks you have to consider. In a previous blog, we discussed the Kafka events that are crucial for your business. Now let’s focus on four common scenarios that can lead to data loss, and why they make a backup in cold storage essential for recovery.

1. User and administrator errors

These errors typically happen when users or administrators change configuration settings without fully considering the consequences. Even with safeguards like Infrastructure as Code (IaC) and pull request reviews, mistakes can slip through when the reviewer is unaware of a change’s full impact or overlooks a critical detail.

Misconfigured Topic Retention Policies

  • Retention time too short: Kafka allows you to configure how long messages are kept before deletion. If the retention.ms (retention time in milliseconds) is set too short, messages will be deleted sooner than expected. A simple mistake like missing a few zeros can have significant consequences. For instance, setting it to 60,000 means messages are only stored for 60 seconds.
  • Retention bytes set too low: Kafka can also delete messages based on topic size. If retention.bytes is configured too low, Kafka may delete older messages prematurely to free up space. A misstep here can lead to unintended data loss.
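For illustration, here is a minimal sketch of how these retention settings can be applied with Kafka’s Java Admin client. The broker address (localhost:9092) and the topic name (orders) are placeholders; writing the values out explicitly makes it obvious when a zero has gone missing.

```java
import org.apache.kafka.clients.admin.*;
import org.apache.kafka.common.config.ConfigResource;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class RetentionConfig {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders"); // hypothetical topic

            // 7 days in milliseconds. A dropped zero (e.g. 60000) would keep data for only 60 seconds.
            AlterConfigOp retentionMs = new AlterConfigOp(
                    new ConfigEntry("retention.ms", "604800000"), AlterConfigOp.OpType.SET);

            // 1 GiB per partition. Setting this too low silently deletes older segments to free space.
            AlterConfigOp retentionBytes = new AlterConfigOp(
                    new ConfigEntry("retention.bytes", "1073741824"), AlterConfigOp.OpType.SET);

            admin.incrementalAlterConfigs(Map.of(topic, List.of(retentionMs, retentionBytes)))
                 .all().get();
        }
    }
}
```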

Manual Deletion of Log Segments

Kafka stores events in log segments on disk, and administrators with the right access can delete these segments manually. If they are unaware of the repercussions, such deletions can happen unintentionally. For example, deleting specific records via the kafka-delete-records command-line tool, or programmatically through the DeleteRecords operation, can lead to accidental data loss.

Incorrect Use of the Compact Cleanup Policy

Kafka's log compaction feature allows old records to be removed once newer ones with the same key arrive. If a topic is wrongly configured with cleanup.policy=compact instead of cleanup.policy=delete, Kafka will only keep the latest message for each key. This can result in losing older messages if the intention was to preserve all versions.
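Because a single wrong word in the policy changes the deletion semantics entirely, it is worth verifying what a topic is actually configured with. Below is a small sketch, again using the Java Admin client, that reads the effective cleanup.policy back from the cluster; the broker address and the topic name (audit-log) are hypothetical.

```java
import org.apache.kafka.clients.admin.*;
import org.apache.kafka.common.config.ConfigResource;

import java.util.List;
import java.util.Properties;

public class CheckCleanupPolicy {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "audit-log"); // hypothetical topic
            Config config = admin.describeConfigs(List.of(topic)).all().get().get(topic);

            // For a topic that must keep every event, this should print "delete", not "compact".
            System.out.println("cleanup.policy = " + config.get("cleanup.policy").value());
        }
    }
}
```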

2. Bugs in applications

While Kafka provides solid mechanisms for data retention, application bugs can lead to unanticipated data loss. Here are some common application issues that might result in Kafka events being deleted:

Wrong Producer Configuration

  • A bug may cause an application to mistakenly produce events to the wrong Kafka topic or partition.
  • Kafka producers can automatically retry sending messages when transient failures occur. If retries are disabled or misconfigured, events that fail on the first attempt are never written to the topic and are lost.
  • If the producer is set with acks=0, there is no confirmation that messages were successfully written to Kafka, so failed writes go unnoticed. Safer settings such as acks=1 or, better, acks=all reduce this risk; a sketch of a durability-oriented producer configuration follows this list.
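The sketch below shows such a configuration, assuming a local broker and a hypothetical orders topic: acks=all, idempotence enabled, and a send callback so that failed writes are at least reported instead of silently dropped.

```java
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class SafeProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Durability-oriented settings: wait for all in-sync replicas and retry idempotently.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.toString(Integer.MAX_VALUE));

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "order-42", "created"), (metadata, exception) -> {
                // A failed write surfaces here instead of disappearing silently, as it would with acks=0.
                if (exception != null) {
                    System.err.println("Write failed, event not persisted: " + exception.getMessage());
                }
            });
            producer.flush();
        }
    }
}
```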

Consumer Bugs and Data Deletion in Compacted Topics

In Kafka’s log-compacted topics, records with identical keys are consolidated. Bugs in client applications can cause data to be unintentionally overwritten or deleted:

  • Null value production: producing a record with a null value (a tombstone) and an existing key to a log-compacted topic signals Kafka to delete that key’s data. Bugs can inadvertently produce these null values, leading to unwanted deletions; a sketch of this behaviour follows the list.
  • Wrong keys in compacted topics: an application may accidentally produce messages with incorrect keys in a log-compacted topic. This can overwrite older messages with the same incorrect key, leading to unexpected deletion of important records.
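The following sketch, with a hypothetical customer-profiles topic and a local broker, shows how an ordinary update and a tombstone look from the producer’s side. A bug that sends the null value, or reuses the wrong key, is all it takes to erase live data once compaction runs.

```java
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class TombstoneExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Normal update: compaction keeps only the latest value for this key.
            producer.send(new ProducerRecord<>("customer-profiles", "customer-123", "{\"tier\":\"gold\"}"));

            // Tombstone: a null value tells compaction to remove the key entirely.
            // A bug that produces nulls (or the wrong key) silently erases live data.
            producer.send(new ProducerRecord<>("customer-profiles", "customer-123", null));

            producer.flush();
        }
    }
}
```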

Misusing Transactional Producers

Kafka's exactly-once semantics (EOS) allow you to ensure that a set of writes is committed atomically, but incorrect implementation can lead to data issues:

  • Failed transactions: if a producer fails to commit a transaction, Kafka aborts it, and its records never become visible to consumers reading with read_committed isolation; the data is effectively rolled back and discarded.
  • Incorrect transaction boundaries: bugs in transaction handling can produce incomplete, inconsistent, or incorrectly ordered events. For example, if an application commits a transaction without having sent all of the records that belong together, the committed data is incomplete; if the transaction is aborted instead, every record in it is discarded. A sketch of the standard transactional write pattern follows this list.
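The sketch below uses hypothetical topic names and a placeholder transactional.id. In a real application the error handling is more nuanced (fatal producer exceptions require closing the producer rather than aborting), but the overall shape is the same: every record of the unit of work is sent between beginTransaction() and commitTransaction(), and any failure aborts the whole batch.

```java
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class TransactionalWrite {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "order-writer-1"); // hypothetical transactional.id

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            try {
                producer.beginTransaction();
                // All records that belong to the unit of work must be sent before the commit;
                // anything left out is simply not part of the transaction.
                producer.send(new ProducerRecord<>("orders", "order-42", "created"));
                producer.send(new ProducerRecord<>("payments", "order-42", "pending"));
                producer.commitTransaction();
            } catch (KafkaException e) {
                // On failure the whole transaction is aborted: none of its records
                // ever become visible to read_committed consumers.
                producer.abortTransaction();
                throw e;
            }
        }
    }
}
```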

Application Bugs in Custom Message Compaction

Applications that use custom logic to aggregate or transform messages in a Kafka log-compacted topic are prone to bugs that can lead to unexpected data loss. Common issues include:

  • Incorrect aggregation logic: When an application aggregates multiple records and writes the result back to Kafka, a bug in the logic could lead to premature deletion of older records, even if they should have been retained.
  • Faulty key selection: If an application uses the wrong logic for choosing keys in a compacted topic, it might overwrite or delete records unintentionally, especially if unique keys are meant to persist across multiple records.

Improper Use of Kafka DeleteRecords API

If an application manages data in Kafka topics using the Kafka DeleteRecords API, bugs in its implementation can lead to serious data loss. For instance:

  • Incorrect offset range: if the application miscalculates or misinterprets the offset range it aims to delete, it could end up removing records outside the intended range.
  • Race conditions in deletion: when deletion processes run concurrently and the logic is not properly synchronized, the wrong set of records might be deleted.
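As a reference point, this is roughly what such a deletion looks like with the Java Admin client (topic, partition, and offset are placeholders). Note that there is no confirmation step: whatever offset the application computes is what gets applied.

```java
import org.apache.kafka.clients.admin.*;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;

public class TruncateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

        try (Admin admin = Admin.create(props)) {
            TopicPartition partition = new TopicPartition("orders", 0); // hypothetical topic and partition

            // Everything below offset 1000 is removed permanently. A miscalculated offset,
            // or a race with another deletion job, silently discards records you meant to keep.
            admin.deleteRecords(Map.of(partition, RecordsToDelete.beforeOffset(1000L)))
                 .all().get();
        }
    }
}
```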

3. Security breaches

Apache Kafka can be susceptible to a range of security threats if not properly secured. Below are some common security breaches that can occur in a Kafka environment:

Unauthorized Access to Kafka Topics

One of the most frequent security vulnerabilities involves unauthorized access to Kafka topics, either for reading or writing data. Misconfigured Access Control Lists (ACLs) can allow unauthorized users or services to access sensitive topics, leading to issues such as unauthorized data consumption, the injection of malicious data, or the modification of critical data streams.
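As a sketch of the mitigation, the Java Admin client can grant read access on a topic to a single principal. The topic and principal names below are hypothetical, and a secured cluster would also need the appropriate security.protocol and SASL settings in the client configuration.

```java
import org.apache.kafka.clients.admin.*;
import org.apache.kafka.common.acl.AccessControlEntry;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;

import java.util.List;
import java.util.Properties;

public class RestrictTopicAccess {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

        try (Admin admin = Admin.create(props)) {
            // Allow only the analytics service to read the payments topic (names are hypothetical).
            AclBinding readAcl = new AclBinding(
                    new ResourcePattern(ResourceType.TOPIC, "payments", PatternType.LITERAL),
                    new AccessControlEntry("User:analytics-service", "*",
                            AclOperation.READ, AclPermissionType.ALLOW));

            admin.createAcls(List.of(readAcl)).all().get();
        }
    }
}
```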

Denial of Service (DoS) Attacks

A Denial of Service (DoS) attack can overwhelm Kafka brokers, degrading service performance or making Kafka’s event streaming capabilities entirely unavailable.

Kafka is built to handle high data throughput, but attackers can deliberately flood brokers with massive amounts of messages, causing resource exhaustion (CPU, memory, disk). This can result in slow performance, unresponsive brokers, or even the failure of the entire Kafka cluster. Attackers may also send malformed messages or exploit broker vulnerabilities, further destabilizing the system.
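One partial mitigation is to configure client quotas so that no single producer or consumer can monopolize broker resources. The sketch below, using the Admin client's quota API with a hypothetical client id and an assumed local broker, caps one client’s produce rate at roughly 1 MiB/s.

```java
import org.apache.kafka.clients.admin.*;
import org.apache.kafka.common.quota.ClientQuotaAlteration;
import org.apache.kafka.common.quota.ClientQuotaEntity;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class ThrottleClient {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

        try (Admin admin = Admin.create(props)) {
            // Cap a single client id (hypothetical name) at ~1 MiB/s of produced data,
            // so one runaway or malicious producer cannot exhaust broker resources on its own.
            ClientQuotaEntity entity = new ClientQuotaEntity(
                    Map.of(ClientQuotaEntity.CLIENT_ID, "ingest-gateway"));
            ClientQuotaAlteration alteration = new ClientQuotaAlteration(
                    entity, List.of(new ClientQuotaAlteration.Op("producer_byte_rate", 1_048_576.0)));

            admin.alterClientQuotas(List.of(alteration)).all().get();
        }
    }
}
```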

4. Corrupted data

While Apache Kafka is built for reliability, certain conditions can still lead to data corruption, typically due to hardware failures, misconfigurations, or operational errors. Here are a few scenarios where corruption can occur:

Hardware Failures and Disk Issues

Kafka’s reliance on disk-based storage makes it vulnerable to hardware malfunctions. Disk failures or memory issues can cause data corruption at the file system level. For example, if a Kafka broker writes data to a disk with bad sectors, it may corrupt the log files. This corruption can spread across the Kafka cluster if the broker sends corrupted data to other brokers or consumers.

Misconfigured Replication Settings

Misconfigurations in replication settings can lead to inconsistencies between replicas. If a leader broker fails while its replicas are out of sync, and an out-of-date replica is then elected leader, acknowledged data can be lost or corrupted.
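A sketch of a more defensive topic setup is shown below, with a hypothetical topic name and an assumed local broker: three replicas, min.insync.replicas=2, and unclean leader election disabled, so that a write acknowledged with acks=all on the producer survives the loss of a single broker.

```java
import org.apache.kafka.clients.admin.*;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

        try (Admin admin = Admin.create(props)) {
            // Three replicas plus min.insync.replicas=2 means a write (with acks=all) only succeeds
            // once at least two replicas have it, so a single broker failure cannot lose acknowledged
            // data. Disabling unclean leader election avoids promoting a stale replica after a failure.
            NewTopic topic = new NewTopic("orders", 6, (short) 3) // hypothetical topic, 6 partitions
                    .configs(Map.of(
                            "min.insync.replicas", "2",
                            "unclean.leader.election.enable", "false"));

            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```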

Broker Crashes During Writes

Broker crashes can occur during data writes, leaving corrupted log segments behind. When a broker crashes or restarts while writing data, the resulting incomplete segments may be unreadable or corrupted when the broker attempts to recover and replay its logs.

Corruption of Kafka Log Segments

Kafka log segments can become corrupted due to improper log handling, hardware failures, or misconfigured log compaction settings.

Conclusion

This article examined several scenarios that can lead to data loss in Kafka: user and configuration errors, application bugs, security breaches, and data corruption caused by hardware or operational failures. While many of these risks can be mitigated by adopting best practices and utilizing appropriate tools, they highlight the vital need to protect data integrity within a Kafka environment, just as you would for any database system.

Given the possibility of data loss, it is clear that, in addition to Kafka’s inherent reliability features, implementing continuous backups in cold storage and establishing a comprehensive disaster recovery plan are essential steps for protecting critical data and ensuring successful recovery in the event of a worst-case scenario.