How to detect duplicate messages in a Kafka topic?

Solution 1

Assuming that you actually have multiple different producers writing the same messages, I can see these two options:

1) Write all duplicates to a single Kafka topic, then use something like Kafka Streams (or any other stream processor like Flink, Spark Streaming, etc.) to deduplicate the messages and write deduplicated results to a new topic.

Here's a great Kafka Streams example using state stores: https://github.com/confluentinc/kafka-streams-examples/blob/4.0.0-post/src/test/java/io/confluent/examples/streams/EventDeduplicationLambdaIntegrationTest.java
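The core idea of that example is a state store that remembers recently seen event IDs for a bounded time window, so only the first occurrence of each ID is forwarded. A minimal, self-contained Python sketch of that logic (this is an illustration of the technique, not the actual Kafka Streams API; `SeenStore` and `dedupe` are hypothetical names):

```python
import time

class SeenStore:
    """Toy stand-in for a Kafka Streams state store: remembers event IDs
    for a limited time window so the store does not grow without bound."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.seen = {}  # event_id -> timestamp when it was last observed

    def is_duplicate(self, event_id, now=None):
        now = now if now is not None else time.time()
        # Evict entries older than the deduplication window.
        self.seen = {k: t for k, t in self.seen.items() if now - t <= self.ttl}
        dup = event_id in self.seen
        self.seen[event_id] = now
        return dup

def dedupe(messages, store):
    """Forward only the first occurrence of each event ID within the window."""
    return [m for m in messages if not store.is_duplicate(m["id"], m["ts"])]
```

Note the trade-off the Streams example also makes: the window (TTL) bounds memory, so a duplicate arriving after the window has expired will pass through again.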

2) Make sure that duplicated messages have the same message key, then enable log compaction: Kafka will eventually keep only the latest value per key and discard the older duplicates. This approach is less reliable, because compaction only runs on closed log segments and consumers can still see duplicates until it catches up, but with properly tuned compaction settings it may give you what you want.
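For option 2, the relevant knobs are per-topic configuration entries. The keys below are real Kafka topic configs; the values are illustrative examples, not recommendations, and would be passed to `kafka-topics.sh --create --config ...` or an admin client:

```python
# Per-topic settings that control log compaction.
# Key names are real Kafka topic configs; values are examples only.
compacted_topic_config = {
    "cleanup.policy": "compact",         # keep only the latest value per key
    "min.cleanable.dirty.ratio": "0.1",  # compact more eagerly than the 0.5 default
    "segment.ms": "600000",              # roll segments every 10 min so they become eligible for compaction
    "delete.retention.ms": "60000",      # how long tombstones (null values) are retained
}
```

Because compaction never touches the active segment, lowering `segment.ms` and `min.cleanable.dirty.ratio` shrinks the window during which duplicates remain visible, at the cost of more compaction I/O.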

Solution 2

Since version 0.11, Apache Kafka supports exactly-once semantics: https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/
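Exactly-once semantics combine an idempotent producer (the broker discards duplicates caused by producer retries) with transactions. A sketch of the relevant client settings, using real Kafka producer/consumer config names; the broker address and `transactional.id` value are placeholders:

```python
# Producer settings for exactly-once semantics.
# Config key names are real; the values are placeholders.
eos_producer_config = {
    "bootstrap.servers": "localhost:9092",  # placeholder broker address
    "enable.idempotence": True,             # broker de-duplicates producer retries per partition
    "acks": "all",                          # required when idempotence is enabled
    "transactional.id": "my-app-tx-1",      # placeholder; enables atomic multi-partition writes
}

# Consumers that should only see committed transactional messages set:
eos_consumer_config = {
    "isolation.level": "read_committed",
}
```

One caveat: idempotence only removes duplicates from retries of a single producer session. It does not deduplicate logically identical messages sent by two independent producers, which is why the stream-processing approach in Solution 1 is still needed for that case.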

Author by

ankush reddy

want to learn new things all the time.

Updated on June 04, 2022

Comments

  • ankush reddy (almost 2 years ago)

    Hi, I have an architecture similar to the image shown below.

    I have two Kafka producers which send messages to a Kafka topic, frequently including duplicate messages.

    Is there a way I can handle this situation easily, something like the duplicate detection on a Service Bus topic?

    Thank you for your help.

    [architecture diagram]