Kafka: Consumer API vs Streams API


Solution 1

Update January 2021: I wrote a four-part blog series on Kafka fundamentals that I'd recommend to read for questions like these. For this question in particular, take a look at part 3 on processing fundamentals.

Update April 2018: Nowadays you can also use ksqlDB, the event streaming database for Kafka, to process your data in Kafka. ksqlDB is built on top of Kafka's Streams API, and it too comes with first-class support for Streams and Tables.

What is the difference between the Consumer API and the Streams API?

Kafka's Streams library (https://kafka.apache.org/documentation/streams/) is built on top of the Kafka producer and consumer clients. Kafka Streams is significantly more powerful and also more expressive than the plain clients.

It's much simpler and quicker to write a real-world application start to finish with Kafka Streams than with the plain consumer.

Here are some of the features of the Kafka Streams API, most of which are not supported by the consumer client (it would require you to implement the missing features yourself, essentially re-implementing Kafka Streams).

  • Supports exactly-once processing semantics via Kafka transactions (this is what "EOS" refers to)
  • Supports fault-tolerant stateful (as well as stateless, of course) processing including streaming joins, aggregations, and windowing. In other words, it supports management of your application's processing state out-of-the-box.
  • Supports event-time processing as well as processing based on processing-time and ingestion-time. It also seamlessly processes out-of-order data.
  • Has first-class support for both streams and tables, which is where stream processing meets databases; in practice, most stream processing applications need both streams AND tables for implementing their respective use cases, so if a stream processing technology lacks either of the two abstractions (say, no support for tables) you are either stuck or must manually implement this functionality yourself (good luck with that...)
  • Supports interactive queries (also called 'queryable state') to expose the latest processing results to other applications and services via a request-response API. This is especially useful for traditional apps that can only do request-response, but not the streaming side of things.
  • Is more expressive: it ships with (1) a functional programming style DSL with operations such as map, filter, reduce as well as (2) an imperative style Processor API for e.g. doing complex event processing (CEP), and (3) you can even combine the DSL and the Processor API.
  • Has its own testing kit for unit and integration testing.
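To illustrate the functional-style DSL mentioned above, here is a minimal word-count sketch. The topic names `text-input` and `counts-output` are assumptions chosen for the example, not anything mandated by the API:

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class WordCountApp {

    // Build the processing topology: split lines into words and count them.
    static Topology buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> lines = builder.stream("text-input");
        KTable<String, Long> counts = lines
            .flatMapValues(line -> Arrays.asList(line.toLowerCase(Locale.ROOT).split("\\W+")))
            .groupBy((key, word) -> word) // re-key by word (causes a repartition)
            .count();                     // stateful aggregation, backed by a changelog topic
        counts.toStream().to("counts-output", Produced.with(Serdes.String(), Serdes.Long()));
        return builder.build();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-example");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        KafkaStreams streams = new KafkaStreams(buildTopology(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Note that the testing kit mentioned in the last bullet (`TopologyTestDriver` from `kafka-streams-test-utils`) lets you pipe records through a topology like this one in a unit test, without running a broker.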

See http://docs.confluent.io/current/streams/introduction.html for a more detailed but still high-level introduction to the Kafka Streams API, which should also help you to understand the differences to the lower-level Kafka consumer client.

Beyond Kafka Streams, you can also use the streaming database ksqlDB to process your data in Kafka. ksqlDB separates its storage layer (Kafka) from its compute layer (ksqlDB itself; it uses Kafka Streams for most of its functionality here). It supports essentially the same features as Kafka Streams, but you write streaming SQL statements instead of Java or Scala code. You can interact with ksqlDB via a UI, CLI, and a REST API; it also has a native Java client in case you don't want to use REST. Lastly, if you prefer not having to self-manage your infrastructure, ksqlDB is available as a fully managed service in Confluent Cloud.

So how is the Kafka Streams API different, given that it also consumes messages from and produces messages to Kafka?

Yes, the Kafka Streams API can both read data as well as write data to Kafka. It supports Kafka transactions, so you can e.g. read one or more messages from one or more topic(s), optionally update processing state if you need to, and then write one or more output messages to one or more topics—all as one atomic operation.
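To make that atomic read-process-write pattern concrete, here is a hedged sketch of what it looks like when done by hand with the plain consumer and producer clients (topic, group, and transactional IDs are assumptions for the example). Kafka Streams performs the equivalent for you when you set `processing.guarantee=exactly_once_v2`:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalReadProcessWrite {

    // A trivial stand-in for your processing logic.
    static String process(String value) {
        return value.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Properties cp = new Properties();
        cp.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        cp.put(ConsumerConfig.GROUP_ID_CONFIG, "rpw-group");
        cp.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        cp.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
        cp.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        cp.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        Properties pp = new Properties();
        pp.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        pp.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "rpw-tx-1");
        pp.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        pp.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pp)) {
            consumer.subscribe(Collections.singletonList("input-topic"));
            producer.initTransactions();
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) continue;
                producer.beginTransaction();
                Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                for (ConsumerRecord<String, String> r : records) {
                    producer.send(new ProducerRecord<>("output-topic", r.key(), process(r.value())));
                    offsets.put(new TopicPartition(r.topic(), r.partition()),
                                new OffsetAndMetadata(r.offset() + 1));
                }
                // Commit the consumed offsets and the produced records atomically.
                producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                producer.commitTransaction();
            }
        }
    }
}
```

Even this simplified loop omits error handling (e.g. aborting the transaction on failure), which hints at how much of the work Kafka Streams takes off your hands.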

And why is it needed, when we can write our own consumer application using the Consumer API and process the messages as needed, or send them to Spark from the consumer application?

Yes, you could write your own consumer application -- as I mentioned, the Kafka Streams API uses the Kafka consumer client (plus the producer client) itself -- but you'd have to manually implement all the unique features that the Streams API provides. See the list above for everything you get "for free". It is thus a rare circumstance that a user would pick the plain consumer client rather than the more powerful Kafka Streams library.
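For comparison, here is a minimal sketch of such a hand-written application using the plain Consumer API (topic and group names are assumptions). Everything beyond this loop, such as state management, windowing, and rebalance-safe stores, would be yours to build:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PlainConsumerLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("input-topic"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    // Your processing logic goes here; any state, joins, or
                    // windowing must be implemented and made fault-tolerant by you.
                    System.out.printf("%s -> %s%n", record.key(), record.value());
                }
                consumer.commitSync(); // manual offset management: at-least-once delivery
            }
        }
    }
}
```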

Solution 2

The Kafka Streams component is built to support ETL-style message transformation: it reads an input stream from a topic, transforms it, and writes the output to other topics. It supports real-time processing and, at the same time, advanced analytic features such as aggregations, windowing, joins, etc.

"Kafka Streams simplifies application development by building on the Kafka producer and consumer libraries and leveraging the native capabilities of Kafka to offer data parallelism, distributed coordination, fault tolerance, and operational simplicity."

Below are the key architectural features of Kafka Streams; see the Kafka Streams architecture documentation for details.

  1. Stream Partitions and Tasks: Kafka Streams uses the concepts of partitions and tasks as logical units of its parallelism model based on Kafka topic partitions.
  2. Threading Model: Kafka Streams allows the user to configure the number of threads that the library can use to parallelize processing within an application instance.
  3. Local State Stores: Kafka Streams provides so-called state stores, which can be used by stream processing applications to store and query data, which is an important capability when implementing stateful operations.
  4. Fault Tolerance: Kafka Streams builds on fault-tolerance capabilities integrated natively within Kafka. Kafka partitions are highly available and replicated, so when stream data is persisted to Kafka it is available even if the application fails and needs to re-process it.
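Points 3 and 4 above can be sketched together: a named local state store that is backed by a changelog topic in Kafka for fault tolerance, and that the application can also query interactively. The store and topic names here are assumptions for the example:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class QueryableCounts {

    static Topology buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("events")
            .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
            // Materialize the counts in a named local state store; Kafka Streams
            // replicates it to a changelog topic so it survives failures.
            .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("event-counts"));
        return builder.build();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "queryable-counts");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        KafkaStreams streams = new KafkaStreams(buildTopology(), props);
        streams.start();

        // Interactive query: read the latest count for a key directly from the
        // local store, e.g. to serve a REST endpoint. (Real code should wait
        // until the instance reaches the RUNNING state before querying.)
        ReadOnlyKeyValueStore<String, Long> store = streams.store(
            StoreQueryParameters.fromNameAndType("event-counts", QueryableStoreTypes.keyValueStore()));
        System.out.println("count for some-key: " + store.get("some-key"));
    }
}
```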

Based on my understanding, below are the key differences. I am open to updating this if any point is missing or misleading.

[Comparison tables were provided as images in the original answer.]

Where to use Consumer - Producer:

  1. If there is a single consumer that consumes and processes messages but does not publish the results to other topics.
  2. Similarly, if you only have a producer producing messages, you don't need Kafka Streams.
  3. If you consume messages from one Kafka cluster but publish to topics in a different Kafka cluster. In that case you can still use Kafka Streams, but you would have to use a separate producer to publish the messages to the other cluster; alternatively, simply use the plain Kafka consumer/producer mechanism.
  4. Batch processing: if there is a requirement to collect messages or do batch-style processing, the traditional approach is a good fit.

Where to use Kafka Streams:

  1. If you consume messages from one topic, transform them, and publish to other topics, Kafka Streams is best suited.
  2. Real-time processing, real-time analytics, and machine learning.
  3. Stateful transformations such as aggregations, windowed joins, etc.
  4. When planning to use local state stores or mounted state stores such as Portworx, etc.
  5. To achieve exactly-once processing semantics and built-in fault tolerance.
Author: sabtharishi

Updated on April 07, 2022

Comments

  • sabtharishi
    sabtharishi about 2 years

    I recently started learning Kafka and ended up with these questions.

    1. What is the difference between a Consumer and a Stream? To me, any tool/application that consumes messages from Kafka is a consumer in the Kafka world.

    2. How is a Stream different, given that it also consumes messages from or produces messages to Kafka? And why is it needed, when we can write our own consumer application using the Consumer API and process the messages as needed, or send them to Spark from the consumer application?

    I did Google this, but did not get any good answers. Sorry if this question is too trivial.

  • bhh1988
    bhh1988 almost 6 years
    In what case would an application use Kafka Consumer API over Kafka Streams API?
  • miguno
    miguno almost 6 years
    Primarily in situations where you need direct access to the lower-level methods of the Kafka Consumer API. Now that Kafka Streams is available, this is typically done for rather custom, specialized applications and use cases. Here's an analogy: Imagine that Kafka Streams is a car -- most people just want to drive it but don't want to become car mechanics. But some people might want to open and tune the car's engine for whatever reason, which is when you might want to directly use the Consumer API. (That being said, Kafka Streams also has the Processor API for custom needs.)
  • Yonatan Kiron
    Yonatan Kiron over 4 years
    I think that the main thing that differentiates them is the ability to access stores. Once you understand the strength of using stores within a stream, you will understand the power of Kafka Streams.
  • uptoyou
    uptoyou over 4 years
    Awesome, really helpful. But there is one major mistake: exactly-once semantics are available in both the Consumer and Streams APIs. Moreover, EOS is just a bunch of lower-level settings for the consumer/producer, and this group of settings, in conjunction with their specific values, guarantees EOS behavior. I'm currently using EOS with the Consumer API without issues.
  • Nitin
    Nitin over 4 years
    Yeah, right: we can enable exactly-once semantics in Kafka Streams by setting a property, whereas for a plain producer and consumer we need to configure idempotence and transactions to get the same unit-of-work guarantee.
  • Nitin
    Nitin over 4 years
    I updated the wording as per the suggestion.
  • Nag
    Nag almost 4 years
    @sun007, which is faster for simple applications that don't need real-time capabilities? And does using Streams add "extra" conversion overhead, like other high-level tools on top of Kafka's native functionality?
  • miguno
    miguno about 3 years
    @uptoyou: "moreover EOS is just a bunch of settings for consumer/producer at lower level" This is not true. The EOS functionality in Kafka Streams has several important features that are not available in the plain Kafka consumer/producer. It is possible to implement this yourself (DIY) with the consumer/producer, which is exactly what the Kafka developers did for Kafka Streams, but this is not easy. Details at confluent.io/blog/enabling-exactly-once-kafka-streams