Kafka broker ISR constantly shrinking and expanding?


There was a bug along these lines that was fixed in KAFKA-4477, but in general I've seen this same problem when Kafka brokers time out while talking to a ZooKeeper node (the default session timeout is 6000ms) because of some transient network blip. At that point the broker gets kicked out of the cluster, partition leadership changes, clients have to rebalance, and so on. For high-volume clusters, it's a pain.

Simply increasing this timeout has helped me several times before:

    zookeeper.session.timeout.ms

The default value according to the official docs is 6000ms; I found that simply increasing it to 15000ms made the cluster rock solid.
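
For reference, here's a minimal sketch of the relevant lines in server.properties. The 15000ms value is simply the one that worked for me, so tune it for your environment; zookeeper.connection.timeout.ms is a separate, related setting that defaults to the session timeout if unset:

    # server.properties (per broker; takes effect after a broker restart)
    # Time ZooKeeper waits without heartbeats before expiring the broker's
    # session. Session expiry is what kicks the broker out of the cluster.
    zookeeper.session.timeout.ms=15000
    # Max time the broker waits to establish its ZooKeeper connection.
    zookeeper.connection.timeout.ms=15000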

Documentation for 0.11.0 Kafka version: https://kafka.apache.org/0110/documentation.html


Author: Baby.zhou, updated on September 15, 2022

Comments

  • Baby.zhou (about 1 year ago)

    We have a cluster of 4 nodes in production. We observed that one of the nodes ran into a situation where it constantly shrank and expanded the ISR for more than an hour and was unable to recover until the broker was bounced.

    [2017-02-21 14:52:16,518] INFO Partition [skynet-large-stage,5] on broker 0: Shrinking ISR for partition [skynet-large-stage,5] from 2,0 to 0 (kafka.cluster.Partition)
    [2017-02-21 14:52:16,543] INFO Partition [skynet-large-stage,37] on broker 0: Shrinking ISR for partition [skynet-large-stage,37] from 1,0 to 0 (kafka.cluster.Partition)
    [2017-02-21 14:52:16,544] INFO Partition [skynet-large-stage,13] on broker 0: Shrinking ISR for partition [skynet-large-stage,13] from 1,0 to 0 (kafka.cluster.Partition)
    [2017-02-21 14:52:16,545] INFO Partition [__consumer_offsets,46] on broker 0: Shrinking ISR for partition [__consumer_offsets,46] from 3,2,0 to 3,0 (kafka.cluster.Partition)
    ...

    I'd like to know what could cause this issue and why the broken broker was not kicked out of the ISR.

    Kafka version is 0.10.1.0

    • Sönke Liebau (over 6 years ago)
      Was there anything in the other nodes' logs when this occurred? There are two reasons why a broker would shrink the ISR: 1. the replica cannot keep up with the data, or 2. the replica has not tried to read anything for a while. So I would suspect that the issue is either with the other brokers or with connectivity between the nodes. Once the replicas read up to the end of the partition they will be added back to the ISR - this is probably the "bouncing" you saw. The broker was not dropped from the ISR because it was almost certainly the leader for these partitions.
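
      If it happens again, a quick way to see which partitions are affected while the problem is ongoing is the --under-replicated-partitions flag on the topics tool. A sketch, assuming a 0.10.x-era deployment where the tool talks to ZooKeeper (zk1:2181 is a placeholder for your own connection string):

          # List every partition whose ISR is currently smaller
          # than its full replica set.
          bin/kafka-topics.sh --describe \
              --zookeeper zk1:2181 \
              --under-replicated-partitions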