How to find the root cause of high CPU usage of Kafka brokers?


If you have access to JMX metrics, you are most of the way to profiling CPU usage. All you have to do is install Prometheus and Grafana, store the metrics in Prometheus, and monitor them with Grafana. You can find the complete steps in Monitoring Kafka.
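
As a rough illustration of reading such metrics, the sketch below turns idle ratios into busy percentages. It assumes the broker exposes `RequestHandlerAvgIdlePercent` and `NetworkProcessorAvgIdlePercent` as fractions between 0 and 1, and that you have already fetched the values from Prometheus; the function name and the sample numbers are made up for the example.

```python
def busy_report(idle_by_metric):
    """Convert idle fractions (0.0-1.0) into busy percentages, busiest first."""
    report = [
        (name, round((1.0 - idle) * 100.0, 1))
        for name, idle in idle_by_metric.items()
    ]
    # Sort so the most saturated thread pool is listed first.
    return sorted(report, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical sample values, not measurements from a real cluster.
    sample = {
        "RequestHandlerAvgIdlePercent": 0.25,
        "NetworkProcessorAvgIdlePercent": 0.5,
    }
    for name, busy in busy_report(sample):
        print(f"{name}: {busy}% busy")
```

A pool that is busy most of the time is a likely place to look first, although these metrics only show where the broker is saturated, not which code path burns the cycles.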

Grafana Dashboard for cluster monitoring

Note: If you suspect snappy compression, this performance test may help you.

Update:

According to Confluent, SSL can account for a large share of the CPU usage:

Note that if SSL is enabled, the CPU requirements can be significantly higher (the exact details depend on the CPU type and JVM implementation).

You should choose a modern processor with multiple cores. Common clusters utilize 24 core machines.

If you need to choose between faster CPUs or more cores, choose more cores. The extra concurrency that multiple cores offers will far outweigh a slightly faster clock speed.
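
To check whether SSL, compression, or something else dominates on a given broker, one option is a per-thread CPU breakdown of the broker process. The following is a minimal sketch, assuming a Linux host where `/proc/<pid>/task/<tid>/stat` is readable; the helper names and the thread names in the sample data are hypothetical. Linux truncates thread names to 15 characters, so distinct pools can collapse into one group, and a JVM thread dump may be needed to tell them apart.

```python
import re
from collections import defaultdict
from pathlib import Path


def parse_stat_line(stat_line):
    """Return (thread_name, cpu_ticks) from one /proc/<pid>/task/<tid>/stat line."""
    # The name is wrapped in parentheses and may contain spaces,
    # so split on the last ')' instead of on whitespace.
    left, right = stat_line.rsplit(")", 1)
    name = left.split("(", 1)[1]
    fields = right.split()
    # fields[0] is the state (field 3); utime and stime are fields 14 and 15.
    return name, int(fields[11]) + int(fields[12])


def group_name(thread_name):
    """Strip a trailing index such as '-3' or '#12' so pool members are summed."""
    return re.sub(r"[-#\s]*\d+$", "", thread_name) or thread_name


def cpu_ticks_by_group(stat_lines):
    """Sum cumulative CPU ticks per thread group, largest consumer first."""
    totals = defaultdict(int)
    for line in stat_lines:
        name, ticks = parse_stat_line(line)
        totals[group_name(name)] += ticks
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def read_stat_lines(pid):
    """Read one stat line per thread of a running process (Linux only)."""
    lines = []
    for path in Path(f"/proc/{pid}/task").glob("*/stat"):
        try:
            lines.append(path.read_text())
        except OSError:
            # The thread exited between listing and reading; skip it.
            continue
    return lines
```

Calling `cpu_ticks_by_group(read_stat_lines(broker_pid))` with the broker's process ID lists the thread groups by CPU time. The tick counts are cumulative since each thread started, so taking two samples a few seconds apart and comparing them gives the current rate.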

Author: kentor

Updated on June 04, 2022

Comments

  • kentor
    kentor almost 2 years

    I am in charge of operating two Kafka clusters (one for prod and one for our dev environment). The setup is mostly similar, but the dev environment has no SASL/SSL setup and uses just 4 instead of 8 brokers. Each broker is assigned to a dedicated Google Kubernetes node with 4 vCPU and 26GB RAM.

    On our dev environment we've got roughly 1000 messages in / sec and each of the 4 brokers uses pretty consistently 3 out of the 4 available CPU cores (75% CPU usage).

    On our prod environment we've got about 1500 messages in / sec and the CPU usage is also 3 out of 4 cores there.

    It seems that CPU usage is the bottleneck for us, and I'd like to know how I can profile it so that I know what exactly is causing the high CPU usage. Since it's relatively consistent, I guess it could be our snappy compression.

    I am interested in any ideas on how I could investigate the cause of the high CPU usage and how I could tune my cluster to reduce it.

    • Apache Kafka version: 2.1 (CPU load used to be similar on Kafka 0.11.x too)

    • Dev Cluster (Snappy compression, no SASL/SSL, 4 Brokers): 1000 messages in / sec, 3 CPU cores consistent usage

    • Prod cluster (Snappy compression, SASL/SSL, 8 Brokers): 1500 messages in / sec, 3 CPU cores consistent usage

    Side note: I already made sure producers send their messages snappy compressed. I have access to all JMX metrics, but couldn't find anything useful for figuring out the CPU usage.

    I already have metrics attached to my Prometheus (this is where I got the CPU usage stats from too). The problem is that the container's CPU usage doesn't tell me WHY it is that high. I need more granularity, e.g. what are the CPU cycles being spent on (compression? broker communication? SASL/SSL?).

  • kentor
    kentor about 5 years
    Actually, it seems like SSL/SASL isn't the reason, because as described one of my two clusters doesn't have SASL/SSL enabled but still has such high CPU usage. Based on the hardware recommendations given by Confluent, I believe that the CPU usage might be normal though.
  • avp
    avp about 5 years
    Were you able to figure out the reason? We recently upgraded to Kafka 2.1 and are facing a similar issue. 25-40% of the entire fleet is running at high CPU, roughly between 80-90%, whereas the rest of the fleet is running smoothly between 35-50%.
  • kentor
    kentor almost 5 years
    No, I haven't figured it out. To be honest, there was no difference between Kafka 0.11 and 2.2 for us. We run 8 brokers, all topics snappy compressed (producers also use snappy compression). Each broker uses 3-4 CPU cores at a message throughput of 2-3k messages / sec cluster wide. We believe the snappy compression is the reason for the high CPU usage because the broker has to decompress all messages sent by our producers before compressing them again (checksum reasons).
  • kentor
    kentor about 3 years
    Update: The high number of requests plus the high number of partitions seem to have a major impact. With Kafka 2.6 the CPU usage decreased by 15-20%. We have lots of clients but rather low throughput. We process ~1-5M API requests / second in total. It looks like this requires some CPU.