How to get CPU usage percentage for a namespace from Prometheus?


I found out why I couldn't use the metric I cited above: usually only a few pods have a CPU limit set at all. Limits are not needed in general, and setting them everywhere would make the cluster clumsy.

So

sum(kube_pod_container_resource_limits{resource="cpu", unit="core", namespace="$Namespace"})

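does sum all the existing limits on the pods of the namespace, but that's not the theoretical 100% CPU usage of the namespace. Containers without a limit still add to the usage in the numerator while contributing nothing to the denominator, which is why percentages over 100% sometimes appear.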
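To make the effect concrete, here is a small arithmetic sketch in Python. The container names and numbers are made up for illustration; only the shape of the calculation mirrors the usage / limits ratio discussed here.

```python
# Hypothetical per-container data for one namespace (illustrative numbers only).
# usage: CPU cores currently consumed; limit: configured CPU limit, or None if unset.
containers = [
    {"name": "api", "usage": 0.8, "limit": 1.0},
    {"name": "worker", "usage": 1.5, "limit": None},  # no limit: absent from the limits metric
    {"name": "cache", "usage": 0.2, "limit": None},   # no limit: absent from the limits metric
]

# Numerator: usage is summed over every container in the namespace.
total_usage = sum(c["usage"] for c in containers)

# Denominator: only containers that have a limit contribute anything.
total_limits = sum(c["limit"] for c in containers if c["limit"] is not None)

percent_of_limits = total_usage / total_limits * 100
print(f"{percent_of_limits:.0f}%")  # prints 250%
```

With 2.5 cores in use but only 1.0 core of declared limits, the ratio comes out far above 100% even though nothing is misbehaving.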

However, I learned that in theory the namespace could use up all the resources available on the nodes of the cluster. I also learned that in production our product will likely run on machines very similar to this test server. So in my (lucky) case it is valid to calculate the CPU usage percentage as namespace CPU usage / total CPU available in the cluster.

Here is how I do that:

sum (rate (container_cpu_usage_seconds_total{namespace="$Namespace"}[1m])) / sum(machine_cpu_cores) * 100

where $Namespace is the name of the namespace.
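For intuition about the units, the following Python sketch mimics what the query computes, under simplifying assumptions: each container has exactly two counter readings, and there are no counter resets (the real `rate()` also handles resets and extrapolation). The sample values and the `cpu_percent` helper are hypothetical.

```python
def cpu_percent(samples, total_cores):
    """samples maps a container name to two (timestamp_seconds, cpu_seconds_total) readings."""
    cores_used = 0.0
    for (t0, v0), (t1, v1) in samples.values():
        # Counter increase divided by elapsed time: CPU-seconds per second, i.e. cores.
        cores_used += (v1 - v0) / (t1 - t0)
    # Share of the cluster's cores, expressed as a percentage.
    return cores_used / total_cores * 100


samples = {
    "api": ((0.0, 100.0), (60.0, 130.0)),     # 30 CPU-seconds in 60 s -> 0.5 cores
    "worker": ((0.0, 400.0), (60.0, 490.0)),  # 90 CPU-seconds in 60 s -> 1.5 cores
}
print(cpu_percent(samples, total_cores=8))  # 2.0 cores of 8 -> 25.0
```

The per-second rate of a CPU-seconds counter is a number of cores, so dividing by the total core count gives a dimensionless fraction that can never exceed 100% of the cluster.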

(The same applies to memory usage.)
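As a rough sketch of what the memory counterpart might look like, the hypothetical helper below builds both query strings. The memory metric names (`container_memory_working_set_bytes` and `machine_memory_bytes`, both exposed by cAdvisor) are my assumption; the original does not say which memory metrics were used. Memory is a gauge, so no `rate()` is applied to it.

```python
def usage_percent_query(resource, namespace):
    """Return a PromQL string: namespace usage as a percentage of cluster capacity."""
    if resource == "cpu":
        # CPU is a counter of seconds, so it needs rate() to become cores.
        usage = f'sum(rate(container_cpu_usage_seconds_total{{namespace="{namespace}"}}[1m]))'
        capacity = "sum(machine_cpu_cores)"
    elif resource == "memory":
        # Assumed metric choice: working-set bytes against total machine memory.
        usage = f'sum(container_memory_working_set_bytes{{namespace="{namespace}"}})'
        capacity = "sum(machine_memory_bytes)"
    else:
        raise ValueError(f"unsupported resource: {resource}")
    return f"{usage} / {capacity} * 100"


print(usage_percent_query("memory", "$Namespace"))
```

Passing `"$Namespace"` keeps the Grafana variable in the generated string, so the output can be pasted into a dashboard panel unchanged.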

So this is what I'm going to monitor while running load and stress tests.

Author: zslim

Updated on September 18, 2022

Comments

  • zslim
    zslim over 1 year

    Our product lives in a Kubernetes cluster on our server. It is not in production yet, so there are multiple instances running in the cluster for different purposes, each in its own namespace. I need to run some load tests on one of the namespaces and I need to monitor CPU usage meanwhile. We have Prometheus and Grafana for monitoring.
    One of the objectives of these tests is to learn what load drives CPU usage to its maximum.

    So I'm looking for a way to query the CPU usage of a namespace as a percentage.

    Here is what I put together based on examples:

    sum (rate (container_cpu_usage_seconds_total{namespace="$Namespace"}[1m])) / sum(kube_pod_container_resource_limits{resource="cpu", unit="core", namespace="$Namespace"}) * 100
    

    However, something must be wrong with this solution because occasionally values over 100% show up on the dashboard. Thinking the units must be different, I tried to look up the exact specification of these metrics but I didn't succeed.

    (Sadly, I don't know much about how CPU usage is calculated or what 100% actually means.)

    I searched for metrics that could be used for this problem through a few exporters: cAdvisor, Node, kube-state-metrics and more. Even in this seemingly exhaustive article, which was brought to my attention, it is stated that the metric I'm looking for is an important one but no way is provided to query it.

    Any help would be appreciated, thank you.

  • zslim
    zslim over 4 years
    Hey, thanks for your answer. Both queries you cited give the current CPU usage of the namespaces in cores or CPU time (would be nice to know which), but that's not what I need. I need CPU usage as the proportion of the maximum CPU usage. Thank you for the exporter recommendation, I think it has the thing that I need but unfortunately I have a fixed set of exporters and kube-eagle is not in it. :(