Prometheus alert CPUThrottlingHigh raised but monitoring does not show it
Solution 1
The CPUThrottlingHigh alert is created by the kubernetes-mixin project. There is an open issue (#108) discussing this alert; I suggest reading all the comments on it to better understand the problem.
In short, the problem is: when working with low CPU limits, spiky workloads can have low average CPU usage and still be throttled.
Also, take a look at issue #67577 from the Kubernetes project, which addresses a kernel bug in CFS quotas that may cause unnecessary CPU throttling. The discussion is still open, and the Kubernetes project is even considering disabling CFS quotas for pods in the Guaranteed QoS class (see #70585 for reference).
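To make the spiky-workload point concrete, here is a small arithmetic sketch (the burst size and window length are made-up numbers, not from the question): with a 100m CPU limit, the CFS quota is 10ms of CPU time per 100ms scheduling period, so a single 30ms burst gets throttled even though average usage over a 5s window looks negligible.

```python
# Sketch: why a spiky workload is throttled despite a near-zero average.
# Assumptions: 100m CPU limit -> 10ms quota per default 100ms CFS period.
period_ms = 100.0   # cpu.cfs_period_us = 100000 (kernel default)
quota_ms = 10.0     # 100m limit -> 10ms of CPU time per period

# One 30ms burst of work in a single period, then 49 idle periods (~5s window).
bursts_ms = [30.0] + [0.0] * 49

throttled_periods = sum(1 for b in bursts_ms if b > quota_ms)
active_periods = sum(1 for b in bursts_ms if b > 0)
avg_cpu_cores = sum(bursts_ms) / (len(bursts_ms) * period_ms)

print(avg_cpu_cores)                              # 0.006 cores on average -- looks idle
print(100 * throttled_periods / active_periods)   # 100.0 -- every active period was throttled
```

This is exactly the ratio the CPUThrottlingHigh expression computes: throttled CFS periods divided by total CFS periods, which can be high even when average CPU usage is tiny.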
Consider the following options:
- Increase (or even remove) your container CPU limits
- Disable Kubernetes CFS quotas entirely (kubelet flag --cpu-cfs-quota=false)
- Use a kernel version that contains this fix (torvalds/linux 512ac99)
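As an illustration of the first option, a minimal pod spec sketch (the values are hypothetical, not taken from the question):

```yaml
# Hypothetical container resources fragment: raise the CPU limit well above
# the request, or omit limits.cpu entirely so no CFS quota is applied.
resources:
  requests:
    cpu: 25m
    memory: 64Mi
  limits:
    cpu: 500m      # raised; or delete this line to remove the CPU limit
    memory: 128Mi
```

Note that removing only the CPU limit keeps the memory limit in place, so the pod drops out of the Guaranteed QoS class but is still protected against memory overuse.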
Solution 2
To piggyback off of @eduardo-baitello's answer: a third option is to increase the CPUThrottlingPercent config here
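For reference, kubernetes-mixin is configured through jsonnet; a minimal sketch, assuming the tunable is still named cpuThrottlingPercent (check config.libsonnet in your mixin version before relying on this):

```jsonnet
// Hypothetical mixin override: raise the throttling threshold from the
// default 25% so CPUThrottlingHigh only fires on heavier throttling.
(import 'kubernetes-mixin/mixin.libsonnet') + {
  _config+:: {
    cpuThrottlingPercent: 70,
  },
}
```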
jobou
Updated on September 18, 2022
Comments
-
jobou over 1 year: I have installed Prometheus to monitor my installation, and it is frequently raising alerts about CPU throttling.
The Prometheus alert rule that raises this alert is:
alert: CPUThrottlingHigh
expr: 100 * sum by(container_name, pod_name, namespace) (increase(container_cpu_cfs_throttled_periods_total{container_name!=""}[5m]))
  / sum by(container_name, pod_name, namespace) (increase(container_cpu_cfs_periods_total[5m])) > 25
for: 15m
If I look at one of the pods identified by this alert, it does not seem to have any reason to throttle:
$ kubectl top pod -n monitoring my-pod
NAME     CPU(cores)   MEMORY(bytes)
my-pod   0m           6Mi
This pod has one container with these resources set up:
Limits:
  cpu:     100m
  memory:  128Mi
Requests:
  cpu:     25m
  memory:  64Mi
And the node that is hosting this pod is not under any heavy CPU use:
$ kubectl -n monitoring top node aks-agentpool-node-1
NAME                   CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
aks-agentpool-node-1   853m         21%    11668Mi         101%
In Grafana, if I look at the chart for this pod, it never goes above 0.000022 of CPU usage. Why is it throttling?
-
sheldonhull about 5 years: Have you validated that you need to multiply by 100? Is it a decimal format for the percentage or an integer value?
-
jobou about 5 years: I am not sure I understand. The expression 100m is the same as 0.1 CPU.
-
sheldonhull about 5 years: If that's the syntax for Prometheus then disregard. Was just checking.
-
jobou about 5 years: Thanks, it is nice to know that the issue is not in my installation. I don't want to increase or remove limits for now, and I am using AKS (a managed Kubernetes offering in the Azure cloud), so I cannot disable CFS quotas. As per the comments in the issue, I am going to try updating the kernel version to see what happens.
-
Vadim Yangunaev over 4 years: @jobou, have you updated the kernel version? Was it helpful?
-
jobou over 4 years: No, sorry, I have not worked on this issue yet, so I don't have any real feedback to give.
-
Jon Carlson over 3 years: We are still seeing this exact same thing in Dec 2020, and we are running on GKE 1.16.15.4300. Our requests and limits are quite a bit higher than the ones in the question. Removing CPU limits is not really an option for us.
-
BPD over 3 years: Mind posting this as a comment?