Load average is 50 while CPU Utilization is 60%
Please add sar -w 1 output. I suspect the number of context switches per second is killing your performance, because there are many more runnable processes than available processors. I think context switches on a virtual machine are expensive.
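If sysstat is not installed on the instance, a rough stand-in for sar -w (just a sketch) is to diff the kernel's cumulative context-switch counter in /proc/stat:

# Approximate "sar -w 1": the ctxt line in /proc/stat counts
# context switches since boot, so a one-second delta gives cswch/s.
prev=$(awk '/^ctxt/ {print $2}' /proc/stat)
sleep 1
curr=$(awk '/^ctxt/ {print $2}' /proc/stat)
echo "cswch/s: $((curr - prev))"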
If that's the case, there are some kernel tunables that can help you lower the number of context switches:
- Check the value of sysctl kernel.sched_min_granularity_ns. Double it with a command similar to sysctl kernel.sched_min_granularity_ns=2000000. Retest. Double it again. Retest. Repeat. Try to find a value that won't cripple interactivity too much but also won't allow too many context switches, and write it to /etc/sysctl.conf so it will be set at startup.
- Set the apache scheduling policy to SCHED_BATCH: start it with chrt -b 0 apache2 (see the sketch below).
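A minimal sketch of that tune-and-retest loop, assuming root access and a repeatable load test (2000000 is just the doubled example above, not a recommendation):

# Inspect the current value, then try a doubled one:
sysctl kernel.sched_min_granularity_ns
sysctl -w kernel.sched_min_granularity_ns=2000000
# ...rerun the load test, watch cswch/s in "sar -w 1", and repeat
# with 4000000, 8000000, ... until interactivity starts to suffer.

# Persist the winning value across reboots:
echo 'kernel.sched_min_granularity_ns = 2000000' >> /etc/sysctl.conf
sysctl -p   # reload /etc/sysctl.conf without rebooting

# Start Apache under SCHED_BATCH and verify the policy took:
chrt -b 0 apache2
chrt -p "$(pgrep -o apache2)"   # prints the scheduling policy of the oldest worker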
Roman Newaza
Updated on September 18, 2022

Comments
-
Roman Newaza over 1 year
We use EC2 Auto Scaling and recently decided to change the Instance type from m2.2xlarge to c1.xlarge (High Memory to High CPU), because the average amount of RAM used per Instance is 2G. We therefore don't need the 34G provided by m2.2xlarge, and having the extra CPU power of c1.xlarge for the same price seemed like a good idea.
But after switching to c1.xlarge, we have these issues:
- Load average became 50, while CPU Utilization dropped from 70% to 60%.
- Scaling in from 6 Instances to 4 doesn't affect the CPU Utilization CloudWatch metric.
- Response time became very slow, and Instances were constantly being replaced by Auto Scaling because of the ELB Health Check.
- Auto Scaling reduced the number of Instances from 8 to 4 because CPU Utilization dropped.
Can you explain what might be the reason for this behavior, and what I can do about it?
EC2 Instance Types Info:
High-Memory Double Extra Large Instance
34.2 GB of memory
13 EC2 Compute Units (4 virtual cores with 3.25 EC2 Compute Units each)
850 GB of instance storage
64-bit platform
I/O Performance: High
API name: m2.2xlarge
High-CPU Extra Large Instance
7 GB of memory
20 EC2 Compute Units (8 virtual cores with 2.5 EC2 Compute Units each)
1690 GB of instance storage
64-bit platform
I/O Performance: High
API name: c1.xlarge
EDIT:
$ iostat -x
Linux 2.6.38-13-virtual    02/17/2012    _x86_64_    (8 CPU)

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           1.34   0.00     0.13     0.02    0.29  98.23

Device:  rrqm/s  wrqm/s   r/s   w/s  rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
xvdap1     0.04    0.09  0.08  0.13   1.50   0.87     22.99      0.01  36.59    23.42    44.75   4.04   0.08
xvdb       0.00    0.00  0.01  0.00   0.03   0.00      9.37      0.00   1.04     0.95    15.00   1.04   0.00

$ iostat
Linux 2.6.38-13-virtual    02/17/2012    _x86_64_    (8 CPU)

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           1.45   0.00     0.14     0.02    0.31  98.08

Device:   tps  kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
xvdap1   0.21       1.50       0.87    93689    54728
xvdb     0.01       0.03       0.00     1575        8

$ top
top - 05:30:08 up 17:20, 3 users, load average: 15.13, 10.24, 9.66
Tasks: 166 total, 20 running, 146 sleeping, 0 stopped, 0 zombie
Cpu(s): 65.3%us, 4.7%sy, 0.0%ni, 13.5%id, 0.0%wa, 0.0%hi, 0.7%si, 15.8%st
Mem: 7130236k total, 463440k used, 6666796k free, 19100k buffers
Swap: 0k total, 0k used, 0k free, 95136k cached

 PID  USER    PR  NI  VIRT  RES  SHR   S  %CPU  %MEM  TIME+    COMMAND
6457  ubuntu  20   0  257m  11m  4820  S    24   0.2  0:16.73  apache2
6416  ubuntu  20   0  257m  11m  4820  R    23   0.2  0:17.36  apache2
6375  ubuntu  20   0  257m  11m  4820  R    22   0.2  0:17.62  apache2
6402  ubuntu  20   0  257m  11m  4820  R    22   0.2  0:16.85  apache2
6472  ubuntu  20   0  257m  11m  4820  S    22   0.2  0:08.95  apache2
6311  ubuntu  20   0  257m  11m  4820  S    21   0.2  0:24.91  apache2
6446  ubuntu  20   0  257m  11m  4820  R    21   0.2  0:16.91  apache2
6372  ubuntu  20   0  257m  11m  4820  R    21   0.2  0:17.89  apache2
6460  ubuntu  20   0  257m  11m  4820  R    21   0.2  0:16.73  apache2
6379  ubuntu  20   0  257m  11m  4820  R    20   0.2  0:16.24  apache2
6380  ubuntu  20   0  257m  11m  4820  S    20   0.2  0:17.20  apache2
6450  ubuntu  20   0  257m  11m  4820  S    20   0.2  0:16.89  apache2
6426  ubuntu  20   0  257m  11m  4820  R    20   0.2  0:16.96  apache2
6432  ubuntu  20   0  257m  11m  4820  S    20   0.2  0:17.78  apache2
6433  ubuntu  20   0  257m  11m  4820  R    20   0.2  0:14.37  apache2
6476  ubuntu  20   0  257m  11m  4816  R    20   0.2  0:02.92  apache2
6386  ubuntu  20   0  257m  11m  4824  S    20   0.2  0:17.94  apache2
6475  ubuntu  20   0  257m  11m  4820  S    19   0.2  0:03.41  apache2
6355  ubuntu  20   0  257m  11m  4820  S    19   0.2  0:24.39  apache2
6417  ubuntu  20   0  257m  11m  4820  R    18   0.2  0:16.66  apache2
6455  ubuntu  20   0  257m  11m  4820  R    18   0.2  0:16.27  apache2
6393  ubuntu  20   0  257m  11m  4820  S    18   0.2  0:16.60  apache2
6325  ubuntu  20   0  257m  11m  4820  R    18   0.2  0:25.66  apache2
6403  ubuntu  20   0  257m  11m  4820  S    18   0.2  0:15.61  apache2
6474  ubuntu  20   0  257m  11m  4812  S    18   0.2  0:04.37  apache2
6477  ubuntu  20   0  257m  11m  4800  S    18   0.2  0:01.43  apache2
6315  ubuntu  20   0  257m  11m  4820  S    17   0.2  0:25.27  apache2
6376  ubuntu  20   0  257m  11m  4820  R    17   0.2  0:17.53  apache2
6478  ubuntu  20   0  257m  11m  4800  S    15   0.2  0:00.45  apache2
6359  ubuntu  20   0  257m  11m  4820  R    15   0.2  0:23.60  apache2

$ df -h
Filesystem                       Size  Used  Avail  Use%  Mounted on
/dev/xvda1                       7.9G  1.4G   6.1G   19%  /
none                             3.4G  112K   3.4G    1%  /dev
none                             3.4G     0   3.4G    0%  /dev/shm
none                             3.4G   72K   3.4G    1%  /var/run
none                             3.4G     0   3.4G    0%  /var/lock
/dev/xvdb                        414G  199M   393G    1%  /mnt
XXXX.compute.internal:/share_0    99G   28G    66G   30%  /data_0
XXXX.compute.internal:/share_17   99G   30G    64G   33%  /data_17
XXXX.compute.internal:/share_13   99G   30G    64G   33%  /data_13
XXXX.compute.internal:/share_18   99G   31G    64G   33%  /data_18
XXXX.compute.internal:/share_15   99G   28G    66G   30%  /data_15
XXXX.compute.internal:/share_10   99G   28G    67G   30%  /data_10
XXXX.compute.internal:/share_16   99G   30G    64G   32%  /data_16
XXXX.internal:/share_3            99G   29G    66G   31%  /data_3
XXXX.compute.internal:/share_11   99G   30G    64G   32%  /data_11
XXXX.compute.internal:/share_7    99G   28G    66G   30%  /data_7
XXXX.compute.internal:/share      99G   58G    37G   62%  /share
XXXX.compute.internal:/share_2    99G   28G    66G   30%  /data_2
XXXX.compute.internal:/share_8    99G   28G    67G   30%  /data_8
XXXX.compute.internal:/share_19   99G   28G    66G   30%  /data_19
XXXX.compute.internal:/share_14   99G   31G    64G   33%  /data_14
XXXX.compute.internal:/share_5    99G   28G    66G   30%  /data_5
XXXX.compute.internal:/share_6    99G   28G    67G   30%  /data_6
XXXX.compute.internal:/share_1    99G   28G    66G   30%  /data_1
XXXX.compute.internal:/share_12   99G   31G    64G   33%  /data_12
XXXX.compute.internal:/share_4    99G   29G    66G   31%  /data_4
XXXX.compute.internal:/share_9    99G   28G    66G   30%  /data_9

$ free -g
             total  used  free  shared  buffers  cached
Mem:             6     0     6       0        0       0
-/+ buffers/cache:      0     6
Swap:            0     0     0

$ sar 1
Linux 2.6.38-13-virtual    02/17/2012    _x86_64_    (8 CPU)

05:33:02 AM  CPU  %user  %nice  %system  %iowait  %steal  %idle
05:33:03 AM  all  69.27   0.00     5.90     0.00   13.83  11.00
05:33:04 AM  all  70.88   0.00     7.62     0.00   16.50   5.01
05:33:05 AM  all  64.41   0.00     5.35     0.00   17.90  12.34
05:33:06 AM  all  66.41   0.00     9.16     0.00   13.09  11.34
05:33:07 AM  all  74.55   0.00     7.06     0.00   11.21   7.17
05:33:08 AM  all  62.31   0.00     7.49     0.00   13.38  16.81
05:33:09 AM  all  73.65   0.00     5.61     0.00   16.04   4.70
05:33:10 AM  all  76.79   0.00     8.20     0.00    9.70   5.31
05:33:11 AM  all  70.91   0.00     5.86     0.00   14.21   9.02
05:33:12 AM  all  73.95   0.00     6.37     0.00   12.51   7.17
05:33:13 AM  all  63.50   0.00     6.03     0.00   17.52  12.95
05:33:14 AM  all  61.92   0.00     4.42     0.00   17.66  16.00
05:33:15 AM  all  63.56   0.00     6.42     0.00   15.11  14.91
05:33:16 AM  all  72.63   0.00     7.51     0.00   14.90   4.97
05:33:17 AM  all  60.68   0.00     6.17     0.00   15.09  18.06

$ sar -w 1
Linux 2.6.38-13-virtual    02/17/2012    _x86_64_    (8 CPU)

09:34:23 AM  proc/s  cswch/s
09:34:24 AM    0.00  4795.00
09:34:25 AM    0.00  4174.00
09:34:26 AM    0.00  4194.23
09:34:27 AM    1.00  3645.00
09:34:28 AM    0.00  4564.00
09:34:29 AM    0.00  4473.00
09:34:30 AM    0.00  4225.00
09:34:31 AM    0.00  4064.36
09:34:32 AM    0.00  4740.00
09:34:33 AM    0.00  4589.22
09:34:34 AM    0.00  3887.00
09:34:35 AM    0.00  4579.00
09:34:36 AM    0.00  4408.00
09:34:37 AM    1.00  4390.00
09:34:38 AM    0.00  4628.00
-
thinice about 12 years
How are we supposed to help without you telling us what's taking up your CPU?
-
EEAA about 12 years
20 USD says it's iowait.
-
cyberx86 about 12 years
You need to provide a lot more information for a proper diagnosis - but as a starting point, keep in mind that Load Average is more than just CPU: it includes iowait time. My guess is that the excess RAM you had before allowed for significantly more disk caching, which minimized disk I/O. Check and post the output of iostat -x (the %iowait and await values) and/or top (the %wa value) to prove or disprove this. Also post more detail: EBS volumes and setup (e.g. RAID) or ephemeral; df, iostat, top, free, etc. (some sar data, logs, etc. may be helpful as well).
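For convenience, a hedged sketch that gathers everything asked for above into one file to post (assumes the sysstat package provides iostat and sar):

# Collect the requested diagnostics in a single pass:
{
  echo '== iostat -x ==';  iostat -x
  echo '== top ==';        top -b -n 1 | head -40   # batch mode, first 40 lines
  echo '== df -h ==';      df -h
  echo '== free -m ==';    free -m
  echo '== sar -u 1 5 =='; sar -u 1 5
} > diagnostics.txt
-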
Roman Newaza about 12 years
We don't use RAID with this Group. The Instance type is EBS Boot. I started a test Instance and used ab to simulate load. Please look at my post - it has been edited.
-
Tometzky about 12 years
@ErikA: OK - where can I collect my 20 USD? ;-P
-
Roman Newaza about 12 years
I'd rather buy you a beer ;P. Load Average is only high sometimes.
-
Roman Newaza about 12 years
Basically, when there are 8 Instances in the group, Load Average is ~0.5, but when I scale in to 6 Instances, Load Average can rise to ~20.
-
Roman Newaza about 12 years
Does that mean EC2 is overloaded?
-
Roman Newaza about 12 years
And how do I persist the SCHED_BATCH policy?
-
Tometzky about 12 years
SCHED_BATCH tells the Linux kernel that the process is not interactive, so it is better to give it a longer slice of CPU time less often; the 0 is the priority - the default one. You'd have to add this to the startup script that starts Apache. I don't know how you start your server on your system, so I can't help much there. On CentOS/RedHat servers I'd add HTTPD="chrt -b 0 /usr/sbin/httpd" to /etc/sysconfig/httpd.
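On Ubuntu (where the apache2 in this question lives) there is no /etc/sysconfig/httpd; two sketched alternatives, neither a distribution-supported mechanism:

# Option 1: start Apache through chrt by hand (or wrap the
# corresponding call inside /etc/init.d/apache2):
chrt -b 0 /usr/sbin/apache2ctl start

# Option 2: retrofit SCHED_BATCH onto already-running workers:
for pid in $(pgrep apache2); do
    chrt -b -p 0 "$pid"   # set SCHED_BATCH at priority 0 for this PID
done
-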
enedebe about 12 years
No, your instance is overloaded, and the hypervisor acts as a "policeman", stealing the required CPU cycles.
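That steal hypothesis can be checked against data already posted (the 15.8%st in the top output and the %steal column in sar); to keep watching it:

# Sample CPU accounting once a second for 10 seconds; sustained
# double-digit %steal means the hypervisor is running someone
# else's workload on this vCPU.
sar -u 1 10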