Load average is high but resources are not fully used


Solution 1

You might want to enable Apache's mod_status (http://httpd.apache.org/docs/2.0/mod/mod_status.html) so you can see exactly what's happening inside your web server. Specifically, you'll get numbers on per-request CPU consumption.
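Once mod_status is enabled and its handler is mapped to a URL (commonly /server-status, but the path is whatever your configuration sets), requesting that URL with ?auto appended returns a plain-text summary whose Scoreboard line holds one character per worker slot. Per-request details such as CPU use need ExtendedStatus On in Apache 2.0/2.2. The minimal Python sketch below assumes the ?auto text has already been fetched into a string and tallies the worker slots by state, so you can see at a glance whether workers are busy sending replies, parked in keepalive, or idle. The sample response is invented for illustration.

```python
from collections import Counter

# Scoreboard legend from the mod_status documentation: one character per worker slot.
SCOREBOARD_STATES = {
    "_": "waiting for connection",
    "S": "starting up",
    "R": "reading request",
    "W": "sending reply",
    "K": "keepalive (read)",
    "D": "DNS lookup",
    "C": "closing connection",
    "L": "logging",
    "G": "gracefully finishing",
    "I": "idle cleanup of worker",
    ".": "open slot with no current process",
}


def parse_auto_status(text):
    """Turn the 'Key: value' lines of a mod_status ?auto response into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def scoreboard_summary(text):
    """Count worker slots per state using the Scoreboard line."""
    board = parse_auto_status(text).get("Scoreboard", "")
    return Counter(SCOREBOARD_STATES.get(ch, "unknown") for ch in board)


# Hypothetical ?auto response, not output from the server in the question.
SAMPLE = "Total Accesses: 1200\nBusyWorkers: 5\nIdleWorkers: 2\nScoreboard: WWKR_W_..\n"

if __name__ == "__main__":
    for state, count in scoreboard_summary(SAMPLE).most_common():
        print(f"{state:35} {count}")
```

With KeepAlive On, a large K count would suggest that workers are tied up by idle keepalive connections rather than by real work.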

A few snapshots from vmstat/iostat wouldn't hurt, either.
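When you take those snapshots (for example "vmstat 5 10" during a spike), the columns that matter for load are r (runnable processes, whether running or waiting for a CPU), b (processes blocked in uninterruptible sleep, usually on I/O) and the CPU split us/sy/id/wa. Here is a minimal Python sketch, assuming procps vmstat output saved to a string with the column names on its second line. It flags samples whose run queue exceeds the CPU count; the default of 4 is taken from the Linode in the question, and the sample data is invented for illustration.

```python
from __future__ import annotations


def parse_vmstat(text: str) -> list[dict[str, int]]:
    """Turn vmstat output into one dict per sample, keyed by column name."""
    lines = [line.split() for line in text.strip().splitlines()]
    header = lines[1]  # line 0 is the group banner: "procs ---memory--- ..."
    samples = []
    for fields in lines[2:]:
        # vmstat reprints both header lines periodically; skip anything non-numeric.
        if not fields or not fields[0].isdigit():
            continue
        samples.append(dict(zip(header, (int(v) for v in fields))))
    return samples


def run_queue_over(samples, ncpus=4):
    """Return the samples whose runnable count (r) exceeds the number of CPUs."""
    return [s for s in samples if s["r"] > ncpus]


# Hypothetical vmstat capture, invented for illustration.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
17  0    500 600000  28000 1450000    0    0     5    20  300  500 84  9  0  0
 2  1    500 610000  28000 1450000    0    0   900    20  300  500 40  5 30 25
"""

if __name__ == "__main__":
    for s in run_queue_over(parse_vmstat(SAMPLE)):
        print("run queue", s["r"], "idle", s["id"], "iowait", s["wa"])
```

Reading columns by header name rather than position keeps the sketch working with newer vmstat versions that append extra columns such as st.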

Also, are you using MyISAM or InnoDB tables? When you get one of these load spikes, what do you get from "SHOW FULL PROCESSLIST\G" in MySQL? I have a feeling you're getting lock/query contention in MySQL which is blowing up the length of your kernel run queue.
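One way to check the contention theory is to capture the process list non-interactively during a spike, for example with mysql -B -e "SHOW FULL PROCESSLIST", which prints tab-separated rows under a header line, and tally the busy threads by State. The Python sketch below assumes that format. Treating any State that mentions "lock" as lock contention is only a rough heuristic, not a MySQL rule; on MyISAM tables, threads waiting for a table lock typically show up as Locked on older servers and as Waiting for table level lock on newer ones. The sample rows and the images table are hypothetical.

```python
from collections import Counter


def tally_processlist(batch_output):
    """Count non-sleeping MySQL threads by State from tab-separated batch output.

    Returns (Counter of states, number of threads whose State mentions a lock).
    """
    lines = batch_output.rstrip("\n").split("\n")
    header = lines[0].split("\t")
    states = Counter()
    lock_waits = 0
    for line in lines[1:]:
        row = dict(zip(header, line.split("\t")))
        if row.get("Command") == "Sleep":
            continue  # idle connections say nothing about contention
        state = row.get("State") or "(empty)"
        states[state] += 1
        # Heuristic only: matches e.g. "Locked" or "Waiting for table level lock".
        if "lock" in state.lower():
            lock_waits += 1
    return states, lock_waits


# Hypothetical capture: one UPDATE holding a table lock, two SELECTs waiting on it.
SAMPLE = (
    "Id\tUser\tHost\tdb\tCommand\tTime\tState\tInfo\n"
    "11\tweb\tlocalhost\tsite\tQuery\t4\tLocked\tSELECT * FROM images WHERE id = 7\n"
    "12\tweb\tlocalhost\tsite\tQuery\t4\tLocked\tSELECT * FROM images WHERE id = 9\n"
    "13\tweb\tlocalhost\tsite\tQuery\t5\tUpdating\tUPDATE images SET views = views + 1 WHERE id = 3\n"
    "14\tweb\tlocalhost\tsite\tSleep\t30\t\tNULL\n"
)

if __name__ == "__main__":
    states, locks = tally_processlist(SAMPLE)
    print(dict(states), "lock waits:", locks)
```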

Solution 2

I don't have a full solution for you, but I have some guesses.

  1. Your MySQL server seems to have only something like a 128MB pool. If the LAMP system uses a fair-sized database, that seems on the low side, and it would generate a lot of disk I/O. Also, if there are CPU spikes in mysqld, turn on slow-query logging for a while and see what shows up. A new index or two might be in order.
  2. For a top replacement that can read most of the per-process counters in a modern kernel, I recommend atop. Among other things, it can show disk access per process. Note that atop runs a daemon as part of its setup, so you may want to uninstall it when you are done.
  3. Be careful which CPU usage numbers you trust; different tools compute them in somewhat different ways. In my experience, vmstat gives the "best" (== closest to perceived load) numbers for overall CPU usage.
  4. Some Apache processes are doing serious work. Perhaps some PHP code optimization is in order?
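Regarding point 3: top and vmstat both derive CPU percentages by reading the cumulative tick counters on the aggregate cpu line of /proc/stat twice and dividing each field's increase by the total increase, but they group and label the fields somewhat differently depending on tool and version. A minimal Python sketch of that calculation, assuming a Linux kernel new enough to report the steal column (relevant on a Linode, where the question's top output shows 5.9%st). Guest time is skipped because the kernel already includes it in user and nice.

```python
import time

# Order of the first fields on the aggregate "cpu" line of /proc/stat (Linux).
# guest/guest_nice are left out: the kernel already counts them in user/nice.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Map the tick counters on a 'cpu ...' line to field names."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(v) for v in parts[1:])))  # zip stops after FIELDS


def cpu_percentages(before, after):
    """Share of elapsed ticks spent in each state between two readings."""
    delta = {k: after[k] - before[k] for k in before}
    total = sum(delta.values())
    if total == 0:
        return {k: 0.0 for k in delta}
    return {k: 100.0 * v / total for k, v in delta.items()}


def sample_proc_stat(interval=1.0, path="/proc/stat"):
    """Take two readings `interval` seconds apart on a live Linux system."""
    def read():
        with open(path) as f:
            return parse_cpu_line(f.readline())
    before = read()
    time.sleep(interval)
    return cpu_percentages(before, read())
```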

However, it is not obvious to me from the data above that there is much wrong with your setup. While you can probably wring a bit more performance out of the box, you may simply be approaching its limit.

Update:

Clarification re: comment below.

A typical network-oriented TCP server consists of a daemon that has a listening socket and a number of open connections to clients. Each of these sockets has a process waiting on it (one process may wait on many sockets). Those processes are in the sleeping state and are woken up by the OS when data arrives. If the server is efficient (say, a static web server), you may never catch a process running, as it takes only some 100 microseconds to wake up, serve some data and go back to sleep.
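The sleep/wake cycle described above can be seen in a small Python sketch using the standard selectors module: one process registers several connected sockets and then blocks in select(), which ps and top would report as state S, until the kernel says one of them has data. socketpair() stands in for real client connections, and upper-casing the bytes stands in for serving a request. This illustrates the general pattern only; it is not how Apache works internally (the prefork MPM, for instance, handles one connection per process at a time).

```python
import selectors
import socket


def serve_ready(sel, timeout=None):
    """Sleep in select() until a registered socket is readable, then answer it."""
    handled = 0
    for key, _events in sel.select(timeout):
        conn = key.fileobj
        data = conn.recv(4096)
        if data:
            conn.sendall(data.upper())  # stand-in for "serve some data"
            handled += 1
    return handled


def demo():
    """One process, three client connections; only one client sends anything."""
    sel = selectors.DefaultSelector()
    pairs = [socket.socketpair() for _ in range(3)]  # (server side, client side)
    try:
        for server_side, _client in pairs:
            sel.register(server_side, selectors.EVENT_READ)
        pairs[1][1].sendall(b"hello")
        handled = serve_ready(sel, timeout=1.0)
        reply = pairs[1][1].recv(4096)
        return handled, reply
    finally:
        sel.close()
        for a, b in pairs:
            a.close()
            b.close()


if __name__ == "__main__":
    print(demo())
```

Only the socket that received data wakes the process; the other two connections cost nothing while they stay quiet, which is why a busy server can show mostly sleeping processes.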

Update 2:

A modern OS allocates free memory to new disk buffers until it runs out of memory and then reuses the least recently used buffers. Thus, memory will always be full. Furthermore, there are several ways in which two processes may both report the same page of memory as part of their size. The upshot is that a) a modern OS always appears to be out of memory, and b) it is difficult to tell exactly how memory is used. The best simple indicator is the buffers and cached figures: you want them to be a large fraction of physical memory. On this box more than 30% of memory is cached disk data.
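The same reasoning can be applied to the top output in the question: subtracting buffers and cached from "used" gives a rough idea of what processes actually hold. A minimal Python sketch that parses /proc/meminfo-style text (the source free and top read these figures from) is below; the sample simply restates the question's top numbers in that format. The result is only an approximation, since Cached can include things such as shared memory that are not freely reclaimable.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style lines ("Name:   1234 kB") into a dict of numbers."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and rest.split():
            info[name.strip()] = int(rest.split()[0])
    return info


def memory_breakdown(info):
    """Split 'used' memory into disk cache and (roughly) process memory, in kB."""
    total = info["MemTotal"]
    used = total - info["MemFree"]            # what top labels "used"
    cache = info["Buffers"] + info["Cached"]  # disk buffers and page cache
    return {
        "used_kb": used,
        "process_kb": used - cache,           # approximate memory held by programs
        "cache_fraction": cache / total,
    }


# The figures from the top output in the question, restated in meminfo format.
SAMPLE = """MemTotal:        4194528 kB
MemFree:          638832 kB
Buffers:           27748 kB
Cached:          1458672 kB
"""

if __name__ == "__main__":
    print(memory_breakdown(parse_meminfo(SAMPLE)))
```

For the question's numbers this gives 2,069,276 kB (about 2 GB) held by processes, roughly in line with what htop reported, and about 35% of RAM in buffers and cache.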

Solution 3

I had this same problem. mytop showed lots of queries in the queue. I added indexes to my tables and the problem went away.

Solution 4

Any process not in state S (interruptible sleep) deserves a look. On Linux the load average counts processes in state R (running or waiting for a CPU) and state D (uninterruptible sleep, which usually means waiting for I/O from a disk or network device). Zombie processes are also worth ruling out, although they do not add to the load average themselves.

To list everything that is not sleeping, try the following command (cut -c3- strips the leading F column so that each line starts with the state letter):

    ps -efl | cut -c3- | egrep -v "^S"

You don't have a lot of iowait time listed, so D-state processes are probably not the main contributor.
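For a per-thread view, here is a minimal Linux-only Python sketch that reads the state letter of every thread from /proc/<pid>/task/<tid>/stat. Threads matter because the kernel counts each one separately in the load average, so a busy multithreaded mysqld can add several to it even though ps -efl shows it as a single line. The helper load_contributors() is a hypothetical name that adds up R and D, the two states Linux includes in the load average.

```python
import os
from collections import Counter


def state_from_stat(stat_line):
    """Return the state letter from a /proc/<pid>/stat or .../task/<tid>/stat line.

    The format is "pid (comm) state ..."; comm may contain spaces or ')',
    so split after the last ')'.
    """
    return stat_line[stat_line.rfind(")") + 2]


def count_task_states(proc="/proc"):
    """Tally the state of every thread on the system (Linux only)."""
    counts = Counter()
    for pid in os.listdir(proc):
        if not pid.isdigit():
            continue
        task_dir = os.path.join(proc, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # process exited while we were scanning
        for tid in tids:
            try:
                with open(os.path.join(task_dir, tid, "stat")) as f:
                    counts[state_from_stat(f.read())] += 1
            except OSError:
                continue  # thread exited while we were scanning
    return counts


def load_contributors(counts):
    """Tasks the Linux load average counts: runnable (R) plus uninterruptible (D)."""
    return counts["R"] + counts["D"]


if __name__ == "__main__" and os.path.isdir("/proc/self/task"):
    c = count_task_states()
    print(dict(c), "R+D =", load_contributors(c))
```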

The 100% CPU usage from mysqld might also explain your intermittent hangups. (Maybe it only 'sometimes' gets pegged?) The load average might be a red herring, or not the root cause of your problem.

Also, it appears your machine is using 3.5GB out of 4GB of your RAM. free -m can give you a bit better view of what's getting used.


Author: lima

Updated on September 18, 2022

Comments

  • lima (over 1 year ago)

    As far as I can tell, the load average on my server (Ubuntu Linux 8.04.1) is way too high, and in practice I see it slows down or stops serving during peak hours.

    It's a fairly stock LAMP stack powering a single site (image hosting) that obviously serves a lot of content (images) from disk, but the images need to go through PHP to be served. Aside from the general advice to use a cache/proxy approach for this, I'm at a loss as to why it's apparently using less than half of the available resources (4GB RAM, it's a Linode 4096).

    I'm quite a noob at Linux, so please ask for whatever might be useful. This is a portion of htop (MySQL shows 98.9% CPU usage, but that was an outlier; it uses under 1% almost all the time):

      1  [|||||||||||||||||||||||||||||||||||         69.0%]     Tasks: 355 total, 6 running
    
      2  [|||||||||||||||||||||||                     44.8%]     Load average: 18.32 15.02 11.58 
      3  [||||||||||||||||||||||||||||||||||||        71.9%]     Uptime: 01:10:22
      4  [|||||||||||||||||||||||||||||               57.9%]
      Mem[||||||||||||||||||||||||||||||||||||||2190/4096MB]
      Swp[|                                         0/127MB]
    
      PID USER     PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command                                                  
     2345 mysql     18   0  177M 72640  5140 S 98.9  1.7  7:47.58 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql 
     9350 www-data  16   0 48940 24304  4376 R 13.7  0.6  0:01.05 /usr/sbin/apache2 -k start
     9301 mysql     15   0  177M 72640  5140 S 10.0  1.7  0:00.17 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql 
     9186 mysql     17   0  177M 72640  5140 S 10.0  1.7  0:00.22 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql 
     9150 www-data  15   0 58400 33900  4476 S  8.1  0.8  0:02.03 /usr/sbin/apache2 -k start
     9077 mysql     15   0  177M 72640  5140 S  8.1  1.7  0:00.39 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql 
     9270 mysql     15   0  177M 72640  5140 S  7.5  1.7  0:00.12 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql 
     9037 mysql     16   0  177M 72640  5140 S  7.5  1.7  0:00.45 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql 
     9333 www-data  15   0 35724 11260  4560 S  6.2  0.3  0:03.88 /usr/sbin/apache2 -k start
    

    This is the current apache2.conf, though I've tried lots of combinations and asked here in the past:

    Timeout 90
    KeepAlive On
    MaxKeepAliveRequests 150
    KeepAliveTimeout 3
    <IfModule mpm_prefork_module>
        StartServers          1
        MinSpareServers       1
        MaxSpareServers      5
        MaxClients          275
        ServerLimit          275
        MaxRequestsPerChild   1250
    </IfModule>
    

    UPDATE: As asked, this is a portion of top:

    top - 15:07:31 up  1:46,  2 users,  load average: 12.83, 10.64, 10.14
    Tasks: 223 total,  17 running, 206 sleeping,   0 stopped,   0 zombie
    Cpu(s): 84.3%us,  8.8%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  1.0%si,  5.9%st
    Mem:   4194528k total,  3555696k used,   638832k free,    27748k buffers
    Swap:   131064k total,      588k used,   130476k free,  1458672k cached
    
      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                   
     2345 mysql     17   0  180m  76m 5140 S   55  1.9  13:09.79 mysqld                                                    
    12479 www-data  18   0 73224  47m 4552 S   48  1.2   0:03.74 apache2                                                   
    12294 www-data  17   0 71788  46m 4472 R   39  1.1   0:05.78 apache2                                                   
    12382 www-data  17   0 73744  48m 4460 R   33  1.2   0:03.19 apache2                                                   
    

    UPDATE: As suggested (by Christopher Karel, thanks), here are the active processes (output from ps -efl | cut -c3- | egrep -v "^S"). On average, it shows 1-5 apache2 processes. Does this make sense given my current apache2.conf and load average?

    T root     12519 12508  0  75   0 -   612 finish 15:07 pts/1    00:00:00 top
    R www-data 18677  2774  1  76   0 - 17130 -      16:23 ?        00:00:04 /usr/sbin/apache2 -k start
    R www-data 18965  2774  2  76   0 - 13397 -      16:26 ?        00:00:04 /usr/sbin/apache2 -k start
    R www-data 19047  2774  2  76   0 - 11613 -      16:28 ?        00:00:00 /usr/sbin/apache2 -k start
    R www-data 19088  2774 55  76   0 - 10482 -      16:29 ?        00:00:00 /usr/sbin/apache2 -k start
    R www-data 19091  2774  0  81   0 -  8579 -      16:29 ?        00:00:00 /usr/sbin/apache2 -k start
    R www-data 19092  2774  0  81   0 -  8355 -      16:29 ?        00:00:00 /usr/sbin/apache2 -k start
    R www-data 19093  2774  0  82   0 -  8322 -      16:29 ?        00:00:00 /usr/sbin/apache2 -k start
    R root     19094 18557  0  77   0 -   593 -      16:29 pts/2    00:00:00 ps -efl
    R root     19095 18557  0  78   0 -   729 -      16:29 pts/2    00:00:00 -bash
    R root     19096 18557  0  78   0 -   729 -      16:29 pts/2    00:00:00 -bash
    
    • EEAA (almost 13 years ago)
      Run top and post the wa% value from the top summary area.
    • HTTP500 (almost 13 years ago)
      What is your IOWAIT? Check %wa in top or %iowait in iostat. Cheers
    • HTTP500 (almost 13 years ago)
      What does "but they need to go through PHP to be served" mean? Are you processing the images before serving?
    • lima (almost 13 years ago)
      @jasondbecker Yes, some need some processing (resizing, watermarking, etc) and some just need routing (validating the URL against the database, since I'm not using the real path to the image)
    • HTTP500 (almost 13 years ago)
      @fandelost Well, it looks like the resizing may be computationally expensive. How many vCPUs do you have? Check with cat /proc/cpuinfo.
    • lima (almost 13 years ago)
      @jasondbecker 4 x Intel(R) Xeon(R) CPU 5130 @ 2.00GHz. Actually, I was aware that dynamic resizing might cause trouble, so I had been avoiding it, but the load problem goes way back. The resizing was rolled out today; before that, the script only did the routing (the MySQL queries for that are optimized). Watermarking is done only once, when the image is uploaded, and the result is kept in a separate directory (sorry for mixing that into the other comment).
  • lima (almost 13 years ago)
    Thank you! As you can see in the top output, I have 0 zombie processes, and most of the processes are actually in state S (I suspected this had something to do with the problem). As I said, MySQL processes don't consume more than 1% CPU most (almost all) of the time; that was just an odd capture from htop. Now that you mention it, free -m also indicates 3.78GB used, which is odd since both htop and Virtualmin say 1.85GB is used. What's going on here? Thanks again!
  • Christopher Karel (almost 13 years ago)
    Oops, right on the zombie count. But you should still run that ps command: it will essentially show those 10 processes that ARE running (not in state S). That's a good first step. As for the htop memory indicators, I expect that's the difference between actively used memory and Linux's memory caching. See something like: serverfault.com/questions/85470/….
  • lima (almost 13 years ago)
    Interesting, ps shows just 4 apache2 processes running (plus top and the other commands). Does that make any sense? Even the MySQL processes are sleeping. I could see this happening if there were no traffic, but that's not the case, especially since the load average is still >10 and the site slows down every now and then.
  • lima (almost 13 years ago)
    Thank you, I'll look into atop. I did look at the slow query log, added indexes and optimized some queries not so long ago. What puzzles me most is this: if I'm reaching the server's limits, how is it that RAM is only half used and most (like 90%) of the processes are (S)leeping? I'm not disputing your conclusion, because that was my belief all along, but the numbers don't make any sense to me now. No matter what I change in apache2.conf (server/child/process limits), the results never change in practice.
  • Bittrance (almost 13 years ago)
    Sleeping processes may well be waiting for data from network sockets (e.g. from the user's browser), and so should be asleep. I don't know how htop calculates its memory value (summing up RSS, perhaps?), but I would not trust it. top seems to give you more reasonable memory numbers.
  • sciurus (almost 13 years ago)
    Seconding atop; it will give you a more comprehensive view of the system than htop. See lwn.net/Articles/387202 for an overview.
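On the cat /proc/cpuinfo question in the comments: a load average only means something relative to the number of CPUs, since each logical CPU runs one task at a time and further runnable tasks have to wait. A minimal Python sketch that counts the processor entries in /proc/cpuinfo-style text and divides the three load averages by that count; os.getloadavg() and os.cpu_count() are the standard-library shortcuts on a live Unix system. With the figures from the question (4 CPUs, load average 18.32 15.02 11.58), this gives roughly 4.6, 3.8 and 2.9 per CPU.

```python
import os


def count_cpus(cpuinfo_text):
    """Count 'processor' entries in /proc/cpuinfo-style text (one per logical CPU)."""
    return sum(
        1 for line in cpuinfo_text.splitlines()
        if line.split(":")[0].strip() == "processor"
    )


def load_per_cpu(loads, ncpus):
    """Divide the 1-, 5- and 15-minute load averages by the CPU count."""
    return tuple(load / ncpus for load in loads)


if __name__ == "__main__" and hasattr(os, "getloadavg"):
    print(load_per_cpu(os.getloadavg(), os.cpu_count() or 1))
```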