How is the "load average" interpreted in "top" output? Is it the same for all distributions?

Solution 1

The CPU load is the length of the run queue, i.e. the length of the queue of processes waiting to be run.

The uptime command may be used to see the average length of the run queue over the last minute, the last five minutes, and the last 15 minutes, just like what's usually displayed by top.

A high load value means the run queue is long; a low value means it is short. So, if the one-minute load average is 0.05, it means that, on average during that minute, there were 0.05 processes waiting to run in the run queue. It is not a percentage. This is, AFAIK, the same on all Unices (although some Unices may not count processes waiting for I/O, which I think Linux does; OpenBSD, for a while only, also counted kernel threads, so that the load was always 1 or more).

The Linux top utility gets the load values from the kernel, which writes them to /proc/loadavg. Looking at the sources for procps-3.2.8, we see that:

  1. To display the load averages, the sprint_uptime() function is called in top.c.
  2. This function lives in proc/whattime.c and calls loadavg() in proc/sysinfo.c.
  3. That function simply opens LOADAVG_FILE to read the load averages.
  4. LOADAVG_FILE is defined earlier as "/proc/loadavg".
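
The steps above boil down to reading three numbers from one text file. Here is a minimal sketch of that read in C. It is not the procps code itself: the helper names parse_loadavg_line and read_loadavg are made up for illustration, and it assumes the first three whitespace-separated fields of the file are the 1, 5 and 15 minute averages.

```c
#include <stdio.h>

/* Hypothetical helper: parse the first three fields of a
 * /proc/loadavg-style line into out[0..2].
 * Returns 0 on success, -1 if the line is malformed. */
int parse_loadavg_line(const char *line, double out[3])
{
    if (sscanf(line, "%lf %lf %lf", &out[0], &out[1], &out[2]) != 3)
        return -1;
    return 0;
}

/* Hypothetical helper: read the averages from the given path
 * (normally "/proc/loadavg" on Linux).
 * Returns 0 on success, -1 if the file cannot be read or parsed. */
int read_loadavg(const char *path, double out[3])
{
    char buf[128];
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;
    if (fgets(buf, sizeof buf, fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    return parse_loadavg_line(buf, out);
}
```

On a Linux system, calling read_loadavg("/proc/loadavg", v) should fill v with the same three values that top prints on its first line.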

Solution 2

The load average is typically calculated by the kernel. Applications such as top and uptime may use the getloadavg(3) library call to access this (it's meant to be portable across different Unix versions). On Linux this typically results in a read from /proc/loadavg. On FreeBSD it's a system call.

For example:

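#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  double ld[3];

  /* getloadavg() returns the number of samples retrieved, or -1 on failure */
  if (getloadavg(ld, 3) != 3) {
    fprintf(stderr, "getloadavg failed\n");
    return 1;
  }
  printf("Load %.2f %.2f %.2f\n", ld[0], ld[1], ld[2]);
  return 0;
}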

uptime and top both make similar calls to get their data.

Now the 1/5/15 minute load averages are the average number of processes on the run queue. Different operating systems may calculate this in different ways; the biggest difference normally is whether processes waiting for I/O (e.g. blocked on disk) count as runnable or not. On Linux they do.

So a load average of 3.4 means there was an average of 3.4 processes on the run queue within the sample window (1, 5, 15 minutes).

A high load average doesn't necessarily mean an overloaded server, though. If you have 16 cores then your load average can be 16 without stress. You could also have a lot of application fork() calls which can result in a large number of processes being created/destroyed, leading to a high load average, but without massively impacting server performance. It should only be used as a guide, along with other metrics such as %CPU busy.
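
To make the 16-core point concrete, one common rule of thumb is to divide the load average by the number of CPUs. The sketch below does that. The helper names load_per_cpu and online_cpus are made up for illustration, and _SC_NPROCESSORS_ONLN is a widely available extension (glibc, the BSDs) rather than something POSIX guarantees.

```c
#include <unistd.h>

/* Hypothetical helper: load average divided by the number of CPUs.
 * A result around 1.0 means roughly one runnable process per CPU. */
double load_per_cpu(double load, long ncpus)
{
    if (ncpus < 1)
        ncpus = 1;   /* fall back to one CPU if the count is unknown */
    return load / (double)ncpus;
}

/* Hypothetical helper: number of CPUs currently online, or -1 on error.
 * _SC_NPROCESSORS_ONLN is an extension, not part of POSIX. */
long online_cpus(void)
{
    return sysconf(_SC_NPROCESSORS_ONLN);
}
```

With 16 CPUs, load_per_cpu(16.0, 16) gives 1.0, i.e. about one runnable process per core. This is still only a guide, for the same reasons as above.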

Solution 3

The load average is not specific to any particular tool or distribution. It is a measurement provided by the kernel, or more precisely the scheduler, and is therefore distribution-independent. The measurement is exposed through the proc filesystem, in /proc/loadavg.

On to its interpretation: the load average metric is not an indication of how hard the CPU is working, but of how much work needs to be done. I don't think there is really a need to multiply it by anything, because it's a direct measurement of the number of processes in either a runnable or uninterruptible state.

Try checking out the following two man pages: getloadavg(3) and uptime for more information.

The load average metric can be a difficult concept to understand at first. I think a lot of people take it as an indication of how hard the CPU is working, but that's not really it.

Author by

Angelos Asonitis

Updated on September 18, 2022

Comments

  • Angelos Asonitis
    Angelos Asonitis almost 2 years

    I would like to know if the output of a Red-Hat based linux could be differently interpreted by a Debian based linux.

    To make the question even more specific, what I am after is understanding how the "load average" from the first line of the top command on a Red-Hat system is interpreted, and how to verify this from official documentation or code.

    [There are many ways to approach this subject, all of which are acceptable answers to the question]

    One potential approach would be to find where this information is officially documented.
    Another would be to find the version of the code that top is built from in the specific distribution and version I am working on.

    The command output I am getting is:

        top - 13:08:34 up  1:19,  2 users,  load average: 0.02, 0.00, 0.00
        Tasks: 183 total,   1 running, 182 sleeping,   0 stopped,   0 zombie
        Cpu(s):  0.2%us,  0.2%sy,  0.0%ni, 96.8%id,  2.7%wa,  0.0%hi,  0.0%si,  0.0%st
        Mem:   3922520k total,   788956k used,  3133564k free,   120720k buffers
        Swap:  2097148k total,        0k used,  2097148k free,   344216k cached
    

    In this case how can I interpret the load average value?

    From one documentation source I have found that the load average covers the last minute, and from another that it should be interpreted after being multiplied by 100.
    So, the question is:

    Is it 0.02% or 2% loaded?

    Documentation sources and versions:

    1. The first one starts with

       TOP(1)                        Linux User’s Manual                       TOP(1)
      
       NAME
              top - display Linux tasks
      

      Source: man top in my RedHat distribution
      Ubuntu also has the version with "tasks" that does not explain the load average in:
      http://manpages.ubuntu.com/manpages/precise/man1/top.1.html

    2. The second one starts with

       TOP(1)                          User Commands                         TOP(1)
      
      NAME         top
      
      top - display Linux processes
      

      Source:
      http://man7.org/linux/man-pages/man1/top.1.htm

    3. This one starts with:

      TOP(1)
      
      NAME
      
      top - display and update information about the top cpu processes
      

      Source: http://www.unixtop.org/man.shtml

    The first one can be seen with man top in RHEL or in the online Ubuntu documentation, and it has no explanation of the output format (nor of the load average, which is what I am interested in).

    The second one contains a brief explanation, pointing out that the load average has to do with the last 1 minute, but nothing about the interpretation of its value!

    I quote directly from the second source:

    2a. UPTIME and LOAD Averages
    This portion consists of a single line containing:
    program or window name, depending on display mode
    current time and length of time since last boot
    total number of users
    system load avg over the last 1, 5 and 15 minutes

    So, if this explanation is indeed correct, it is just enough to understand that the load average is about the last 1 minute.
    But it does not explain the format of the number.

    In the third explanation, it says that:

    When specifying numbers for load averages, they should be multiplied by 100.

    This explanation suggests that 0.02 means 2% and not 0.02%. But is this correct? Additionally, is it correct for all distributions of linux and potentially different implementations of top?
    To find the answer to this question, I tried to go through the code by searching for it online. But I found at least two different versions of top related to RHEL out there: builtin-top.c and the refactored top.c. Both are copyrighted by Red-Hat, as the notice at the beginning of the code says, so it seems logical that RHEL uses one of these.
    http://lxr.free-electrons.com/source/tools/perf/builtin-top.c
    http://lxr.free-electrons.com/source/tools/perf/util/top.c

    So, before delving into that much code, I wanted an opinion about where to focus to form an accurate understanding on how cpu load is interpreted?

    From information given in the answers below, in addition to some personal search, I have found that:

    1. The top that I am using is contained in the package procps-3.2.8. Which can be verified by using top -v.
    2. In the version of procps-3.2.8 that I have downloaded from the official website, it seems that the uptime tool gets its information directly from the procfs file /proc/loadavg (not using the library function getloadavg()).
    3. The top command does not use the function getloadavg() either. I managed to verify that top does indeed do the same thing as the uptime tool to show the load averages. It actually calls the uptime tool's function, which gets its information from the procfs file /proc/loadavg.

      So, everything points to the /proc/loadavg file! Thus, to form an accurate understanding of the load average produced by top, one must read the kernel code to see how the file loadavg is written.

    There is also an excellent article pointed out in one of the answers that provides a layman's terms explanation of the three values of loadavg.
    So, despite the fact that all answers have been equally useful and helpful, I am going to mark the one that pointed to the article http://www.linuxjournal.com//article/9001 as "the" answer to my question. Thank you all for your contributions!

    Additionally, from the question Understanding top and load average, I have found a link to the kernel source code that points to the spot where loadavg is calculated. There is a huge comment there explaining the way it works, and this part of the code is in C!
    The link to the code is http://lxr.free-electrons.com/source/kernel/sched/loadavg.c
    Again I am not trying to engage in any form of plagiarism, I am just adding this for completeness. So, I am repeating that the link to the kernel code was found from one of the answers in Understanding top and load average.
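
As a rough sketch of what that part of the kernel does: the three values are not plain arithmetic means. Each is an exponentially damped moving average, updated roughly every five seconds in fixed-point arithmetic. The constants below are believed to match the kernel headers (FSHIFT 11, EXP_1 1884, EXP_5 2014, EXP_15 2037), but treat them as assumptions to check against your own kernel tree. The function name calc_load_sketch is made up, and the rounding that newer kernels apply is left out.

```c
#define FSHIFT   11                /* bits of fractional precision */
#define FIXED_1  (1UL << FSHIFT)   /* 1.0 in fixed point (2048) */
#define EXP_1    1884UL            /* decay factor for the 1 minute average */
#define EXP_5    2014UL            /* decay factor for the 5 minute average */
#define EXP_15   2037UL            /* decay factor for the 15 minute average */

/* Simplified sketch of one update step (assumption: no rounding).
 * load   - previous average, in fixed point
 * decay  - one of EXP_1, EXP_5, EXP_15
 * active - current number of counted tasks times FIXED_1 */
unsigned long calc_load_sketch(unsigned long load,
                               unsigned long decay,
                               unsigned long active)
{
    return (load * decay + active * (FIXED_1 - decay)) >> FSHIFT;
}
```

Starting from an idle system (load 0) with one task counted, a single update gives 164/2048, about 0.08, for the one-minute figure. If that task stays runnable, the value climbs gradually towards 1.00 rather than jumping there, which is why a value such as 0.02 is a count of processes and not a percentage.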

  • Kusalananda
    Kusalananda almost 8 years
    @AgelosAssonitis This LinuxJournal article from 2006 may be interesting. It points to a couple of source files in the Linux Kernel source tree that you may want to look at. I'm no Linux user, so I can't tell if those files are still there in the latest incarnation of the kernel: linuxjournal.com/article/9001
  • Angelos Asonitis
    Angelos Asonitis almost 8 years
    Ok, but are we absolutely certain that the load average described in getloadavg() function description, is the same load average that the top command in rhel shows? I am asking this because I performed a full text search in the un-tarred contents of the package procps-3.2.8 (which is indicated by the command top -v) and there is not a single mention of the function getloadavg()! So, perhaps top calculates the load average in a different manner....
  • Angelos Asonitis
    Angelos Asonitis almost 8 years
    Interesting article indeed! I am just looking into the procps package to see if indeed the top command takes its value from the /proc/loadavg file and if this is the same as the result of the getloadavg() function.
  • Kusalananda
    Kusalananda almost 8 years
    @AgelosAssonitis The definite place for the documentation of load averages is your kernel source tree. POSIX does not define what the CPU load is, and doesn't mention the words "load average" anywhere. The top and uptime utilities are not POSIX utilities and the getloadavg() library function is also not defined in POSIX (it has a BSD lineage though).
  • Angelos Asonitis
    Angelos Asonitis almost 8 years
    So, from what I gather, there is no official documentation explaining how these values are formed other than the kernel source itself. Is that correct? Before getting there, though, I still need to tie the value presented in top to the procfs file loadavg...
  • Kusalananda
    Kusalananda almost 8 years
    @AgelosAssonitis This is Linux-specific. top on an OpenBSD machine gets the load average through a sysctl() call to the kernel, for example.
  • Angelos Asonitis
    Angelos Asonitis almost 8 years
    @Kusalanda Ok, thanks for this clarification too (mostly for the record) because I am interested in RHEL but it does make the topic more complete. BTW, at the same time I was going through the same functions... lol!
  • leonbloy
    leonbloy almost 8 years
    I wonder if it's correct to call that a "CPU load". In Linux, at least, the queue measures the processes waiting not only for CPU but also (and in practice mostly) for I/O howtogeek.com/194642/…