Isolating cause of higher CPU usage on RHEL 6 vs RHEL 5

kernel cpu linux-kernel scheduling

6,330

You ask:

Is there some test I can run with each of them to test if this high CPU usage is just a difference in accounting of CPU usage that is making it look artificially high? Or if actual CPU cycles are being stolen by the CFS?

What if you ran a CPU benchmark alongside your test_select_small program and checked whether the benchmark's performance changes depending on the host OS version?

There are lots of choices: the classic advice is always "use something that represents the kind of load you'll have". But the cool kids have always just used povray.

Dave Johansen

Updated on September 18, 2022

Comments

Dave Johansen over 1 year
I'm currently looking to move our system from RHEL 5 to RHEL 6, but I've run into a snag with unexpectedly high CPU usage on the RHEL 6 machines. It appears that this is due, at least in part, to the use of select to do an interruptible sleep. Here's a simple example that shows the behaviour:
```
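#include <sys/select.h>

int main()
{
  struct timeval ts;
  for (unsigned int ii=0; ii<10000; ++ii) {
    // Linux's select() may modify the timeout, so reset it on every iteration.
    ts.tv_sec = 0;
    ts.tv_usec = 1000;  // 1 ms
    // No file descriptors are watched; select() is used purely as a sleep.
    select(0, 0, 0, 0, &ts);
  }

  return 0;
}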
```
On a RHEL 5 machine it stays at 0% CPU usage, but on the same hardware with RHEL 6 installed it uses about 0.5% of the CPU. So when 30 to 50 programs are running that use select to sleep, they eat up a large amount of CPU unnecessarily.

I opened a Bugzilla and tried running OProfile. It simply shows 100% in main for the application, and just over 99% in poll_idle when looking at the kernel (I have idle=poll set in my grub options so everything can be captured).

Any other ideas of what I can do to try and isolate what the cause of the higher CPU usage is?

UPDATE: I found the perf tool and got the following output:
```
# Events: 23K cycles
#
# Overhead  Command        Shared Object                                Symbol
# ........  .......  ...................  ....................................
#
    13.11%  test_select_sma  [kernel.kallsyms]    [k] find_busiest_group
     5.88%  test_select_sma  [kernel.kallsyms]    [k] schedule
     5.00%  test_select_sma  [kernel.kallsyms]    [k] system_call
     3.77%  test_select_sma  [kernel.kallsyms]    [k] copy_to_user
     3.39%  test_select_sma  [kernel.kallsyms]    [k] update_curr
     3.22%  test_select_sma  ld-2.12.so           [.] _dl_sysinfo_int80
     2.83%  test_select_sma  [kernel.kallsyms]    [k] native_sched_clock
     2.72%  test_select_sma  [kernel.kallsyms]    [k] find_next_bit
     2.69%  test_select_sma  [kernel.kallsyms]    [k] cpumask_next_and
     2.58%  test_select_sma  [kernel.kallsyms]    [k] native_write_msr_safe
     2.47%  test_select_sma  [kernel.kallsyms]    [k] sched_clock_local
     2.39%  test_select_sma  [kernel.kallsyms]    [k] read_tsc
     2.26%  test_select_sma  [kernel.kallsyms]    [k] do_select
     2.13%  test_select_sma  [kernel.kallsyms]    [k] restore_nocheck
```
It appears that the higher CPU usage is from the scheduler. I also used the following bash script to kick off 100 of these simultaneously:
```
#!/bin/bash

for i in {1..100}
do
  ./test_select_small &
done
```
On RHEL 5 the CPU usage stays close to 0%, but on RHEL 6 there's a non-trivial amount of CPU usage in both user and sys. Any ideas on how to track down the true source of this and hopefully fix it?

I also tried this test on a current Arch Linux build and Ubuntu 11.10 and saw similar behaviour, so this appears to be some type of kernel issue and not just a RHEL issue.

UPDATE2: I hesitate a bit to bring this up because I know that it's a huge debate, but I tried out a kernel with the BFS patches on Ubuntu 11.10 and it didn't show the same high system CPU usage (user CPU usage seemed about the same).

Is there some test I can run with each of them to test if this high CPU usage is just a difference in accounting of CPU usage that is making it look artificially high? Or if actual CPU cycles are being stolen by the CFS?

UPDATE3: The analysis done for this question seems to indicate that the problem is related to the scheduler, so I created a new question to discuss the results.

UPDATE4: I added some more information to the other question.

UPDATE5: I added some results to the other question from a simpler test that still demonstrates the issue.
- Admin about 12 years
  
  It seems like Red Hat has pinpointed this to glibc. Did you look for code changes regarding select there?
- Admin about 12 years
  
  The glibc categorization was done by me when I originally submitted the bugzilla.
- Admin about 12 years
  
  Sounds reasonable to me (rather than a kernel problem). Do you get similar results with multiple concurrent sleeps? What are the glibc versions on Ubuntu 11.10, Arch Linux and RHEL 6?
- Admin about 12 years
  
  Yes, the same result with both poll and usleep sleeping for 1 ms. As for glibc, RHEL 5 is 2.5, RHEL 6 is 2.12, Ubuntu 11.10 is 2.13, and I believe Arch is 2.15, but I'd have to check.
- Admin over 11 years
  
  It seems you found the answer to this original question yourself. Post it as an answer here and earn your points for it!
Dave Johansen about 12 years

I think that was the idea that I was referring to. Do you have a recommendation of such a benchmark app that gives consistent timing results that I could use?
evanda about 12 years

@DaveJohansen - added note on povray
Dave Johansen about 12 years

Unfortunately, unless it comes with RHEL, it can take at least a week or two to get any software on these systems. I made my own little test program and created a new question to discuss the results.