Isolating cause of higher CPU usage on RHEL 6 vs RHEL 5


You ask:

Is there some test I can run with each of them to test if this high CPU usage is just a difference in accounting of CPU usage that is making it look artificially high? Or if actual CPU cycles are being stolen by the CFS?

What if you ran a CPU benchmark while your test_select_small programs are running, and checked whether its performance changes depending on the host OS version?

There are lots of choices: the classic advice is always "use something that represents the kind of load you'll have", but the cool kids have always just used povray.


Author: Dave Johansen

Updated on September 18, 2022

Comments

  • Dave Johansen
    Dave Johansen over 1 year

    I'm currently looking to move our system from RHEL 5 to RHEL 6, but I've run into a snag with unexpectedly high CPU usage on the RHEL 6 machines. It appears that this may be due at least in some part to the use of select to do an interruptible sleep. Here's a simple example that shows the behaviour:

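    #include <sys/select.h>
    
    int main()
    {
      // Sleep for 1 ms, 10000 times, using select() with no file descriptors.
      struct timeval ts;
      for (unsigned int ii=0; ii<10000; ++ii) {
        // Reset the timeout on every pass: Linux select() may modify it.
        ts.tv_sec = 0;
        ts.tv_usec = 1000;
        select(0, 0, 0, 0, &ts);
      }
    
      return 0;
    }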
    

    On a RHEL 5 machine it stays at 0% CPU usage, but on the same hardware with RHEL 6 installed it uses about 0.5% of the CPU. With 30 to 50 programs using select to sleep this way, that adds up to a large amount of CPU used unnecessarily.
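
    To put a number on the per-process cost instead of reading a percentage off top, the program can report its own CPU time with getrusage. This is only a sketch: `to_seconds` and `sleep_loop` are hypothetical helpers, and the default of 1000 iterations (about one second of sleeping) is an arbitrary choice. It reports the kernel's own accounting, so on its own it cannot tell an accounting difference apart from cycles that are really consumed.

    ```cpp
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/resource.h>
    #include <sys/select.h>
    #include <sys/time.h>

    // Hypothetical helper: convert a timeval to seconds.
    double to_seconds(const timeval &tv)
    {
      return tv.tv_sec + tv.tv_usec / 1e6;
    }

    // The same 1 ms select() sleep as in the test program above.
    void sleep_loop(unsigned int count)
    {
      for (unsigned int ii = 0; ii < count; ++ii) {
        timeval ts;
        ts.tv_sec = 0;
        ts.tv_usec = 1000;
        select(0, 0, 0, 0, &ts);
      }
    }

    int main(int argc, char **argv)
    {
      // Optional iteration count on the command line; 1000 is an assumption.
      unsigned int count = (argc > 1) ? (unsigned int)atoi(argv[1]) : 1000;
      sleep_loop(count);

      // CPU time charged to this process, split into user and system time.
      struct rusage usage;
      getrusage(RUSAGE_SELF, &usage);
      printf("user %.6f s, sys %.6f s over %u sleeps\n",
             to_seconds(usage.ru_utime), to_seconds(usage.ru_stime), count);
      return 0;
    }
    ```

    Running it with the same count on both systems gives user and system times that can be compared directly.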

    I opened a Bugzilla report and tried running OProfile, but it simply shows 100% in main for the application and just over 99% in poll_idle for the kernel (I have idle=poll set in my GRUB options so that everything can be captured).

    Any other ideas of what I can do to try and isolate what the cause of the higher CPU usage is?

    UPDATE: I found the perf tool and got the following output:

    # Events: 23K cycles
    #
    # Overhead  Command        Shared Object                                Symbol
    # ........  .......  ...................  ....................................
    #
        13.11%  test_select_sma  [kernel.kallsyms]    [k] find_busiest_group
         5.88%  test_select_sma  [kernel.kallsyms]    [k] schedule
         5.00%  test_select_sma  [kernel.kallsyms]    [k] system_call
         3.77%  test_select_sma  [kernel.kallsyms]    [k] copy_to_user
         3.39%  test_select_sma  [kernel.kallsyms]    [k] update_curr
         3.22%  test_select_sma  ld-2.12.so           [.] _dl_sysinfo_int80
         2.83%  test_select_sma  [kernel.kallsyms]    [k] native_sched_clock
         2.72%  test_select_sma  [kernel.kallsyms]    [k] find_next_bit
         2.69%  test_select_sma  [kernel.kallsyms]    [k] cpumask_next_and
         2.58%  test_select_sma  [kernel.kallsyms]    [k] native_write_msr_safe
         2.47%  test_select_sma  [kernel.kallsyms]    [k] sched_clock_local
         2.39%  test_select_sma  [kernel.kallsyms]    [k] read_tsc
         2.26%  test_select_sma  [kernel.kallsyms]    [k] do_select
         2.13%  test_select_sma  [kernel.kallsyms]    [k] restore_nocheck
    

    It appears that the higher CPU usage is from the scheduler. I also used the following bash script to kick off 100 of these simultaneously:

    #!/bin/bash
    
    for i in {1..100}
    do
      ./test_select_small &
    done
    

    On RHEL 5 the CPU usage stays close to 0%, but on RHEL 6 there's a non-trivial amount of CPU usage in both user and sys. Any ideas on how to track down the true source of this and hopefully fix it?

    I also tried this test on a current Arch Linux build and Ubuntu 11.10 and saw similar behaviour, so this appears to be some type of kernel issue and not just a RHEL issue.

    UPDATE2: I hesitate a bit to bring this up because I know that it's a huge debate, but I tried out a kernel with the BFS patches on Ubuntu 11.10 and it didn't show the same high system CPU usage (user CPU usage seemed about the same).

    Is there some test I can run with each of them to test if this high CPU usage is just a difference in accounting of CPU usage that is making it look artificially high? Or if actual CPU cycles are being stolen by the CFS?

    UPDATE3: The analysis done involving this question seems to indicate that it's something related to the scheduler, so I created a new question to discuss the results.

    UPDATE4: I added some more information to the other question.

    UPDATE5: I added some results to the other question from a simpler test that still demonstrates the issue.

    • Admin
      Admin about 12 years
      It seems like Red Hat has pinpointed this to glibc. Did you look for code changes regarding select there?
    • Admin
      Admin about 12 years
      The glibc categorization was done by me when I originally submitted the bugzilla.
    • Admin
      Admin about 12 years
      Sounds reasonable to me (rather than a kernel problem). Do you get similar results with multiple concurrent sleeps? What are the glibc versions on Ubuntu 11.10, Arch Linux and RHEL 6?
    • Admin
      Admin about 12 years
      Yes, the same result with both poll and usleep sleeping for 1 ms. As far as glibc goes, RHEL 5 is 2.5, RHEL 6 is 2.12, Ubuntu 11.10 is 2.13, and I believe Arch is 2.15, but I'd have to check.
    • Admin
      Admin over 11 years
      It seems you found the answer to this original question yourself. Post it as an answer here and earn your points for it!
  • Dave Johansen
    Dave Johansen about 12 years
    I think that was the idea that I was referring to. Do you have a recommendation of such a benchmark app that gives consistent timing results that I could use?
  • evanda
    evanda about 12 years
    @DaveJohansen - added note on povray
  • Dave Johansen
    Dave Johansen about 12 years
    Unfortunately, unless it comes with RHEL, it can take at least a week or two to get any software on these systems. I made my own little test program and created a new question to discuss the results.