How to get 100% CPU usage from a C program

c windows linux performance cray

57,395

Solution 1

If you want 100% CPU, you need to use more than 1 core. To do that, you need multiple threads.

Here's a parallel version using OpenMP:

I had to increase the limit to 1000000 to make it take more than 1 second on my machine.

#include <stdio.h>
#include <time.h>
#include <omp.h>

int main() {
    double start, end;
    double runTime;
    start = omp_get_wtime();
    int num = 1,primes = 0;

    int limit = 1000000;

#pragma omp parallel for schedule(dynamic) reduction(+ : primes)
    for (num = 1; num <= limit; num++) { 
        int i = 2; 
        while(i <= num) { 
            if(num % i == 0)
                break;
            i++; 
        }
        if(i == num)
            primes++;
//      printf("%d prime numbers calculated\n",primes);
    }

    end = omp_get_wtime();
    runTime = end - start;
    printf("This machine calculated all %d prime numbers under %d in %g seconds\n",primes,limit,runTime);

    return 0;
}

Output:

This machine calculated all 78498 prime numbers under 1000000 in 29.753 seconds

Here's your 100% CPU:

enter image description here

Solution 2

You're running one process on a multi-core machine - so it only runs on one core.

The solution is easy enough, since you're just trying to peg the processor - if you have N cores, run your program N times (in parallel, of course).

Example

Here is some code that runs your program NUM_OF_CORES times in parallel. It's POSIXy code - it uses fork - so you should run that under Linux. If what I'm reading about the Cray is correct, it might be easier to port this code than the OpenMP code in the other answer.

#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>

#define NUM_OF_CORES 8
#define MAX_PRIME 100000

void do_primes()
{
    unsigned long i, num, primes = 0;
    for (num = 1; num <= MAX_PRIME; ++num) {
        for (i = 2; (i <= num) && (num % i != 0); ++i);
        if (i == num)
            ++primes;
    }
    printf("Calculated %d primes.\n", primes);
}

int main(int argc, char ** argv)
{
    time_t start, end;
    time_t run_time;
    unsigned long i;
    pid_t pids[NUM_OF_CORES];

    /* start of test */
    start = time(NULL);
    for (i = 0; i < NUM_OF_CORES; ++i) {
        if (!(pids[i] = fork())) {
            do_primes();
            exit(0);
        }
        if (pids[i] < 0) {
            perror("Fork");
            exit(1);
        }
    }
    for (i = 0; i < NUM_OF_CORES; ++i) {
        waitpid(pids[i], NULL, 0);
    }
    end = time(NULL);
    run_time = (end - start);
    printf("This machine calculated all prime numbers under %d %d times "
           "in %d seconds\n", MAX_PRIME, NUM_OF_CORES, run_time);
    return 0;
}

Output

$ ./primes 
Calculated 9592 primes.
Calculated 9592 primes.
Calculated 9592 primes.
Calculated 9592 primes.
Calculated 9592 primes.
Calculated 9592 primes.
Calculated 9592 primes.
Calculated 9592 primes.
This machine calculated all prime numbers under 100000 8 times in 8 seconds

Solution 3

we really want to see how fast it can go!

Your algorithm to generate prime numbers is very inefficient. Compare it to primegen that generates the 50847534 primes up to 1000000000 in just 8 seconds on a Pentium II-350.

To consume all CPUs easily you could solve an embarrassingly parallel problem e.g., compute Mandelbrot set or use genetic programming to paint Mona Lisa in multiple threads (processes).

Another approach is to take an existing benchmark program for the Cray supercomputer and port it to a modern PC.

Solution 4

The reason you're getting 15% on a hex core processor is because your code uses 1 core at 100%. 100/6 = 16.67%, which using a moving average with process scheduling (your process would be running under normal priority) could easily be reported as 15%.

Therefore, in order to use 100% cpu, you would need to use all the cores of your CPU - launch 6 parallel execution code paths for a hex core CPU and have this scale right up to however many processors your Cray machine has :)

Solution 5

Also be very aware how you're loading the CPU. A CPU can do a lot of different tasks, and while many of them will be reported as "loading the CPU 100%" they may each use 100% of different parts of the CPU. In other words, it's very hard to compare two different CPUs for performance, and especially two different CPU architectures. Executing task A may favor one CPU over another, while executing task B it can easily be the other way around (since the two CPUs may have different resources internally and may execute code very differently).

This is the reason software is just as important for making computers perform optimal as hardware is. This is indeed very true for "supercomputers" as well.

One measure for CPU performance could be instructions per second, but then again instructions aren't created equal on different CPU architectures. Another measure could be cache IO performance, but cache infrastructure is not equal either. Then a measure could be number of instructions per watt used, as power delivery and dissipation is often a limiting factor when designing a cluster computer.

So your first question should be: Which performance parameter is important to you? What do you want to measure? If you want to see which machine gets the most FPS out of Quake 4, the answer is easy; your gaming rig will, as the Cray can't run that program at all ;-)

Cheers, Steen

View more solutions

57,395

bag-man

Updated on July 05, 2022

Comments

bag-man almost 2 years
This is quite an interesting question so let me set the scene. I work at The National Museum of Computing, and we have just managed to get a Cray Y-MP EL super computer from 1992 running, and we really want to see how fast it can go!

We decided the best way to do this was to write a simple C program that would calculate prime numbers and show how long it took to do so, then run the program on a fast modern desktop PC and compare the results.

We quickly came up with this code to count prime numbers:
```
#include <stdio.h>
#include <time.h>

void main() {
    clock_t start, end;
    double runTime;
    start = clock();
    int i, num = 1, primes = 0;

    while (num <= 1000) { 
        i = 2; 
        while (i <= num) { 
            if(num % i == 0)
                break;
            i++; 
        }
        if (i == num)
            primes++;

        system("clear");
        printf("%d prime numbers calculated\n",primes);
        num++;
    }

    end = clock();
    runTime = (end - start) / (double) CLOCKS_PER_SEC;
    printf("This machine calculated all %d prime numbers under 1000 in %g seconds\n", primes, runTime);
}
```
Which on our dual core laptop running Ubuntu (The Cray runs UNICOS), worked perfectly, getting 100% CPU usage and taking about 10 minutes or so. When I got home I decided to try it on my hex-core modern gaming PC, and this is where we get our first issues.

I first adapted the code to run on Windows since that is what the gaming PC was using, but was saddened to find that the process was only getting about 15% of the CPU's power. I figured that must be Windows being Windows, so I booted into a Live CD of Ubuntu thinking that Ubuntu would allow the process to run with its full potential as it had done earlier on my laptop.

However I only got 5% usage! So my question is, how can I adapt the program to run on my gaming machine in either Windows 7 or live Linux at 100% CPU utilisation? Another thing that would be great but not necessary is if the end product can be one .exe that could be easily distributed and ran on Windows machines.

Thanks a lot!

P.S. Of course this program didn't really work with the Crays 8 specialist processors, and that is a whole other issue... If you know anything about optimising code to work on 90's Cray super computers give us a shout too!
- Mysticial over 12 years
  
  Have you tried multi-threading?
- Chriseyre2000 over 12 years
  
  Was your gaming PC multi-core?
- Edward Thomson over 12 years
  
  I can't believe there's not a unicos tag. ;)
- mikithskegg over 12 years
  
  It is a strange thing that this single thread program took 100% of CPU usage on DUAL CORE processor )))
- bag-man over 12 years
  
  My laptop is Dual Core, my PC is Hex Core. @mikithskegg well in top that process was at 100%, I take it that isn't right then?
- mikithskegg over 12 years
  
  Are you sure that your laptop is really dual core and not using something like hyperthread technology?
- Diego over 12 years
  
  In windows, open the task manager, go to processes, right click the process and set its priority to real-time. Does that increase the CPU usage?
- PeterT over 12 years
  
  how did you even measure the CPU usage that thing just prints out This machine calculated all 168 prime numbers under 1000 in 0.002 seconds on my aging laptop
- Keith Thompson over 12 years
  
  void main() should be int main(void). Cray's C compiler doesn't support C99; I'm surprised that it accepted mixed declarations and statements. You need #include <stdlib.h> to make the declaration of system() visible -- or you could just delete the system("clear"), which seems unnecessary. (Printing '\r' should go to the beginning of the current line.) Crays are optimized for floating-point performance; something that does integer calculations doesn't really exercise its strengths.
- Keith Thompson over 12 years
  
  If you want a single-threaded benchmark, consider something like (Whetstone)[en.wikipedia.org/wiki/Whetstone_(benchmark)].
- Gunther Piez over 12 years
  
  Am I the only one who doesn't find this question interesting at all? Come one, running a single threaded program on a n-core machine and asking why it uses 1/n of the the cpu is just... never mind, I just downvote :-)
- Mysticial over 12 years
  
  @drhirsch Well, the question shows research effort. I +1'ed for that - even if the OP is missing something fundamental about multi-core computing.
- Nir Alfasi over 12 years
  
  out of scope: you should check only i <= sqrt(num)
- Carl over 12 years
  
  @drhirsch There are a lot of uninteresting questions on the site. However, interesting or not is subjective. He may be missing the fundamentals and that isn't subjective. Like Mystical said, it does show research effort and isn't as easy to answer as it would appear.
- Damon almost 11 years
  
  Benchmarking a many-core machine using a single thread is not the only thing that makes the approach somewhat nonsensical, but also using a problem that is embarrassingly bad suited for the architecture. Cray computers drew much of their compute power from several (3 or 4 if I recall correctly) deeply pipelined vector processors per CPU core. The benchmarking code is strictly scalar. On a problem that is well-suited (and well parallelized), the Cray would likely perform anywhere from 50 to 100 times better.
- gnasher729 about 10 years
  
  Finding all primes up to 1,000 shouldn't even take measurable time. Whatever displays "x% CPU" probably doesn't notice it. And then there is a difference between how Windows and MacOS X display CPU usage: Windows displays 100% if all CPUs are fully used. MacOS X shows 100% if one CPU is fully used, so a single thread can easily display 100%, and on a quad core with multithreading the display will go close to 800%. Just a different display.
- Warren Dew almost 8 years
  
  @Damon Good point. As I recall, the Cray relied more on pipelining while the competing Cyber machines relied more on parallelization. Having a test each time through the loop will likely cause pipeline stalls and make the Cray run suboptimally unless you optimize specifically for the Cray's branch prediction.
bag-man over 12 years

Ah kind of like when you need to run Prime95, you have multiple instances of it... Surely there is a way for one process to use multiple cores? Like hash cracking programs do.
cha0site over 12 years

Well, one process could use threads to do multiprocessing, but I don't think that's what you meant since a thread is almost a separate process in this context. What we're really talking about here is "heads of execution", be they threads or processes. So, no, there isn't a way to make a single-threaded program run on multiple cores, you have to rewrite it. And sometimes it's really hard. And sometimes it's actually impossible.
bag-man over 12 years

Well I guess it won't be as hard as getting the program to work for the Cray as well. Considering I am pretty new to this (What gave me away :P) where would be a good place to start?
bag-man over 12 years

The issue with doing this is that how can I get a clear figure of the speed of each of the machine? Also the Cray has "vector processors" apparently, so it requires a load more work than this to get it to run properly
bag-man over 12 years

Wow that didn't take you long! I forgot to say the end value we will use is 1,000,000 1000 was for testing (Which as you saw was just silly!) I will test this out now, thanks a lot!
Mysticial over 12 years

Depending on your compiler, there's a compiler flag you have to enable to get OpenMP support.
cha0site over 12 years

@Owen: Well, UNICOS looks like it is somewhat similar to Unix (Wikipedia makes think so anyway), so it probably has fork(). You should go learn how to use that, I think.
mikithskegg over 12 years

And also one can use very useful function omp_get_wtime() to measure time.
cha0site over 12 years

@Owen: Oh, and also, running your program 8 times from the shell shouldn't be much harder than: for i in `seq 8`; do ./my_program; done
bag-man over 12 years

What did you use to compile this? I am using MingGW's gcc
cha0site over 12 years

I think this will be very hard to port to the Cray. Better stick to pthreads if you have to use threads, or better yet - fork.
Mysticial over 12 years

@OwenGarland I use MSVC with /openmp. With GCC, I think the flag is -fopenmp.
Mysticial over 12 years

@cha0site OpenMP is as simple as it's going to get. For these distributed machines, you'll need MPI of some sort. But this program isn't memory bound at all, so it'll probably run well on any (even emulated) shared-memory architecture.
PeterT over 12 years

If you are looking for benchmark data then this will generate This machine calculated all 78498 prime numbers under 1000000 in 173.203 seconds on an Intel i3 M330 @2.13GHz which is a mobile processor on the low-end from a few years ago.
Mysticial over 12 years

The tricky part with porting this to anything other than OpenMP is that the loop iterations aren't even. They get more expensive as num increases and whether or not it's divisible by small numbers. So you'll need to implement some sort of dynamic scheduling with whatever approach you use.
cha0site over 12 years

@Mysticial: But he's likely not to have OpenMP on that machine, and all he wants to do is peg the CPU - he doesn't actually care about the prime calculation side of things. You can do that easily enough with the primitives.
Mysticial over 12 years

@cha0site Yes, I mainly answered the question for the gaming machine. There are definitely more interesting ways to peg the CPU. One of the more notorious benchmarks I've done is my answer to this question - which overheated 2 of 4 machines I tested.
cha0site over 12 years

@Mysticial: Ah, very interesting indeed. However, that's probably still not what we want here - I think what the asker wants is POSIXy code which he can run both on his Linux machine and the Cray (which seems to run a POSIXy OS) and do some very rough comparisons. It's a bit late for me to write code, but I'll try to come up with something tomorrow ^__^
bag-man over 12 years

@Mystical Offtopic: What hardware are you running? My Hex-Core AMD @ 3.2Ghz did it in 92 seconds...
cha0site over 12 years

@Owen: He has a Core i7 2600K... I'm jealous.
bag-man over 12 years

Amazing, my 1 and abit year old CPU that I thought was amazing, is three times as slow as one that is pretty much the same price I paid for it!
Mysticial over 12 years

Core i7 920 @ 3.5GHz (overclocked) I also have a Core i7 2600K, but it's not the machine I usually do work on. I use it mainly for long running code.
Mysticial over 12 years

@OwenGarland As for why it's 3x slower. This particular benchmark isn't very representative of real-life performance. All it does are integer divisions. I just checked Agner Fog's tables, and it looks like AMD's integer division is a lot slower than Intel's. So that probably explains it.
bag-man over 12 years

@Mystical This has been very interesting, thanks a lot for your help! Under Windows the application needs to dll's to run by the way; I have no idea how you combine those into one exe :/
Mateen Ulhaq over 12 years

Augh! Too... much... pink!
cha0site over 12 years

@OwenGarland: I've added an example.
Mysticial over 12 years

Oooh! +1'ed now that you have the example. :)
Carl over 12 years

Don't know. Probably differences in scheduling processes.
Mohammad Fadin over 12 years

You guys been saying, "In Parallel", what does it mean? could someone enlighten me about this.
Mysticial over 12 years

@MohammadFadin en.wikipedia.org/wiki/Parallel_computing Basically, you need to be able to process multiple tasks in parallel to be able to utilize a multi-core computer.
Zameer Ansari over 8 years

@Mysticial Can you solve this confusion please
Numeron over 4 years

It doesn't matter that the algorithm is inefficient because the goal isn't to actually calculate the primes, it's to perform a generically difficult task and see how much better or worse it is at it than a modern desktop. An efficient algorithm would just make that comparison harder, and might even ruin the results if it's so good it purposely takes advantage of modern CPU features/quirks.