OpenMP threads executing on the same cpu core


Solution 1

After some experimentation I found that the problem was that I was starting my program from inside the Eclipse IDE, which apparently set the CPU affinity to a single core. I thought I had seen the same problem when starting the program from outside the IDE, but a repeated test showed that it works just fine when started from the terminal instead.
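
One way to confirm this kind of diagnosis is to ask the kernel which CPUs the process is allowed to run on. The following is a minimal, Linux-only sketch using glibc's `sched_getaffinity`; the helper name `allowed_cpus` is invented for this illustration and is not from the original post.

```cpp
// Linux/glibc only. g++ defines _GNU_SOURCE by default, which <sched.h>
// needs in order to expose the CPU_* macros and sched_getaffinity.
#include <sched.h>
#include <vector>

// Hypothetical helper: returns the IDs of the CPUs the caller may be
// scheduled on, or an empty vector if the query fails.
std::vector<int> allowed_cpus() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    std::vector<int> cpus;
    // A pid of 0 refers to the calling thread.
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) return cpus;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &mask)) cpus.push_back(cpu);
    return cpus;
}
```

If printing the result at the top of `main` shows a single CPU when the program is launched from the IDE but all four when launched from a terminal, the launcher most likely restricted the affinity mask, since child processes inherit it.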

Solution 2

I compiled your program using g++ 4.6 on Linux:

g++ --std=c++0x -fopenmp test.cc -o test

The output was, unsurprisingly:

Thread 2 on cpu 2

Thread 3 on cpu 1
910270973
Thread 1 on cpu 3
910270973
Thread 0 on cpu 0
910270973910270973

The fact that 4 threads are started (if you have not set the number of threads in any way, e.g. using OMP_NUM_THREADS) should imply that the program is able to see 4 usable CPUs. I cannot guess why it is not using them but I suspect a problem in your hardware/software setting, in some environment variable, or in the compiler options.
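
To check what the OpenMP runtime itself reports about the machine, one could query its processor count and default thread count. This is a minimal sketch and not part of the original answer: `omp_get_num_procs` and `omp_get_max_threads` are standard OpenMP routines, while the `RuntimeView` struct and the `query_runtime` helper are names made up here. The `_OPENMP` guard lets the file compile without `-fopenmp`, in which case both values fall back to 1.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical container for what the OpenMP runtime reports.
struct RuntimeView {
    int procs;        // processors available to the program
    int max_threads;  // threads a new parallel region would use by default
};

RuntimeView query_runtime() {
#ifdef _OPENMP
    return {omp_get_num_procs(), omp_get_max_threads()};
#else
    // Built without -fopenmp: the pragmas are ignored and the code runs serially.
    return {1, 1};
#endif
}
```

If `max_threads` comes back as 4 and the threads still land on one core, the restriction is probably imposed outside OpenMP, for example by the environment that launched the process.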

Author by

Grizzly

Updated on June 22, 2022

Comments

  • Grizzly
    Grizzly almost 2 years

    I'm currently parallelizing a program using OpenMP on a 4-core Phenom II. However, I noticed that my parallelization does not do anything for the performance. Naturally I assumed I had missed something (false sharing, serialization through locks, ...), but I was unable to find anything like that. Furthermore, judging from the CPU utilization, it seemed like the program was executing on only one core. From what I found, sched_getcpu() should give me the ID of the core on which the calling thread is currently scheduled. So I wrote the following test program:

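    #include <iostream>
    #include <sstream>
    #include <random>
    #include <omp.h>
    #include <sched.h>  // declares sched_getcpu() on glibc
    int main(){
        #pragma omp parallel
        {
            std::default_random_engine rand;
            int num = 0;
            // The iterations are divided among the threads of the enclosing parallel region.
            #pragma omp for
            for(size_t i = 0; i < 1000000000; ++i) num += rand();
            // Core this thread is running on at the moment of the call.
            auto cpu = sched_getcpu();
            // Build the line in a private stream so threads do not interleave mid-line.
            std::ostringstream os;
            os<<"\nThread "<<omp_get_thread_num()<<" on cpu "<<cpu<<std::endl;
            std::cout<<os.str()<<std::flush;
            std::cout<<num;
        }
    }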
    

    On my machine this gives the following output (the random numbers will vary, of course):

    Thread 2 on cpu 0 num 127392776
    Thread 0 on cpu 0 num 1980891664
    Thread 3 on cpu 0 num 431821313
    Thread 1 on cpu 0 num -1976497224
    

    From this I assume that all threads execute on the same core (the one with ID 0). To be more certain I also tried the approach from this answer. The results were the same. Additionally, using #pragma omp parallel num_threads(1) didn't make the execution slower (slightly faster, in fact), lending credibility to the theory that all threads use the same CPU. However, the fact that the CPU is always displayed as 0 makes me somewhat suspicious. I also checked GOMP_CPU_AFFINITY, which was initially not set, so I tried setting it to 0 1 2 3, which, from what I understand, should bind each thread to a different core. However, that didn't make a difference.

    Since I develop on a Windows system, I use Linux in VirtualBox for my development. So I thought that maybe the virtual system couldn't access all cores. However, checking the VirtualBox settings showed that the virtual machine should get all 4 cores, and executing my test program 4 times at the same time seems to use all 4 cores, judging from the CPU utilization (and the fact that the system became very unresponsive).

    So my question is basically: what exactly is going on here? More to the point: is my deduction that all threads use the same core correct? If it is, what could be the reasons for that behavior?

  • Grizzly
    Grizzly about 12 years
    Why would I use #pragma omp parallel for if I want the threads to do things outside the loop (like writing their ID to the output)? And as I mentioned, it does create 4 threads by default; they just seem to be executed on the same core.
  • Grizzly
    Grizzly about 12 years
    The vbox does have the correct affinity to use all cores (I checked, and besides, how else would it use all of them in my test with multiple starts of my test program?). Since I use Linux inside the vbox, that doesn't really help there.
  • Nav
    Nav about 12 years
    That's true too. By the way, if you don't say omp parallel for, then no parallelization happens in the loop. But of course you're inside a parallel section, so.... The only other possible explanation I can think of is a lack of hardware support for your VirtualBox. Have you tried with other CPUs? superuser.com/questions/33723/…
  • Grizzly
    Grizzly about 12 years
    I did not. However, as mentioned, it is possible to use all cores from the vbox, so lack of support does seem unlikely.
  • Y00
    Y00 over 3 years
    These can be set via variables like these: web.archive.org/web/20220114064748/https://…