Poor performance in multi-threaded C++ program


Solution 1

If you have plenty of CPU cores and plenty of work to do, it should not take longer to run multithreaded than single-threaded - the actual CPU time may be a fraction longer, but the "wall-clock time" should be shorter. I'm pretty sure that your code has some sort of bottleneck where one thread is blocking the other.

This is because of one or more of these things - I'll list them first, then go into detail below:

  1. Some lock in a thread is blocking the second thread from running.
  2. Sharing of data between threads (either true or "false" sharing)
  3. Cache thrashing.
  4. Competition for some external resource causing thrashing and/or blocking.
  5. Badly designed code in general...

Some lock in a thread is blocking the second thread from running.

If one thread takes a lock and another thread wants to use the resource protected by that lock, the second thread has to wait, which obviously means it isn't doing anything useful. Keep locking to a minimum by holding each lock only for a short period. Some instrumentation can identify whether locks are holding your code up, such as:

while (!tryLock(some_lock))                       // spin until the lock is free
{
    tried_locking_failed[lock_id][thread_id]++;   // count failed attempts per lock, per thread
}
total_locks[lock_id]++;                           // count total acquisitions

Printing some stats for the locks would help identify where the locking is contentious - or you can try the old trick of "press break in the debugger and see where you are" - if a thread is constantly waiting for some lock, then that's what's preventing progress...
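As a minimal C++11 sketch of the same idea using std::mutex (the counter names here are illustrative, not from the original code):

#include <atomic>
#include <mutex>

std::mutex some_lock;
std::atomic<long> failed_attempts{0};     // incremented while another thread holds the lock
std::atomic<long> total_acquisitions{0};

void do_work_under_lock() {
    while (!some_lock.try_lock()) {
        ++failed_attempts;                // contention: the lock was busy
    }
    ++total_acquisitions;
    // ... touch the shared resource as briefly as possible ...
    some_lock.unlock();
}

If failed_attempts grows much faster than total_acquisitions, that lock is where your threads are stalling.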

Sharing of data between threads (either true or "false" sharing)

If two threads use [and frequently update the value of] the same variable, then the two threads will have to swap "I've updated this" messages, and each CPU has to fetch the data from the other CPU before it can continue using the variable. Since data is shared at a per-cache-line level, and a cache line is typically 32 bytes, something like:

int var[NUM_THREADS]; 
...
var[thread_id]++; 

would qualify as something called "false sharing" - the ACTUAL data updated is unique per CPU, but since the data is within the same 32-byte region, the cores will still have updated the same area of memory.
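One common fix is to pad each thread's counter out to its own cache line. A minimal sketch, assuming a 64-byte line (see the comment at the bottom of the page) and an illustrative thread count:

constexpr int NUM_THREADS = 4;   // illustrative

// Each counter occupies a full (assumed 64-byte) cache line, so updates
// from different cores never touch the same line.
struct alignas(64) PaddedCounter {
    int value = 0;
};

PaddedCounter var[NUM_THREADS];

void count_event(int thread_id) {
    ++var[thread_id].value;   // no false sharing between threads
}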

Cache thrashing.

If two threads do a lot of memory reading and writing, the cache of the CPU may be constantly throwing away good data to fill it with data for the other thread. There are some techniques available to ensure that two threads don't run in "lockstep" on which part of cache the CPU uses. If the data is 2^n (power of two) and fairly large (a multiple of the cache-size), it's a good idea to "add an offset" for each thread - for example 1KB or 2KB. That way, when the second thread reads the same distance into the data region, it will not overwrite exactly the same area of cache that the first thread is currently using.
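A hedged sketch of that offset idea (the sizes and the helper name are made up for illustration):

#include <cstddef>
#include <cstdlib>

constexpr std::size_t kRegionSize = 1 << 20;  // 1 MB per thread, a power of two
constexpr std::size_t kSkew       = 1024;     // extra 1 KB of offset per thread

// Start each thread's region kSkew bytes further in than the previous
// thread's, so equal progress through the data maps to different cache sets.
char *thread_region(char *base, int thread_id) {
    return base + thread_id * (kRegionSize + kSkew);
}

int main() {
    const int num_threads = 2;
    char *base = static_cast<char *>(
        std::malloc(num_threads * (kRegionSize + kSkew)));
    // ... hand thread_region(base, t) to thread t ...
    std::free(base);
}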

Competition for some external resource causing thrashing and/or blocking.

If two threads are reading from or writing to the hard disk, network card, or some other shared resource, one thread can block the other, which in turn means lower performance. It is also possible that the code managing the resource detects access from different threads and does extra flushing to ensure that data is written in the correct order (or similar) before it starts work for the other thread.

It is also possible that there are internal locks in the code that deals with the resource (a user-mode library or kernel-mode driver) that block when more than one thread uses the same resource.
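If the contested resource is the disk, batching each thread's output into large buffers reduces the contention. A sketch (the 1 MB buffer size is an assumption):

#include <fstream>
#include <vector>

// Give each thread its own stream with a large buffer so results reach
// the disk in big sequential chunks instead of many tiny writes.
void write_results(const char *path, const std::vector<double> &results) {
    std::vector<char> buf(1 << 20);                  // 1 MB stream buffer
    std::ofstream out;
    out.rdbuf()->pubsetbuf(buf.data(), buf.size());  // must be called before open()
    out.open(path, std::ios::binary);
    for (double r : results)
        out.write(reinterpret_cast<const char *>(&r), sizeof r);
}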

Generally bad design

This is a "catchall" for "lots of other things that can be wrong". If the result from a calculation in one thread is needed for the other thread to progress, obviously not a lot of work can be done in that other thread.

Too small a work-unit, so all the time is spent starting and stopping the thread, and not enough is spent doing work. Say, for example, that you dole out small numbers to each thread, one at a time, to "calculate if this is a prime": it will probably take a lot longer to give the number to the thread than to calculate whether it is actually a prime number. The solution is to give a set of numbers (perhaps 10, 20, 32, 64 or so) to each thread, and have it report back the results for the whole lot in one go.
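A sketch of that batching idea with the prime example (all names here are illustrative):

#include <cstddef>
#include <vector>

bool is_prime(long n) {
    if (n < 2) return false;
    for (long d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Each worker gets a whole batch and reports all results in one go,
// instead of being handed one number at a time.
std::vector<bool> check_batch(const std::vector<long> &batch) {
    std::vector<bool> results(batch.size());
    for (std::size_t i = 0; i < batch.size(); ++i)
        results[i] = is_prime(batch[i]);
    return results;
}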

There are plenty of other kinds of "bad design". Without understanding your code, it's quite hard to say for sure.

It is entirely possible that your problem is none of the ones I've mentioned here, but most likely it is one of them. Hopefully this answer helps identify the cause.

Solution 2

Read CPU Caches and Why You Care to understand why a naive port of an algorithm from one thread to multiple threads will more often than not result in greatly reduced performance and negative scalability. Algorithms that are specifically designed for parallelism take care of overactive interlocked operations, false sharing and other causes of cache pollution.
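For instance, one such cause is funnelling every update through a shared interlocked counter. A hedged sketch of the contrast (the per-element predicate is arbitrary):

#include <atomic>
#include <thread>
#include <vector>

// Slow pattern: every increment of this is an interlocked operation
// bouncing one cache line between cores.
std::atomic<long> shared_count{0};

// Faster pattern: each thread accumulates locally and merges once at the end.
long count_parallel(long n, int num_threads) {
    std::vector<std::thread> workers;
    std::vector<long> partial(num_threads, 0);
    for (int t = 0; t < num_threads; ++t)
        workers.emplace_back([&partial, t, n, num_threads] {
            long local = 0;
            for (long i = t; i < n; i += num_threads)
                local += (i % 3 == 0);   // arbitrary per-element work
            partial[t] = local;          // a single write per thread
        });
    for (auto &w : workers) w.join();
    long total = 0;
    for (long p : partial) total += p;
    return total;
}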

Solution 3

Here are a few things you might want to look into.

1°) Do you enter any critical sections (locks, semaphores, etc.) shared between your worker thread and your main thread? (This should be the case if your queries modify the graph.) If so, that could be one of the sources of the multithreading overhead: threads competing for a lock usually degrade performance.

2°) You're using a 24-core machine, which I assume is NUMA (Non-Uniform Memory Access). Since you set thread affinities during your tests, you should pay close attention to the memory topology of your hardware. Looking at the files in /sys/devices/system/cpu/cpuX/ can help you with that (beware that cpu0 and cpu1 aren't necessarily close together, and thus do not necessarily share memory). Threads that use memory heavily should use local memory (allocated on the same NUMA node as the core they're executing on).
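If libnuma is available, memory can be placed explicitly. A minimal sketch, assuming the allocation happens on the thread that will use the data (link with -lnuma; the helper name is illustrative):

#include <cstddef>
#include <cstdlib>
#include <numa.h>   // libnuma

// Allocate on the NUMA node of the CPU the calling thread runs on,
// falling back to plain malloc when NUMA isn't available.
void *alloc_local(std::size_t bytes) {
    if (numa_available() < 0)
        return std::malloc(bytes);
    return numa_alloc_local(bytes);   // free with numa_free(ptr, bytes)
}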

3°) You are heavily using disk I/O. What kind of I/O is it? If every thread performs synchronous I/O every time, you might want to consider asynchronous system calls, so that the OS stays in charge of scheduling those requests to the disk.
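A hedged sketch with POSIX AIO (the struct fields and functions are the real <aio.h> ones; the wrapper itself is illustrative; older glibc needs -lrt):

#include <aio.h>
#include <cstring>

// Queue a write and return immediately; the kernel schedules the disk
// request while the worker thread keeps computing.
bool queue_write(int fd, const char *buf, size_t len, off_t offset, aiocb &cb) {
    std::memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = const_cast<char *>(buf);
    cb.aio_nbytes = len;
    cb.aio_offset = offset;
    return aio_write(&cb) == 0;   // later: poll aio_error(&cb), then aio_return(&cb)
}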

4°) Some cache issues have already been mentioned in other answers. From experience, false sharing can hurt performance as much as what you're observing. My last recommendation (which should have been my first) is to use a profiler tool, such as Linux perf or OProfile. With the kind of performance degradation you're experiencing, the cause will almost certainly appear quite clearly.

Solution 4

The other answers have all addressed the general guidelines that can cause your symptoms. I will give my own, hopefully not excessively redundant version. Then I will talk a bit about how you can get to the bottom of the problem with everything discussed in mind.

In general, there are a few reasons you'd expect multiple threads to perform better:

  • A piece of work is dependent on some resources (disk, memory, cache, etc.) while other pieces can proceed independently of these resources or said workload.
  • You have multiple CPU cores that can process your workload in parallel.

The main reasons you'd expect multiple threads to perform less well, most of which were already covered above, are all based on resource contention:

  • Disk contention: already explained in detail and can be a possible issue, especially if you are writing small buffers at a time instead of batching
  • CPU time contention if the threads are scheduled onto the same core: probably not your issue if you're setting affinity. However, you should still double check
  • Cache thrashing: similarly probably not your problem if you have affinity, though this can be very expensive if it is your problem.
  • Shared memory: again talked about in detail and doesn't seem to be your issue, but it wouldn't hurt to audit the code to check it out.
  • NUMA: again talked about. If your worker thread is pinned to a different core, you will want to check whether the work it needs to access is local to the main core.

Ok, so far not much new. It can be any or none of the above. The question is, for your case, how you can detect where the extra time is coming from. There are a few strategies:

  • Audit the code and look for obvious areas. Don't spend too much time doing this as it's generally unfruitful if you wrote the program to begin with.
  • Refactor the single-threaded code and the multi-threaded code to isolate one process() function, then profile at key checkpoints to try to account for the difference (see the timing sketch after this list). Then narrow it down.
  • Refactor the resource access into batches, then profile each batch on both the control and the experiment to account for the difference. Not only will this tell you which areas (disk access vs memory access vs spending time in some tight loop) you need to focus your efforts on, doing this refactor might even improve your running time overall. Example:
    • First copy the graph structure to thread-local memory (perform a straight-up copy in the single-threaded case)
    • Then perform the query
    • Then setup an asynchronous write to disk
  • Try to find a minimally reproducible workload with the same symptoms. This means changing your algorithm to do a subset of what it already does.
  • Make sure there's no other noise in the system that could've caused the difference (e.g. some other user running a similar workload on the cores you're measuring).
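As a sketch of the checkpoint idea mentioned above (the phase names in the usage comment are illustrative):

#include <chrono>
#include <cstdio>

// Wrap the phases of process_job() so the single-threaded and threaded
// runs can be compared phase by phase.
struct Checkpoint {
    std::chrono::steady_clock::time_point last = std::chrono::steady_clock::now();
    void mark(const char *phase) {
        auto now = std::chrono::steady_clock::now();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(now - last).count();
        std::fprintf(stderr, "%s: %lld ms\n", phase, static_cast<long long>(ms));
        last = now;
    }
};

// Usage inside process_job():
//   Checkpoint cp;
//   read_input();   cp.mark("read");
//   run_queries();  cp.mark("query");
//   write_output(); cp.mark("write");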

My own intuition for your case:

  • Your graph structure is not NUMA friendly for your worker core.
  • The kernel can actually schedule your worker thread off the affinity core. This can happen if you don't have isolcpus on for the core you're pinning to. A quick way to check is sketched below.
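A sketch of that check (sched_getcpu is a GNU extension; the function name here is illustrative):

#include <sched.h>    // sched_getcpu; may require -D_GNU_SOURCE
#include <cstdio>

// Call this periodically from the worker to confirm the affinity request
// is actually being honored by the scheduler.
void check_core(int expected_core) {
    int cpu = sched_getcpu();
    if (cpu != expected_core)
        std::fprintf(stderr, "worker migrated: on core %d, expected %d\n",
                     cpu, expected_core);
}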

Solution 5

I can't tell you what's wrong with your program because you haven't shared enough of it to do a detailed analysis.

What I can tell you is that if this were my problem, the first thing I would try is to run two profiler sessions on my application: one on the single-threaded version and another on the dual-thread configuration. The profiler report should give you a pretty good idea of where the extra time is going. Note that you may not need to profile the entire application run; depending on the problem, the time difference may become obvious after you profile for a few seconds or minutes.

As far as profiler choices for Linux go, you may want to consider OProfile or, as a second choice, gprof.

If you find you need help interpreting the profiler output feel free to add that to your question.

Author: Aaron

Updated on June 17, 2022

Comments

  • Aaron, about 2 years ago:

    I have a C++ program running on Linux in which a new thread is created to do some computationally expensive work independent of the main thread (The computational work completes by writing the results to files, which end up being very large). However, I'm getting relatively poor performance.

    If I implement the program straightforward (without introducing other threads), it completes the task in roughly 2 hours. With the multi-threaded program it takes around 12 hours to do the same task (this was tested with only one thread spawned).

    I've tried a couple of things, including pthread_setaffinity_np to set the thread to a single CPU (out of the 24 available on the server I'm using), as well as pthread_setschedparam to set the scheduling policy (I've only tried SCHED_BATCH). But the effects of these have so far been negligible.

    Are there any general causes for this kind of problem?

    EDIT: I've added some example code that I'm using, which hopefully covers the most relevant parts. The function process_job() is what actually does the computational work, but it would be too much to include here. Basically, it reads in two files of data, and uses these to perform queries on an in-memory graph database, and the results are written to two large files over a period of hours.

    EDIT part 2: Just to clarify, the problem is not that I want to use threads to increase the performance of an algorithm I have. But rather, I want to run many instances of my algorithm simultaneously. Therefore, I expect the algorithm would run at a similar speed when put in a thread as it would if I didn't use multi-threads at all.

    EDIT part 3: Thanks for the suggestions all. I'm currently doing some unit tests (seeing which parts are slowing down) as some have suggested. As the program takes a while to load and execute, it is taking time to see any results from the tests and therefore I apologize for late responses. I think the main point I wanted to clarify is possible reasons why threading could cause a program to run slowly. From what I gather from the comments, it simply shouldn't be. I'll post when I can find a reasonable resolution, thanks again.

    (FINAL) EDIT part 4: It turns out that the problem was not related to threading after all. Describing it would be too cumbersome at this point (including the use of compiler optimization levels), but the ideas posted here were very useful and appreciated.

    #include <pthread.h>   // pthread_create, pthread_setaffinity_np, ...
    #include <sched.h>     // sched_get_priority_min, cpu_set_t
    
    struct sched_param sched_param = {
        sched_get_priority_min(SCHED_BATCH)   // the one valid priority for SCHED_BATCH
    };
    
    int set_thread_to_core(const pthread_t tid, const int core_id) {
       cpu_set_t mask;
       CPU_ZERO(&mask);
       CPU_SET(core_id, &mask);
       return pthread_setaffinity_np(tid, sizeof(mask), &mask);
    }
    
    void *worker_thread(void *arg) {
       job_data *temp = (job_data *)arg;  // get the information for the task passed in
       ...
       pthread_t tid = pthread_self();    // pthread_t rather than long, as the pthread_* calls expect
       int set_thread = set_thread_to_core(tid, slot_id);  // assume slot_id is 1 (it is in the test case I run)
       pthread_setschedparam(tid, SCHED_BATCH, &sched_param);
       int success = process_job(...);  // this is where all the work actually happens
       pthread_exit(NULL);
    }
    
    int main(int argc, char* argv[]) {
       ...
       pthread_t temp;
       pthread_create(&temp, NULL, worker_thread, (void *) &jobs[i]);  // jobs is a vector of a class type containing information for the task
       ...
       pthread_join(temp, NULL);  // wait for the worker; returning from main() would kill it
       return 0;
    }
    
  • Gumby The Green, almost 5 years ago:

    Note that cache lines are now generally 64 bytes, not 32.