How much overhead is there when creating a thread?


Solution 1

To resurrect this old thread, I just wrote some simple test code:

#include <thread>

int main(int argc, char** argv)
{
  // volatile prevents the compiler from optimizing away the loop counter at -O3
  for (volatile int i = 0; i < 500000; i++)
    std::thread([](){}).detach();  // empty thread; detach so it cleans up on its own
  return 0;
}

I compiled it with g++ test.cpp -std=c++11 -lpthread -O3 -o test. I then ran it three times in a row on an old (kernel 2.6.18), heavily loaded (doing a database rebuild), slow laptop (Intel Core i5-2540M). Results from three consecutive runs: 5.647s, 5.515s, and 5.561s. So we're looking at a tad over 10 microseconds per thread (about 5.56s / 500,000 ≈ 11 µs) on this machine, probably much less on yours.
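
For anyone wanting to reproduce the numbers directly, here is one way to add timing to the same loop; this is a minimal sketch using std::chrono (essentially what the commenter below describes doing), and the reporting details are an addition, not part of the original test:

#include <chrono>
#include <iostream>
#include <thread>

int main()
{
  auto start = std::chrono::high_resolution_clock::now();
  for (volatile int i = 0; i < 500000; i++)
    std::thread([](){}).detach();
  auto stop = std::chrono::high_resolution_clock::now();

  // Report the total wall-clock time and the average cost per thread.
  std::chrono::duration<double> elapsed = stop - start;
  std::cout << "total: " << elapsed.count() << " s, per thread: "
            << elapsed.count() / 500000 * 1e6 << " us\n";
  return 0;
}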

That's not much overhead at all, given that serial ports max out at around 1 bit per 10 microseconds (roughly 100 kbit/s). Now, of course there are various additional thread costs one can run into: copying passed/captured arguments (although function calls themselves can impose some of that), cache slowdowns between cores (if multiple threads on different cores are battling over the same memory at the same time), etc. But in general I highly doubt the use case you presented will adversely impact performance at all (and could provide benefits, depending), despite your having already preemptively labeled the concept "really terrible code" without even knowing how much time it takes to launch a thread.

Whether it's a good idea or not depends a lot on the details of your situation. What else is the calling thread responsible for? What precisely is involved in preparing and writing out the packets? How frequently are they written out (and with what sort of distribution: uniform, clustered, etc.?), and what's their structure like? How many cores does the system have? And so on. Depending on the details, the optimal solution could be anywhere from "no threads at all" to "shared thread pool" to "thread for each packet".

Note that thread pools aren't magic and can in some cases be a slowdown versus unique threads. One of the biggest slowdowns with threads is synchronizing cached memory shared between threads running at the same time, and a thread pool, by its very nature of having to look for and process work posted from a different thread, has to do this constantly. So either your primary thread or the pool's worker thread can get stuck waiting whenever the processor isn't sure whether the other thread has altered a section of memory. By contrast, in an ideal situation, a unique processing thread for a given task only has to share memory with its calling thread once (when it's launched) and then they never interfere with each other again.
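
To illustrate that last point, here is a minimal sketch of the "share once at launch" pattern; send_packet() and launch_sender() are hypothetical names, and the point is simply that the lambda captures its data by value, so the two threads share no memory after the launch:

#include <chrono>
#include <iostream>
#include <string>
#include <thread>

// Hypothetical stand-in for whatever packages and writes the packet.
void send_packet(const std::string& packet)
{
  std::cout << packet << '\n';
}

void launch_sender(const std::string& packet)
{
  // Capture by value: the new thread gets its own copy of the data,
  // so caller and worker never touch the same memory after launch.
  std::thread([packet]() { send_packet(packet); }).detach();
}

int main()
{
  launch_sender("hello");
  // Crude, for the sketch only: give the detached thread time to finish.
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  return 0;
}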

Solution 2

You definitely do not want to do this. Create a single thread or a pool of threads and just signal when messages are available. Upon receiving the signal, the thread can perform any necessary message processing.
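
For concreteness, here is one minimal sketch of that approach in C++11, using a queue guarded by a mutex and a condition variable for the signaling; MessageWorker, post(), and write_to_port() are hypothetical names, not anything specified in the question:

#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// Hypothetical stand-in for packaging a message and writing it to the port.
void write_to_port(const std::string&) { /* serial write would go here */ }

class MessageWorker {
public:
  MessageWorker() : worker_([this] { run(); }) {}

  ~MessageWorker() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      done_ = true;
    }
    cv_.notify_one();
    worker_.join();
  }

  // Called from the main thread: enqueue a message and signal the worker.
  void post(std::string msg) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push(std::move(msg));
    }
    cv_.notify_one();
  }

private:
  void run() {
    std::unique_lock<std::mutex> lock(mutex_);
    for (;;) {
      // Sleep until there is a message to process or we're shutting down.
      cv_.wait(lock, [this] { return done_ || !queue_.empty(); });
      while (!queue_.empty()) {
        std::string msg = std::move(queue_.front());
        queue_.pop();
        lock.unlock();           // don't hold the lock during the I/O
        write_to_port(msg);
        lock.lock();
      }
      if (done_) return;         // queue drained, shut down cleanly
    }
  }

  std::queue<std::string> queue_;
  std::mutex mutex_;
  std::condition_variable cv_;
  bool done_ = false;
  std::thread worker_;           // declared last so the other members exist first
};

int main()
{
  MessageWorker worker;
  worker.post("packet 1");
  worker.post("packet 2");
}  // destructor drains the queue and joins the worker

With this shape, the worker sleeps in cv_.wait() until post() signals it, so no threads are created or destroyed per message; the only per-message cost is the lock/signal handshake.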

In terms of overhead, thread creation/destruction, especially on Windows, is fairly expensive: somewhere on the order of tens of microseconds. It should, for the most part, only be done at the start/end of an app, with the possible exception of dynamically resized thread pools.

Author: jdt141

Updated on June 28, 2021

Comments

  • jdt141 almost 3 years ago

    I just reviewed some really terrible code - code that sends messages on a serial port by creating a new thread to package and assemble each message, for every single message sent. Yes, for every message a pthread is created, bits are properly set up, then the thread terminates. I haven't a clue why anyone would do such a thing, but it raises the question - how much overhead is there when actually creating a thread?

  • ruslik over 13 years ago
    Yes, an "eternal" dedicated worker thread would also solve the possible MT problems.
  • user2284570 over 8 years ago
    @MichaelGoldshteyn: have you got an idea of how to do this in Python?
  • Mark A. Ropper over 5 years ago
    Tangent: ran across this thread; as a Windows user I was curious how my system fared. Compiling under MSVC with standard release optimisations, running on a 6700K, it took 31.442s to run fully. The only alterations I made were to add a std::chrono::high_resolution_clock + time_points before and after the loop and std::cout the result before exiting. Rather shocking results. I tried mingw-w64's 7.1.0 g++ with your exact command line arguments, but it crashes after a few seconds, so no idea what's wrong there; same with a clang++ v8.0 I had lying around.
    Tangent: ran across this thread, as a Windows user I was curious how my system fared. Compiling under msvc with standard release optimisations, running on a 6700k it took 31.442s to run fully. The only alterations I made were to add a std::chrono::high_resolution_clock + time_points before and after the loop and std::cout the result before exiting. Rather shocking results. I tried mingw-w64's 7.1.0 g++ with your your exact command line arguments but it crashes after a few seconds so no idea what's wrong there, same with a clang++ v8.0 I had lying around.