Does async(launch::async) in C++11 make thread pools obsolete for avoiding expensive thread creation?

Question 1:

I changed this from the original because the original was wrong. I was under the impression that Linux thread creation was very cheap, but after testing I determined that the overhead of a function call in a new thread vs. a normal one is enormous. Creating a thread to handle a function call is something like 10000 or more times slower than a plain function call. So, if you're issuing a lot of small function calls, a thread pool might be a good idea.

It's quite apparent that the standard C++ library that ships with g++ doesn't have thread pools. But I can definitely see a case for them. Even with the overhead of having to shove the call through some kind of inter-thread queue, it would likely be cheaper than starting up a new thread. And the standard allows this.

IMHO, the Linux kernel people should work on making thread creation cheaper than it currently is. But the standard C++ library should also consider using a thread pool to implement launch::async | launch::deferred.

And the OP is correct: using ::std::thread to launch a thread of course forces the creation of a new thread instead of using one from a pool. So ::std::async(::std::launch::async, ...) is preferred.

Question 2:

Yes, basically this 'implicitly' launches a thread. But really, it's still quite obvious what's happening, so I don't think 'implicitly' is a particularly good word for it.

I'm also not convinced that forcing you to wait for a return before destruction is necessarily an error. I don't know that you should be using the async call to create 'daemon' threads that aren't expected to return. And if they are expected to return, it's not OK to be ignoring exceptions.

Question 3:

Personally, I like thread launches to be explicit. I place a lot of value on islands where you can guarantee serial access. Otherwise you end up with mutable state that you always have to be wrapping a mutex around somewhere and remembering to use it.

I liked the work queue model a whole lot better than the 'future' model because there are 'islands of serial' lying around so you can more effectively handle mutable state.

But really, it depends on exactly what you're doing.

Performance Test

So, I tested the performance of various methods of calling things and came up with these numbers on an 8 core (AMD Ryzen 7 2700X) system running Fedora 29 compiled with clang version 7.0.1 and libc++ (not libstdc++):

   Do nothing calls per second:   35365257                                      
        Empty calls per second:   35210682                                      
   New thread calls per second:      62356                                      
 Async launch calls per second:      68869                                      
Worker thread calls per second:     970415                                      

And native, on my MacBook Pro 15" (Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz) with Apple LLVM version 10.0.0 (clang-1000.10.44.4) under OSX 10.13.6, I get this:

   Do nothing calls per second:   22078079
        Empty calls per second:   21847547
   New thread calls per second:      43326
 Async launch calls per second:      58684
Worker thread calls per second:    2053775

For the worker thread, I started up a thread, then used a lockless queue to send requests to another thread and then wait for a "It's done" reply to be sent back.
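
A simplified version of that worker-thread scheme, using a mutex and condition variable instead of the lockless queue from the actual test (the names are my own, and it assumes a single caller at a time), looks like this:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// One long-lived worker receives a request, runs it, and signals completion;
// call() blocks until the "it's done" reply comes back.
class Worker {
public:
    Worker() : thread_([this] { run(); }) {}

    ~Worker() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stop_ = true;
        }
        cv_.notify_all();
        thread_.join();
    }

    // Send a request to the worker and wait for it to finish.
    void call(std::function<void()> f) {
        std::unique_lock<std::mutex> lock(m_);
        task_ = std::move(f);
        cv_.notify_all();
        cv_.wait(lock, [this] { return !task_; });  // wait for the reply
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(m_);
        for (;;) {
            cv_.wait(lock, [this] { return stop_ || task_; });
            if (stop_) return;
            task_();            // execute the request
            task_ = nullptr;
            cv_.notify_all();   // send the "it's done" reply
        }
    }

    std::thread thread_;
    std::mutex m_;
    std::condition_variable cv_;
    std::function<void()> task_;
    bool stop_ = false;
};
```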

The "Do nothing" is just to test the overhead of the test harness.

It's clear that the overhead of launching a thread is enormous. And even the worker thread with the inter-thread queue slows things down by a factor of 20 or so on Fedora 25 in a VM, and by about 8 on native OS X.

I created an OSDN chamber holding the code I used for the performance test. It can be found here: https://osdn.net/users/omnifarious/pf/launch_thread_performance/

Author: Philipp Claßen (member of the Ghostery engineering team)

Updated on January 11, 2022

Comments

  • Philipp Claßen
    Philipp Claßen over 2 years

    It is loosely related to this question: Are std::thread pooled in C++11?. Though the question differs, the intention is the same:

    Question 1: Does it still make sense to use your own (or 3rd-party library) thread pools to avoid expensive thread creation?

    The conclusion in the other question was that you cannot rely on std::thread to be pooled (it might be, or it might not). However, std::async(launch::async) seems to have a much higher chance to be pooled.

    I don't think that it is forced by the standard, but IMHO I would expect all good C++11 implementations to use thread pooling if thread creation is slow. Only on platforms where it is inexpensive to create a new thread would I expect them to always spawn a new thread.

    Question 2: This is just what I think; I have no facts to prove it and may very well be mistaken. It is only an educated guess.

    Finally, here I have provided some sample code that first shows how I think thread creation can be expressed by async(launch::async):

    Example 1:

     thread t([]{ f(); });
     // ...
     t.join();
    

    becomes

     auto future = async(launch::async, []{ f(); });
     // ...
     future.wait();
    

    Example 2: Fire and forget thread

     thread([]{ f(); }).detach();
    

    becomes

     // a bit clumsy...
     auto dummy = async(launch::async, []{ f(); });
    
     // ... but I hope soon it can be simplified to
     async(launch::async, []{ f(); });
    

    Question 3: Would you prefer the async versions to the thread versions?


    The rest is no longer part of the question, but only for clarification:

    Why must the return value be assigned to a dummy variable?

    Unfortunately, the current C++11 standard forces you to capture the return value of std::async; otherwise the destructor is executed, and it blocks until the action terminates. Some consider this an error in the standard (e.g., Herb Sutter).

    This example from cppreference.com illustrates it nicely:

    {
      std::async(std::launch::async, []{ f(); });
      std::async(std::launch::async, []{ g(); });  // does not run until f() completes
    }
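
A sketch of the workaround under the current rules: give each future a name, so both tasks are started before either blocking destructor runs. This is my own example, with 100 ms sleeps standing in for f() and g(); the returned elapsed time only demonstrates that the two calls overlap:

```cpp
#include <chrono>
#include <future>
#include <thread>

// Naming each future keeps both tasks alive in the scope: the blocking
// destructors only run at the closing brace, after both tasks have started,
// so the two calls overlap instead of running back to back.
double run_both_concurrently() {
    auto sleep_100ms = [] {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    };
    auto start = std::chrono::steady_clock::now();
    {
        auto f1 = std::async(std::launch::async, sleep_100ms);
        auto f2 = std::async(std::launch::async, sleep_100ms);
    }   // both destructors wait here; total is ~100 ms, not ~200 ms
    return std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
}
```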
    

    Another clarification:

    I know that thread pools may have other legitimate uses but in this question I am only interested in the aspect of avoiding expensive thread creation costs.

    I think there are still situations where thread pools are very useful, especially if you need more control over resources. For example, a server might decide to handle only a fixed number of requests simultaneously to guarantee fast response times and to increase the predictability of memory usage. Thread pools should be fine here.

    Thread-local variables may also be an argument for your own thread pools, but I'm not sure whether it is relevant in practice:

    • A new thread created with std::thread starts with freshly initialized thread-local variables; values set in other threads do not carry over. Maybe this is not what you want.
    • In threads spawned by async, it is somewhat unclear to me, because the thread could have been reused. From my understanding, thread-local variables are not guaranteed to be reset, but I may be mistaken.
    • Using your own (fixed-size) thread pools, on the other hand, gives you full control if you really need it.
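
    To make the first bullet concrete, here is a small sketch (my own example) showing that each new std::thread sees a freshly initialized thread_local:

```cpp
#include <cassert>
#include <thread>

// Each new std::thread gets its own freshly initialized copy of a
// thread_local, regardless of what earlier threads did with it. A pooled
// thread that is reused would instead see whatever value the previous
// task left behind.
thread_local int tls_counter = 0;

int main() {
    int first = 0, second = 0;
    std::thread([&] { first = ++tls_counter; }).join();
    std::thread([&] { second = ++tls_counter; }).join();
    assert(first == 1 && second == 1);  // fresh copy in each new thread
    assert(tls_counter == 0);           // main's own copy is untouched
}
```
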
  • Matthieu M.
    Matthieu M. over 11 years
    I concur on the work-queue model, however this requires having a "pipeline" model which may not be applicable to every use of concurrent access.
  • Omnifarious
    Omnifarious over 11 years
    @MatthieuM.: I've been working on a library to sort of combine them. You can put something on a work-queue for another thread that results in something being queued up for the original thread's work-queue when it's finished. That sort of looks like a future.
  • Matthieu M.
    Matthieu M. over 11 years
    I see this more as an asynchronous acknowledgement. Looks nice indeed, hope you'll publicize it.
  • Omnifarious
    Omnifarious over 11 years
    @MatthieuM.: I want to implement auto-wrapping for functions so you can use it to write whole expressions whose evaluation is delayed until all the results they need are available. That's what's been blocking me from having something I consider release-ready, and why I've been asking all those weird complicated template questions.
  • Matthieu M.
    Matthieu M. over 11 years
    Looks to me like expression templates (for operators) could be used to compose the results, for function calls you would need a call method I guess but because of overloads it might be slightly more difficult.
  • Omnifarious
    Omnifarious over 11 years
  • Jeff Hammond
    Jeff Hammond over 9 years
    "very cheap" is relative to your experience. I find Linux thread creation overhead to be substantial for my usage.
  • Ivan
    Ivan about 9 years
    I have a public domain one that uses state-of-the-art C++11, and it adds about 50 microseconds of overhead from pool.enque to the actual work being done, which for me is waaay too much. Is there anything in the 5-10 microsecond range?
  • Abhinav Gauniyal
    Abhinav Gauniyal over 7 years
    "I know that in Linux thread creation has been made very cheap" - can I get some resource to read about this?
  • Omnifarious
    Omnifarious about 7 years
    @AbhinavGauniyal - My information on that was rather old and came from around 2003, when Ulrich Drepper got a bug up his butt and implemented the futex system call and generally improved Linux threading significantly. Also: cs.utexas.edu/~witchel/372/lectures/POSIX_Linux_Threading.pdf
  • Omnifarious
    Omnifarious about 6 years
    @Jeff - I thought it was a lot cheaper than it is. I updated my answer awhile ago to reflect a test I did to discover the actual cost.
  • cmaster - reinstate monica
    cmaster - reinstate monica about 6 years
    In the first part, you are somewhat underestimating how much has to be done to create a thread, and how little has to be done to call a function. A function call and return is a few CPU instructions that manipulate a few bytes on the top of the stack. Thread creation means: 1. allocating a stack, 2. performing a syscall, 3. creating data structures in the kernel and linking them up, grabbing locks along the way, 4. waiting for the scheduler to execute the thread, 5. switching context to the thread. Each of these steps in itself takes much longer than the most complex function calls.
  • cmaster - reinstate monica
    cmaster - reinstate monica about 6 years
    As such, I find it a bit thick to say "the Linux kernel people should work on making thread creation cheaper than it currently is". Those guys are working hard to deliver performance that other OSes can only dream of, but they have to stay within the specs too. Thread creation is a costly operation by definition.
  • Omnifarious
    Omnifarious about 6 years
    @cmaster - Why does a lot have to be done to create a thread? There are architectures in which creating and destroying threads of execution by the hundreds or thousands is the norm. Maybe it's the concept of what a thread is that needs revisiting. Maybe there should be a CPU change that lets you use one instruction to recruit another core, so you can split your medium-sized (1000-50000 iteration) loop in half. All of these things are possible.
  • cmaster - reinstate monica
    cmaster - reinstate monica about 6 years
    @Omnifarious A function call takes about 20 CPU cycles (you measured less because some of the overhead was hidden behind your test harness). A memory allocation can easily take 200 CPU cycles. A syscall is no less than about 200ns. Grabbing a lock is inter-thread communication, which needs to be performed within the kernel; expect something around a microsecond. And I have not yet started on the overhead of setting up page tables or flushing the TLB. If specialized hardware allows for faster thread creation, it's because the hardware is optimized for that; X86 CPUs are not.
  • Omnifarious
    Omnifarious about 6 years
    @cmaster - Grabbing a lock, on Linux, requires no system call at all. There is only a system call if there's contention. I've tested this with a highly contended int counter wrapped in a ::std::mutex. In counting to 10000 with two threads there were something like 10 calls to the futex system call. The Intel people listen to the Linux people. If a CPU change would help, it can happen. Why do you need to do anything with the TLB?
  • cmaster - reinstate monica
    cmaster - reinstate monica about 6 years
    I was talking about the locks that need to be grabbed within the kernel. Remember, the kernel itself is a parallel application. There are some overheads in the hardware itself that make even atomic operations for inter-thread communication significantly slower than normal memory accesses. The TLB comes into play if a thread is started on a different core, where the preceding time slot was used by something other than your process. We are talking about parallelism here, after all. I don't know if Linux can optimize the TLB flush away when switching between threads of the same process, though.
  • Omnifarious
    Omnifarious about 6 years
    @cmaster - I think there are a lot of 'out-of-the-box' possibilities. I'm not trying to imply that the kernel people are incompetent. I suspect thread creation on Linux is cheaper than NT, for example. No, I just think some really creative thinking applied to this problem might yield some very interesting and impressive results. I suppose saying "they need to make thread creation faster" doesn't really get that across.
  • cmaster - reinstate monica
    cmaster - reinstate monica about 6 years
    Now I think we are really starting to agree :-) Anyway, I think that current X86 CPUs are not really built for either fast thread switching or fast thread creation. They are built for optimal benchmark performance, which means that anything as costly as thread creation won't really be optimized, because it will not show up in the benchmarks in the first place. It would be interesting to know what could be achieved if we were to start over with CPU instruction set design - I have a feeling that we would be able to produce much better results in multi-threading and virtualization.
  • Omnifarious
    Omnifarious over 5 years
    @cmaster - One idea would be an instruction to start a thread that took advantage of some sort of 'free core' state maintained in shared memory that could only be accessed or modified via this instruction if the CPU was in user mode. If there weren't any free cores, it would leave a fast path and revert to being a system call. That could make auto-parallelizing loops with only a thousand iterations a performance win.
  • moi
    moi over 2 years
    @Omnifarious your bitbucket link is not available. Can you add your code to GitHub and update the link?
  • Omnifarious
    Omnifarious over 2 years
    @moi - It is on Github. But, as a matter of personal choice, I don't generally link to Github since they don't support Mercurial, and I use Mercurial for all of my personal version control. I'll link to it on OSDN though.