Implementation of a work stealing queue in C/C++?

17,269

Solution 1

No free lunch.

Please take a look the original work stealing paper. This paper is hard to understand. I know that paper contains theoretical proof rather than pseudo code. However, there is simply no such much more simple version than TBB. If any, it won't give optimal performance. Work stealing itself incurs some amount of overhead, so optimizations and tricks are quite important. Especially, dequeues are must be thread-safe. Implementing highly scalable and low-overhead synchronizations are challenging.

I'm really wondering why you need it. I think that proper implementation means something like TBB and Cilk. Again, work stealing is hard to implement.

Solution 2

To implement "work stealing" isn't hard in theory. You need a set of queues containing tasks that do work by doing a combination of computing and generating other tasks to do more work. And you need atomic access to the queues to place newly generated tasks into those queues. Finally, you need a procedure that each task calls at the end, to find more work for the thread that executed the task; that procedure needs to look in work queues to find work.

Most such work-stealing systems make the assumption that there are a small number of threads (backed up typically by real processor cores), and that there is a exactly one work queue per thread. Then you first try to steal work from your own queue, and if it is empty, try to steal from others. What gets tricky is knowing which queues to look in; scanning them serially for work is pretty expensive and can create a huge amount of contention between threads looking for work.

So far this is all pretty generic stuff with one two major exceptions: 1) switching contexts (e.g, setting processor context registers such as a "stack") cannot be stated in pure C or C++. You can resolve this by agreeing to write part of your package in target-platform specific machine code. 2) Atomic access to the queues for a multiprocessor cannot be done purely in C or C++ (ignoring Dekker's algorithm), and so you'll need to code those using assembly language synchronization primitives like the X86 LOCK XCH or Compare and Swap. Now, the code involved in updating the queuse once you have safe access isn't very complex, and you could easily write that in a few lines of C.

However, I think you will find is that attempting to code such a package in C and C++ with mixed assembler is still rather inefficient and you'll eventually end up coding the entire thing in assembler anyway. All's that left are C/C++ compatible entry points :-}

I did this for our PARLANSE parallel programming language, which offers the idea of an arbitrarily large number of parallel computations live and interacting (synchonizing) at any instant. It is implemented behind the scenes on an X86 exactly with one thread per CPU, and the implementation is entirely in assembler. The work-stealing code is probably 1000 lines total, and its tricky code because you want it to be extremely fast in the non-contention case.

The real fly in the ointment for C and C++ is, when you create a task representing work, how much stack space do you assign? Serial C/C++ programs avoid this question by simply overallocating huge amounts (e.g, 10Mb) of one linear stack, and nobody cares much about how much of that stack space is wasted. But if you can create thousands of tasks and have them all live at a particular instant, you can't reasonably allocate 10Mb to each one. So now you either need to determine statically how much stack space a task will need (Turing-hard), or you'll need to allocate stack chunks (e.g., per function call), which widely available C/C++ compilers don't do (e.g, the one you are likely using). THe last way out is to throttle task creation to limit it to a few hundred at any instant, and multiplex a few-hundred really huge stacks among the tasks that are live. You can't do the last if the tasks can interlock/suspend state, because you'll run into your threshold. So you can only do this if the tasks only do computation. That seems like a pretty severe constraint.

For PARLANSE, we built a compiler that allocates activation records on the heap for each function call.

Solution 3

There exist a tool to simply doing it in an very elegant way. It is a really effective way to parrallelize your program in a very short time.

Cilk project

HPC Challenge Award

Our Cilk entry for the HPC Challenge Class 2 award won the 2006 award for ``Best Combination of Elegance and Performance''. The award was made at SC'06 in Tampa on November 14 2006.

Solution 4

If you're looking for a standalone workstealing queue class implementation in C++ built on pthread or boost::thread, good luck, to my knowledge there isn't one.

However, as others have said Cilk, TBB and Microsoft's PPL all have workstealing implementations under the hood.

The question is do you want to use a workstealing queue or implement one? If you just want to use one then the choices above are good starting points simply scheduling a 'task' in any of them will suffice.

As BlueRaja said the task_group & structured_task_group in PPL will do this, also note that these classes are available in the latest version of Intel's TBB as well. The parallel loops (parallel_for, parallel_for_each) are also implemented with workstealing.

If you must look at source rather than use an implementation, TBB is OpenSource and Microsoft ships the sources for its CRT, so you can go spelunking.

You can also look on Joe Duffy's blog for a C# implementation (but it's C# and the memory model is different).

-Rick

Solution 5

This open source library https://github.com/cpp-taskflow/cpp-taskflow supports work stealing thread pools since Dec 2018.

Take a look at the WorkStealingQueue class which implements the work stealing queue as described in the paper "Dynamic Circular Work-stealing Deque," SPAA, 2015.

Share:
17,269
Admin
Author by

Admin

Updated on June 03, 2022

Comments

  • Admin
    Admin almost 2 years

    I'm looking for a proper implementation of a work stealing queue in C/CPP. I've looked around Google but haven't found anything useful.

    Perhaps someone is familiar with a good open-source implementation? (I prefer not to implement the pseudo-code taken from the original academic papers).

  • Phil Miller
    Phil Miller over 14 years
    Or you do the sane thing, and don't allocate space to tasks until they're actually running, and don't think of tasks as things to suspend and resume, but rather to run from execution to completion.
  • Ira Baxter
    Ira Baxter over 14 years
    Your solution isn't sane. If you build complex systems, when one piece of work can call arbitrary other pieces of work, you can't guarantee that your task won't need suspension. You can certainly force this property to be true; you'll then have a hard time building complex systems. We build million-line paralell programs in PARLANSE.
  • Ira Baxter
    Ira Baxter about 10 years
    How well does Linux do with a process with 10,000 threads? Windows bombs out at ~15,000 threads per process. blogs.technet.com/b/markrussinovich/archive/2009/07/08/…. I want to have literally millions of "threads" that individually need to wait on events. PARLANSE can do that. I don't think Linux or Windows OSes are configured to handle a million threads well. I'd expect all kinds of resource troubles, including managing just the thread handles.
  • Dess
    Dess over 5 years
    It never fails: You see 'Ira Baxter' as the author and you just know the post is littered with advertising for some 3rd party program. How this guy not been banned yet for all the shilling is beyond me.
  • PSkocik
    PSkocik over 5 years
    How do you approach the tricky part of "knowing which queues to look in"?
  • Ira Baxter
    Ira Baxter over 5 years
    @PSkocik: You mean "if I'm CPU k, which other queue 1..N do I look in for work to steal?" The awful way if for k to simply scan all the other queues if his is empty. With 4 queues this might be ok, not as attractive with 32-64 queues. A better way that adds some overhead is to keep a bit vector in a single word that tracks which queues have work; it can be updated cheaply with OR and AND. ...
  • Ira Baxter
    Ira Baxter over 5 years
    ... You can make that bit vector accurate if you lock the operations but that makes it expensive to update destroying its purpose. So I do this unsynchronized which means it is only advisory. Still, a pretty good hint where to look first.
  • PSkocik
    PSkocik over 5 years
    Thanks. I rather like these parallel programming hacks. :)
  • Sergey K.
    Sergey K. over 5 years
    This library github.com/cpp-taskflow/cpp-taskflow supports work stealing since Dec 2018.