Writing a (spinning) thread barrier using C++11 atomics


Solution 1

It looks needlessly complicated. Try this simpler version (well, I haven't tested it, I just meditated on it:))) :

#include <atomic>

class spinning_barrier
{
public:
    spinning_barrier (unsigned int n) : n_ (n), nwait_ (0), step_(0) {}

    bool wait ()
    {
        unsigned int step = step_.load ();

        if (nwait_.fetch_add (1) == n_ - 1)
        {
            /* OK, last thread to come.  */
            nwait_.store (0); // XXX: maybe can use relaxed ordering here ??
            step_.fetch_add (1);
            return true;
        }
        else
        {
            /* Run in circles and scream like a little girl.  */
            while (step_.load () == step)
                ;
            return false;
        }
    }

protected:
    /* Number of synchronized threads. */
    const unsigned int n_;

    /* Number of threads currently spinning.  */
    std::atomic<unsigned int> nwait_;

    /* Number of barrier synchronizations completed so far,
     * it's OK to wrap.  */
    std::atomic<unsigned int> step_;
};

EDIT: @Grizzly, I can't find any errors in your first (C++11) version and I've also run it for like a hundred million syncs with two threads and it completes. I've run it on a dual-socket/quad-core GNU/Linux machine though, so I'm rather inclined to suspect your option 3 - the library (or rather, its port to win32) is not mature enough.
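
For reference, a minimal usage sketch of the class above (the work() function and the two-thread setup are only illustrative and assume a working <thread> header; the count given to the constructor must equal the number of threads that call wait()):

#include <thread>

spinning_barrier barrier (2);

void work ()
{
    /* ... phase 1 ... */
    barrier.wait ();   /* nobody starts phase 2 before everyone has finished phase 1 */
    /* ... phase 2 ... */
}

int main ()
{
    std::thread t1 (work), t2 (work);
    t1.join ();
    t2.join ();
}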

Solution 2

I have no idea if this is going to be of help, but the following snippet from Herb Sutter's implementation of a concurrent queue uses a spinlock based on atomics:

std::atomic<bool> consumerLock;

{   // the critical section
    while (consumerLock.exchange(true)) { }  // this is the spinlock

    // do something useful

    consumerLock = false;  // unlock
}

In fact, the Standard provides a purpose-built type for this construction that is required to have lock-free operations, std::atomic_flag. With that, the critical section would look like this:

std::atomic_flag consumerLock = ATOMIC_FLAG_INIT;

{
    // critical section

    while (consumerLock.test_and_set()) { /* spin */ }

    // do stuff

    consumerLock.clear();
}

(You can use acquire and release memory ordering there if you prefer.)
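
For illustration, a sketch of that same critical section with the orderings spelled out (test_and_set with acquire, clear with release; ATOMIC_FLAG_INIT gives the flag a defined clear state in C++11):

#include <atomic>

std::atomic_flag consumerLock = ATOMIC_FLAG_INIT;

{
    // critical section

    while (consumerLock.test_and_set(std::memory_order_acquire)) { /* spin */ }

    // do stuff

    consumerLock.clear(std::memory_order_release);
}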

Solution 3

Here is an elegant solution from the book C++ Concurrency in Action: Practical Multithreading.

struct bar_t {
    unsigned const count;
    std::atomic<unsigned> spaces;
    std::atomic<unsigned> generation;
    bar_t(unsigned count_) :
        count(count_), spaces(count_), generation(0)
    {}
    void wait() {
        unsigned const my_generation = generation;
        if (!--spaces) {
            spaces = count;
            ++generation;
        } else {
            while(generation == my_generation);
        }
    }
};
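
A minimal usage sketch of bar_t (illustrative only, not from the book): each thread calls wait() at the end of every iteration, and nobody starts the next iteration until all of them have arrived.

#include <atomic>
#include <thread>
#include <vector>

bar_t barrier(4);   // one barrier shared by four worker threads

void worker()
{
    for (int iter = 0; iter < 3; ++iter) {
        /* ... per-thread work for this iteration ... */
        barrier.wait();   // rendezvous point for all four threads
    }
}

int main()
{
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        threads.emplace_back(worker);
    for (auto &t : threads)
        t.join();
}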

Solution 4

Here is a simple version of mine:

// spinning_mutex.hpp
#include <atomic>


class spinning_mutex
{
private:
    std::atomic<bool> lockVal;
public:
    spinning_mutex() : lockVal(false) {}

    void lock()
    {
        while(lockVal.exchange(true));
    }

    void unlock()
    {
        lockVal.store(false);
    }

    bool is_locked()
    {
        return lockVal.load();
    }
};

Usage (from the std::lock_guard example):

#include <thread>
#include <mutex>
#include "spinning_mutex.hpp"

int g_i = 0;
spinning_mutex g_i_mutex;  // protects g_i

void safe_increment()
{
    std::lock_guard<spinning_mutex> lock(g_i_mutex);
    ++g_i;

    // g_i_mutex is automatically released when lock
    // goes out of scope
}

int main()
{
    std::thread t1(safe_increment);
    std::thread t2(safe_increment);

    t1.join();
    t2.join();
}

Solution 5

I know the thread is a little bit old, but since it is still the first Google result when searching for a thread barrier using C++11 only, I want to present a solution that gets rid of the busy waiting using std::condition_variable. Basically it is the solution of chill, but instead of the while loop it uses std::condition_variable::wait() and std::condition_variable::notify_all(). In my tests it seems to work fine.

#include <atomic>
#include <condition_variable>
#include <mutex>


class SpinningBarrier
{
    public:
        SpinningBarrier (unsigned int threadCount) :
            threadCnt(threadCount),
            step(0),
            waitCnt(0)
        {}

        bool wait()
        {
            std::unique_lock<std::mutex> lock(mutex);
            unsigned char s = step;

            if(waitCnt.fetch_add(1) >= threadCnt - 1)
            {
                // last thread to arrive: reset the counter,
                // advance the step and wake the waiters
                waitCnt.store(0);
                step += 1;
                condVar.notify_all();
                return true;
            }
            else
            {
                // sleep until the last thread advances the step
                condVar.wait(lock, [&]{return step != s;});
                return false;
            }
        }
    private:
        const unsigned int threadCnt;
        unsigned char step;

        std::atomic<unsigned int> waitCnt;
        std::condition_variable condVar;
        std::mutex mutex;
};

Comments

  • Grizzly
    Grizzly almost 2 years

    I'm trying to familiarize myself with C++11 atomics, so I tried writing a barrier class for threads (before someone complains about not using existing classes: this is more for learning/self-improvement than due to any real need). My class looks basically as follows:

    class barrier
    {
    private:
        std::atomic<int> counter[2];
        std::atomic<int> lock[2];
        std::atomic<int> cur_idx;
        int thread_count;
    public:
        //constructors...
        bool wait();
    };
    

    All members are initialized to zero, except thread_count, which holds the appropriate count. I have implemented the wait function as follows:

    int idx  = cur_idx.load();
    if(lock[idx].load() == 0)
    {
        lock[idx].store(1);
    }
    int val = counter[idx].fetch_add(1);
    if(val >= thread_count - 1)
    {
        counter[idx].store(0);
        cur_idx.fetch_xor(1);
        lock[idx].store(0);
        return true;
    }
    while(lock[idx].load() == 1);
    return false;
    

    However, when trying to use it with two threads (thread_count is 2), the first thread gets into the wait loop just fine, but the second thread doesn't unlock the barrier (it seems it doesn't even get to int val = counter[idx].fetch_add(1);, but I'm not too sure about that). However, when I'm using gcc atomic intrinsics by using volatile int instead of std::atomic<int> and writing wait as follows:

    int idx = cur_idx;
    if(lock[idx] == 0)
    {
        __sync_val_compare_and_swap(&lock[idx], 0, 1);
    }
    int val = __sync_fetch_and_add(&counter[idx], 1);
    if(val >= thread_count - 1)
    {
        __sync_synchronize();
        counter[idx] = 0;
        cur_idx ^= 1;
        __sync_synchronize();
        lock[idx] = 0;
        __sync_synchronize();
        return true;
    }
    while(lock[idx] == 1);
    return false;
    

    it works just fine. From my understanding there shouldn't be any fundamental differences between the two versions (more to the point if anything the second should be less likely to work). So which of the following scenarios applies?

    1. I got lucky with the second implementation and my algorithm is crap
    2. I didn't fully understand std::atomic and there is a problem with the first variant (but not the second)
    3. It should work, but the experimental implementation for c++11 libraries isn't as mature as I have hoped

    For the record, I'm using 32-bit MinGW with GCC 4.6.1.

    The calling code looks like this:

    spin_barrier b(2);
    std::thread t([&b]()->void
    {
        std::this_thread::sleep_for(std::chrono::duration<double>(0.1));
        b.wait();
    });
    b.wait();
    t.join();
    

    Since MinGW doesn't have <thread> headers yet, I use a self-written version for that which basically wraps the appropriate pthread functions (before someone asks: yes, it works without the barrier, so it shouldn't be a problem with the wrapping). Any insights would be appreciated.

    Edit: Explanation for the algorithm to make it clearer:

    • thread_count is the number of threads which shall wait for the barrier (so if thread_count threads are in the barrier all can leave the barrier).
    • lock is set to one when the first (or any) thread enters the barrier.
    • counter counts how many threads are inside the barrier and is atomically incremented once for each thread
    • if counter>=thread_count all threads are inside the barrier so counter and lock are reset to zero
    • otherwise the thread waits for the lock to become zero
    • in the next use of the barrier, different variables (counter, lock) are used to ensure there are no problems if threads are still waiting on the first use of the barrier (e.g. they had been preempted when the barrier is lifted)

    Edit 2: I have now tested it using gcc 4.5.1 under Linux, where both versions seem to work just fine, which seems to point to a problem with MinGW's std::atomic. But I'm still not completely convinced, since looking into the <atomic> header revealed that most functions simply call the appropriate GCC atomic intrinsic, meaning there really shouldn't be a difference between the two versions.

  • sehe
    sehe over 12 years
    oh - so much more peaceful on the eyes. Fewer than 6 -> and __ per square inch is a blessing. And, you can have whitespace too! I think I'll go and clean up that OP
  • Kerrek SB
    Kerrek SB over 12 years
    Thanks :-) The actual queue is actually pretty neat, too. It's on Dr Dobbs, Measuring Parallel Performance: Optimizing a Concurrent Queue.
  • sehe
    sehe over 12 years
    I've seen it! One of those warning posts (stay away from hand-rolled lock-free unless...). I prefer using TBB or LibCds (PS. formatted OP)
  • Kerrek SB
    Kerrek SB over 12 years
    @sehe: But what choice is there other than hand-rolled? I saw LibCds, is it any good? Doxygen makes my skin crawl... There's some guy who made "Boost lockfree", which is not part of Boost, but again no idea how good it is. Is TBB general C++, or only for Intel compiler?
  • sehe
    sehe over 12 years
    LibCds is pretty darn good. It has maps, stacks, queues, order lists with a number of (generic pluggable) garbage collection strategies. The threading library can be plugged (i use pthreads) and the most important reason why it isn't in boost, is potential licensing issues with the algorithms, IIRC. Edit Following the link to Boost lockfree, I remember seeing some conversation between the developer of LibCds and Tim Blechmann about implementation specifics and possible inclusion in boost.
  • sehe
    sehe over 12 years
  • Kerrek SB
    Kerrek SB over 12 years
    I didn't realize that maps even exist. I think there was some theoretical paper or so, but there was some fundamental problem with it (a tree, that is). I think that someone proved that it's impossible without garbage collection.
  • sehe
    sehe over 12 years
  • Grizzly
    Grizzly over 12 years
    Unfortunately, since I'm trying to understand why my specific algorithm doesn't work, that doesn't help at all
  • Kerrek SB
    Kerrek SB over 12 years
    @Grizzly: Yeah, sorry. I can't say I see through your algorithm just yet; I just vaguely thought this might be interesting. Sorry!
  • sehe
    sehe over 12 years
    nice work - though I never think of screaming girls while meditating
  • chill
    chill over 12 years
    @sehe, oh noes, he did it! He messed with my holy GNU formatting! :P
  • sehe
    sehe over 12 years
    Huh. I read lots of GNU code, I'm pretty sure I haven't seen those half-indents. Now, it could be an SO artifact, or just my sense of direction when making my way through GNU code :) Ok, I'll be more conservative next time
  • Grizzly
    Grizzly over 12 years
    I wouldn't say my version is that much more complicated, but that version does look nicer, so thanks for that. Unfortunately it seems to have the same problems as my original version (just did a quick test, need to verify some more to be sure)
  • Grizzly
    Grizzly over 12 years
    @sehe: libCds looks pretty interesting; however, from looking over the code/documentation it seems to be pretty much specialized for Linux/GCC and Windows/MSVC. Any information about using/compiling it with MinGW on Windows or GCC on Mac? (aka have I missed something which enables that, or is it possible to do without too much work)
  • sehe
    sehe over 12 years
    @Grizzly I haven't done it, but I don't see what the issue would be, really. You could try (or I could, but that has O(infinity) time complexity in the worst case).
  • Grizzly
    Grizzly over 12 years
    @sehe: from what I have seen it won't work without changes, since at the very least the OS detection in compiler/gcc/defs.h will not recognize either OS, and somewhere it seemed to make decisions based on OS first, compiler second (so no gcc path on Windows). Also the makefile seems to check the OS depending on the compiler (so no recognition of Windows for gcc). While these issues are not too bad to fix, those are only issues a quick grep revealed. I might still try it, but I doubt it would be straightforward, so I simply hoped there might be existing experience
  • Nathan Binkert
    Nathan Binkert over 11 years
    spinning_locker can likely be replaced by std::lock_guard
  • Mikhail
    Mikhail almost 11 years
    This is fine but I think the point of the thread was the spinlock.
  • IneQuation
    IneQuation over 10 years
    @ali_bahoo your implementation is almost correct. Almost, because here: bool expected = false; while(!lockVal.compare_exchange_strong( expected,true ) ); after the first call to compare_exchange_strong, the value of expected becomes true, and on the subsequent spins, the lock will let any thread through. Check out the std::atomic<> reference. Correct way would be: bool expected = false; while(!lockVal.compare_exchange_strong( expected,true ) ) expected = false;
  • ali_bahoo
    ali_bahoo over 10 years
    @IneQuation: No. expected is not a class member. Each thread entering the Lock() method will face a new expected variable. It is just a value on the stack, that will be cleaned up when exiting the Lock() method. So you do not have to set it to false.
  • IneQuation
    IneQuation over 10 years
    No, @ali_bahoo, you are wrong. :) You may also test your code, it doesn't work as intended, and this is because the value of expected on the stack is being overwritten by compare_exchange_strong. The reference I linked says: "Compares the contents of the contained value with expected: - if true, it replaces the contained value with val (like store). - if false, it replaces expected with the contained value ." This breaks expected for the next loop iteration.
  • ali_bahoo
    ali_bahoo over 10 years
    @IneQuation expected is just a variable, just a non-const reference parameter to compare_exchange_strong function. The value of LockVal is important. compare_exchange_strong compares the value of 'lockVal' if it is false. If not, it spins, until the lock is unlocked.
  • IneQuation
    IneQuation over 10 years
    Sorry, but no, you are wrong - the contents of expected are changed if the comparison fails, as the reference I quoted clearly states. I only wrote here because I have actually tried using your code and it failed; I started reading up and arrived at the conclusions I've already stated above. Take it or leave it, I'm not here to argue on the internet just for the argument's sake.
  • ali_bahoo
    ali_bahoo over 10 years
    @IneQuation God! What was I thinking? For this purpose, better approach is to use std::atomic::exchange() instead of compare_exchange_strong. It is simpler.
  • Maxim Egorushkin
    Maxim Egorushkin almost 10 years
    This is a critical section though, not a barrier. And a poor one, because it does not use the CPU pause instruction.
  • Kerrek SB
    Kerrek SB almost 10 years
    @MaximYegorushkin: Yeah, that's true, it's not a barrier. There's no "pause" in C++, unfortunately. Maybe this_thread::yield might work, but that seems wrong (we don't want to deschedule, just spin efficiently.) Bear in mind that the OP doesn't want to use a ready-made implementation.
  • Maxim Egorushkin
    Maxim Egorushkin almost 10 years
    There's no "pause" in C++, unfortunately. - I find such answers akin to limiting belief systems. If that were true, libraries such as TBB and many useful C++ applications would not exist. Just because something is outside of scope of the C++ standard does not make it impossible to use.
  • mchiasson
    mchiasson over 9 years
    It's very close to my simple version. But instead of using std::atomic<bool>, I use std::atomic_flag with std::atomic_flag::test_and_set(std::memory_order_acquire) for acquiring and std::atomic_flag::clear(std::memory_order_release) for releasing. This technique will give you a pure lockless barrier.
  • mchiasson
    mchiasson about 9 years
    spinlocks are typically lockless (free of mutex)
  • eyeshield21
    eyeshield21 over 8 years
    Yes, nice. This is hidden away round about page 267 for anyone wanting more information as barrier does not appear in the index of my Manning 2012 copy.
  • neocpp
    neocpp almost 8 years
    I know this is a bit dated, but is this really wait-free? I thought wait-free meant that any thread completes its operation in at most a bounded number of steps regardless of what other threads are doing, which is violated if any sort of lock is involved (spinlock or otherwise). The linked article in the comments doesn't seem to mention wait-free at all, although there is another article that looks at a lock-free implementation (I ran out of views though, so I can't actually read that one).
  • Kerrek SB
    Kerrek SB almost 8 years
    @neocpp: No, the queue is neither wait-free, nor even lock-free, because of the spinlock. The article was fairly cavalier about those terms; let me remove them from the answer. Note that instead of an atomic<bool> you could/should use std::atomic_flag, which is required to have lock-free operations (unlike atomic<bool>).
  • neocpp
    neocpp almost 8 years
    Upvoted with the update, the wait-free implementation with a spinlock looked like an oxymoron to me so I just wanted to make sure I wasn't missing anything.