Circular lock-free buffer


Solution 1

I've made a particular study of lock-free data structures over the last couple of years. I've read most of the papers in the field (there are only about forty or so, although only ten or fifteen are of any real use :-)

AFAIK, a lock-free circular buffer has not been invented. The hard part is handling the complex boundary condition where a reader overtakes a writer, or vice versa.

If you have not spent at least six months studying lock-free data structures, do not attempt to write one yourself. You will get it wrong, and it may not be obvious to you that errors exist until your code fails, after deployment, on new platforms.

I believe, however, that there is a solution to your requirement.

You should pair a lock-free queue with a lock-free free-list.

The free-list gives you pre-allocation and so obviates the (expensive) requirement for a lock-free allocator; when the free-list is empty, you replicate the behaviour of a circular buffer by immediately dequeuing an element from the queue and reusing that memory instead.

(Of course, in a lock-based circular buffer, obtaining an element once the lock is held is very quick - basically just a pointer dereference. You won't get that in any lock-free algorithm; lock-free code often has to go well out of its way to get things done, and the overhead of a failed free-list pop followed by a dequeue is on a par with the amount of work any lock-free algorithm has to do anyway.)

Michael and Scott developed a really good lock-free queue back in 1996. Their paper, "Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms" (PODC 1996), is easy to find by searching for "Michael Scott FIFO".
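
For concreteness, here is a very condensed sketch of their algorithm using C++11 atomics. This is an illustration, not code from the paper: safe memory reclamation is deliberately omitted (dequeued nodes are simply leaked), whereas the real algorithm pairs each pointer with a counter to defeat the ABA problem.

    #include <atomic>

    // Sketch of the Michael & Scott non-blocking FIFO queue.
    // WARNING: dequeued nodes are leaked here; the paper uses counted
    // pointers (pointer + generation counter) to allow safe node reuse.
    template <typename T>
    class MSQueue {
        struct Node {
            T value{};
            std::atomic<Node*> next{nullptr};
        };
        std::atomic<Node*> head;   // always points at a dummy node
        std::atomic<Node*> tail;
    public:
        MSQueue() {
            Node* dummy = new Node;
            head.store(dummy);
            tail.store(dummy);
        }
        void enqueue(T v) {
            Node* n = new Node;
            n->value = std::move(v);
            for (;;) {
                Node* t = tail.load();
                Node* next = t->next.load();
                if (t != tail.load()) continue;        // tail moved, retry
                if (next != nullptr) {                 // tail lagging: help advance it
                    tail.compare_exchange_weak(t, next);
                    continue;
                }
                Node* expected = nullptr;
                if (t->next.compare_exchange_weak(expected, n)) {
                    tail.compare_exchange_weak(t, n);  // swing tail; failure is benign
                    return;
                }
            }
        }
        bool dequeue(T& out) {
            for (;;) {
                Node* h = head.load();
                Node* t = tail.load();
                Node* next = h->next.load();
                if (h != head.load()) continue;        // inconsistent snapshot, retry
                if (h == t) {
                    if (next == nullptr) return false; // queue is empty
                    tail.compare_exchange_weak(t, next);
                    continue;
                }
                out = next->value;                     // read before releasing the node
                if (head.compare_exchange_weak(h, next))
                    return true;                       // old dummy h is leaked (see above)
            }
        }
    };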

A lock-free free-list is the simplest lock-free algorithm of all, and in fact I don't think I've ever seen an actual paper for it.
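
For illustration, a minimal free-list sketch - a so-called Treiber stack over pre-allocated nodes, with the names and payload type invented for the example. Note the caveat in the comments: once nodes can be popped and re-pushed concurrently, a plain pointer head is ABA-prone, so a production version needs a counted head pointer (as in Michael & Scott) or hazard pointers.

    #include <atomic>

    // Minimal sketch of a lock-free free-list (a Treiber stack).
    // Caveat: with concurrent pop/re-push, a plain pointer head is
    // ABA-prone; real code pairs the pointer with a counter.
    struct Buffer {
        Buffer* next;     // intrusive link, used only while on the free-list
        char data[256];   // the pre-allocated element (size is illustrative)
    };

    class FreeList {
        std::atomic<Buffer*> head{nullptr};
    public:
        void push(Buffer* b) {
            Buffer* old = head.load(std::memory_order_relaxed);
            do {
                b->next = old;
            } while (!head.compare_exchange_weak(old, b,
                          std::memory_order_release,
                          std::memory_order_relaxed));
        }
        Buffer* pop() {
            Buffer* old = head.load(std::memory_order_acquire);
            while (old != nullptr &&
                   !head.compare_exchange_weak(old, old->next,
                          std::memory_order_acquire,
                          std::memory_order_relaxed))
                ;
            return old;   // nullptr means empty
        }
    };

A pop() returning nullptr is exactly the point at which, as described above, you dequeue the oldest element from the queue and reuse it.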

Solution 2

The term of art for what you want is a lock-free queue. There's an excellent set of notes with links to code and papers by Ross Bencina. The guy whose work I trust the most is Maurice Herlihy (for Americans, he pronounces his first name like "Morris").

Solution 3

The requirement that producers or consumers block when the buffer is empty or full suggests that you should use a normal locking data structure, with semaphores or condition variables to make the producers and consumers block until data is available. Lock-free code generally doesn't block on such conditions - it spins, or abandons operations that can't be done, rather than blocking in the OS. (If you can afford to wait until another thread produces or consumes data, then why is waiting on a lock while another thread finishes updating the data structure any worse?)
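
For reference, the classic blocking version is short. This is a minimal sketch using std::mutex and std::condition_variable; all names are illustrative, and a real system would add shutdown handling and decide what T should be (one point, or a whole chunk of points).

    #include <condition_variable>
    #include <cstddef>
    #include <deque>
    #include <mutex>

    // Classic blocking bounded buffer: producers block when full,
    // consumers block when empty.
    template <typename T>
    class BoundedBuffer {
        std::deque<T> buf;
        const std::size_t capacity;
        std::mutex m;
        std::condition_variable not_full, not_empty;
    public:
        explicit BoundedBuffer(std::size_t cap) : capacity(cap) {}

        void put(T item) {
            std::unique_lock<std::mutex> lk(m);
            not_full.wait(lk, [&] { return buf.size() < capacity; });
            buf.push_back(std::move(item));
            not_empty.notify_one();
        }

        T take() {
            std::unique_lock<std::mutex> lk(m);
            not_empty.wait(lk, [&] { return !buf.empty(); });
            T item = std::move(buf.front());
            buf.pop_front();
            not_full.notify_one();
            return item;
        }
    };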

On (x86/x64) Linux, inter-thread synchronization using mutexes is reasonably cheap if there is no contention. Concentrate on minimizing the time that the producers and consumers need to hold onto their locks. Given that you've said that you only care about the last N recorded data points, I think a circular buffer would do this reasonably well. However, I don't really understand how this fits in with the blocking requirement and the idea of consumers actually consuming (removing) the data they read. (Do you want consumers to only look at the last N data points, and not remove them? Do you want producers to not care if consumers can't keep up, and just overwrite old data?)

Also, as Zan Lynx commented, you can aggregate/buffer your data into bigger chunks when you've got lots of it coming in; see the sketch below. You could buffer up a fixed number of points, or all the data received within a certain amount of time. This means fewer synchronization operations. It does introduce latency, but if you're not using real-time Linux, you'll have to deal with some latency anyway.
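
A sketch of that batching idea, with DataPoint, CHUNK, read_next_point() and the queue's put() all as illustrative stand-ins rather than a real API:

    #include <cstddef>
    #include <vector>

    // Accumulate points locally and hand over one chunk per
    // synchronization, amortizing the lock (or atomic op) over
    // CHUNK points.
    struct DataPoint { double value; long timestamp; };
    const std::size_t CHUNK = 256;

    DataPoint read_next_point();   // hypothetical feed handler, defined elsewhere

    template <typename SyncQueue>
    void producer_loop(SyncQueue& q) {
        std::vector<DataPoint> batch;
        batch.reserve(CHUNK);
        for (;;) {
            batch.push_back(read_next_point());
            if (batch.size() == CHUNK) {   // could also flush on a timeout
                q.put(std::move(batch));   // one synchronization per CHUNK points
                batch.clear();             // moved-from vector: reset and reuse
                batch.reserve(CHUNK);
            }
        }
    }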

Solution 4

The implementation in the Boost library (Boost.Lockfree) is worth considering. It's easy to use and fairly high-performance. I wrote a test & ran it on a quad-core i7 laptop (8 threads) and got ~4M enqueue/dequeue operations a second. Another implementation not mentioned so far is the MPMC queue at http://moodycamel.com/blog/2014/detailed-design-of-a-lock-free-queue. I have done some simple testing with this implementation on the same laptop with 32 producers and 32 consumers. It is, as advertised, faster than the Boost lockless queue.
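
Basic usage of the Boost queue looks like this (a minimal sketch, assuming Boost.Lockfree from Boost 1.53 or later; the capacity and the spin-retry policy are illustrative choices, not requirements):

    #include <boost/lockfree/queue.hpp>

    // boost::lockfree::queue requires a trivially-copyable element type.
    boost::lockfree::queue<int> q(1024);   // reserves space for 1024 elements up front

    void producer() {
        int value = 42;
        while (!q.push(value))
            ;   // push() returns false if the queue cannot accept the element
    }

    void consumer() {
        int value;
        while (!q.pop(value))
            ;   // pop() returns false when the queue is empty: spin, yield, or sleep
        // ... process value ...
    }

Note that both push and pop return false rather than blocking, which is exactly the non-blocking behaviour Solution 3 describes; any blocking/sleeping policy has to be built on top.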

As most of the other answers state, lockless programming is hard. Most implementations will have hard-to-detect corner cases that take a lot of testing & debugging to fix. These are typically fixed with careful placement of memory barriers in the code. You will also find proofs of correctness published in many of the academic articles. I prefer to verify these implementations with a brute-force tool: any lockless algorithm you plan on using in production should be checked for correctness using a model checker such as TLA+ (http://research.microsoft.com/en-us/um/people/lamport/tla/tla.html).

Solution 5

There is a pretty good series of articles about this on DDJ (Dr. Dobb's Journal). As a sign of how difficult this stuff can be, the series is a correction to an earlier article that got it wrong. Make sure you understand the mistakes before you roll your own ;-)




Comments

  • Shing Yip
    Shing Yip almost 2 years

    I'm in the process of designing a system which connects to one or more data feeds, does some analysis on the data, then triggers events based on the result. In a typical multi-threaded producer/consumer setup, I will have multiple producer threads putting data into a queue and multiple consumer threads reading the data, and the consumers are only interested in the latest data point plus the n points before it. The producer threads will have to block if slow consumers can not keep up, and of course the consumer threads will block when there are no unprocessed updates. Using a typical concurrent queue with a reader/writer lock would work nicely, but the rate of data coming in could be huge, so I want to reduce my locking overhead, especially the writer locks for the producers. I think a circular lock-free buffer is what I need.

    Now two questions:

    1. Is a circular lock-free buffer the answer?

    2. If so, before I roll my own, do you know of any public implementation that will fit my need?

    Any pointers in implementing a circular lock-free buffer are always welcome.

    BTW, doing this in C++ on Linux.

    Some additional info:

    The response time is critical for my system. Ideally the consumer threads will want to see any updates coming in as soon as possible because an extra 1 millisecond delay could make the system worthless, or worth a lot less.

    The design idea I'm leaning toward is a semi-lock-free circular buffer where the producer thread puts data into the buffer as fast as it can - call the head of the buffer A - without blocking unless the buffer is full, i.e. when A meets the end-of-buffer pointer Z. Consumer threads each hold two pointers into the circular buffer, P and Pn, where P is the thread's local buffer head and Pn is the nth item after P. Each consumer thread advances its P and Pn once it finishes processing its current P, and the end-of-buffer pointer Z is advanced to the slowest Pn.

    When P catches up to A, which means there are no new updates to process, the consumer spins and busy-waits for A to advance again. If a consumer thread spins for too long, it can be put to sleep and wait on a condition variable, but I'm okay with a consumer taking up CPU cycles waiting for updates, because that does not increase my latency (I'll have more CPU cores than threads).

    Imagine a circular track with the producer running in front of a bunch of consumers; the key is to tune the system so that the producer usually runs just a few steps ahead of the consumers, and most of these operations can be done with lock-free techniques. I understand that getting the details of the implementation right is not easy... okay, very hard, which is why I want to learn from others' mistakes before making a few of my own.

    • Dave
      Dave almost 15 years
      I think it would be helpful if you sketch the API that you want this data structure to implement.
    • Zan Lynx
      Zan Lynx almost 15 years
      Something I learned is to take big chunks of work. I don't know the size of your work items, but you can increase efficiency if you can produce bigger chunks and consume bigger chunks. You can also increase it by consuming variable-size chunks, so the consumers don't all finish at once and contend on the data queue.
    • Zan Lynx
      Zan Lynx almost 15 years
      Another thing to think of is if you need one buffer or a series of buffers. You could have producer/consumer pairs sharing one buffer, and when a buffer is full, the producer or consumer temporarily switches to another open buffer. It's a form of work stealing.
    • ChrisW
      ChrisW almost 15 years
      @Dave the api is presumably just one enqueue method plus one dequeue method ... both of them being synchronous/blocking methods, i.e. dequeue should block if there were nothing in the queue, and enqueue block if the circular buffer were full.
    • Dave
      Dave almost 15 years
      Efficient lock-free algorithms are unique snowflakes whose discovery usually merits a research article. I'm not going to attempt to answer this question until OP distinguishes his actual requirements from what he thinks a solution should look like.
    • Shing Yip
      Shing Yip almost 15 years
      Zan Lynx: Yes, bundling up work can reduce lock overhead. I had that in my previous systems, including a dynamic bundling size based on workload. It worked pretty well, but this time the data rate is too fast for my old system to handle; that's why I have to rethink the whole thing.
    • Doug
      Doug almost 15 years
      A millisecond is a very fast timing deadline on unmodified Linux. If another process gets to run, then you could easily miss it. You would need to use real-time priorities, and even then I'm not sure that you can reliably meet those deadlines. Are you sure that you need to be that responsive? Can you make just the producers that fast, implement them in e.g. a device driver, and relax the requirements on the consumers?
    • Shing Yip
      Shing Yip almost 15 years
      Doug, what I meant to say was that 1 extra millisecond will make a big difference. I have not done any profiling to see whether other system processes may be causing latency on my system, but I think that, on average, preemption by system processes isn't going to have any significant impact.
    • Tomas
      Tomas almost 13 years
      Why don't you just use semaphores? No locking/blocking: the consumer only goes to sleep when the buffer is empty, the producer when the buffer is full. What's wrong with that?
    • Jeremy Friesner
      Jeremy Friesner over 8 years
      If you really need hard-real-time behavior, it's probably worthwhile looking in to running the program as a hard-real-time task on a hard-real-time OS. For a Linux-friendly version, check out Xenomai, it will let you run hard-real-time and regular (soft-real-time/non-real-time) processes simultaneously in the same environment.
  • Shing Yip
    Shing Yip almost 15 years
    I have also considered the boss/worker threading model where the boss thread multicasts updates to the worker threads' private queues. I think this is more or less the direction you're going. I have to give it more thought, but when I was considering it, boss/worker seemed to have too much overhead because all the workers must get the same updates.
  • Shing Yip
    Shing Yip almost 15 years
    But the business logic requires all worker threads to see all the data that is streaming in. There is one and only one type of data coming in, and every data point is equally important, so I can't really slice my incoming stream and put different data in different queues. Caching on the producer side and bundling updates to the data model to prevent hogging has been done, and it wasn't enough to handle the load.
  • Nikolai Fetissov
    Nikolai Fetissov almost 15 years
    How big is the input domain? If it's something like market data in the financial world, you have a limited, though big, number of items and only a few types of updates. Are the workers reactive to the input events, or do they do their own processing and only poll your input when necessary?
  • Nikolai Fetissov
    Nikolai Fetissov almost 15 years
    Would you post a link to analysis of Sutter's queue?
  • rama-jka toti
    rama-jka toti almost 15 years
    It's all on DDJ, and one of the guys following his blog profiled it. The point is that a hot CAS isn't required for many scenarios, and you can beat that kind of fine granularity any day, even with simple swaps.
  • Shing Yip
    Shing Yip almost 15 years
    It is something like market data in the financial world. Workers do their own processing, and they have random access to the last n updates of history when needed (n is a configurable number but will not change during the lifetime of the process). I would like to design a system that works well for both large and small n so I can have one code base.
  • rama-jka toti
    rama-jka toti almost 15 years
    That's it, thanks. But I believe it still might have some races. Heck, anything as sensitive as relying on implicit barriers, or on a specific or full understanding of coherency, is a problem waiting to happen in production. I just don't believe that level of detail helps as much as focusing on application-level design, reaching for low-level plumbing only when/if it is identified to make sense. I applaud the effort, the books, all of it; but these are just articles on a tough topic that even MS is struggling to do well for the mass-market PFX crowd.
  • rama-jka toti
    rama-jka toti almost 15 years
    Just an opinion: there is always more important work to do than looking into plumbing. Parallel efforts ripple across the board, not just queues; the mid-1990s threading techniques from the DDJ articles keep being reinvented - from NT, to later Solaris and Unix adopting similar techniques, to the latest work on C++. The latter will probably take ages to complete, and still has to fight the fact that there is no clean OO way to model a P2/Pro-like out-of-order universe sensibly.
  • Nikolai Fetissov
    Nikolai Fetissov almost 15 years
    Well, you can hand off n-tuples to consumers with swap of a pointer to it (you'd need to worry about memory fences, etc. - ordered atomic) but then you'll get into memory management issues like with hazard pointers.
  • Admin
    Admin almost 15 years
    A queue is not a circular buffer.
  • Admin
    Admin almost 15 years
    Atomic operations are indeed slow, in and of themselves. What makes them useful is that they scale.
  • Norman Ramsey
    Norman Ramsey almost 15 years
    @Blank Xavier: No, but a circular buffer is a queue. The problem calls for a queue. And the most efficient implementation of a queue is as... (wait for it) a circular buffer. In any case, if you want to search, you search for 'lock-free queue', not 'lock-free circular buffer'.
  • supercat
    supercat over 13 years
    @Blank Xavier: The Michael and Scott FIFO looks a lot like one I independently implemented in .net; it didn't seem hard. If the .net run-time and garbage-collector didn't guarantee that objects will never be recycled while references exist, it would have been hard to prevent the ABA problem (the Michael and Scott paper linked above didn't mention this) but the .net garbage collector solves that problem automatically. Out of curiosity, how did you solve the ABA problem?
  • Admin
    Admin over 13 years
    @Supercat: the M&S paper explicitly solves the ABA problem, by using pointer-counter pairs; "structure pointer_t {ptr: pointer to node_t, count: unsigned integer}". The counter is incremented when necessary, so that it is almost impossible for ABA to occur.
  • supercat
    supercat over 13 years
    @Blank Xavier: Ah, I somehow missed that. Their CAS seems to assume the existence of an instruction to swap a two-item struct; I'm unaware of any such feature in .net.
  • Admin
    Admin over 13 years
    @Supercat: I'm surprised. Double-word CAS is available in the Win32 API. If you implement hazard pointers, I think you can still use their algorithm. I'll be trying that out soon enough.
  • supercat
    supercat over 13 years
    @Blank Xavier: Double-word CAS is exposed for arguments of type 'Long' only. I can think of three reasons: (1) A struct containing two 32-bit items may sit on an odd 32-bit boundary; (2) the most useful scenarios would involve pairing an object reference with something, and for some reason object references grow to 64 bits in x64 (I'd have left them as 32 bits, since 2+ billion objects would have been a lot more than 32-bit Windows can handle); (3) I'm not sure hardware will serialize a DCAS that occurs simultaneously with a CAS on half of it.
  • Admin
    Admin over 13 years
    In Win32, there are Interlocked*() API function calls for everything except 64-bit DCAS. 64-bit DCAS (i.e. 128 bits atomically CASed) has a different type of prototype, starting with an underscore, which may be why it's not obviously present. I don't -know- if DCAS serialises correctly with CAS, but I think it does - the basic mechanism is to lock the given cache lines, so DCAS will try to lock two cache lines whereas CAS will lock one, and DCAS will notice if someone else has locked the cache line it is trying to obtain.
  • Admin
    Admin over 13 years
    If you fully pre-allocate your elements, you do not stress the allocator.
  • Tomas
    Tomas almost 13 years
    I don't see any reason not to just use semaphores. No locking/blocking: the consumer only goes to sleep when the buffer is empty, the producer when the buffer is full. What's wrong with that? How could some lock-free queue be better than this?
  • Tomas
    Tomas almost 13 years
    Absolutely agree with the first paragraph. I don't see any reason for not using a semaphore here.
  • Norman Ramsey
    Norman Ramsey over 12 years
    @Tomas: the lock-free queue can be better because there's no single lock to act as a performance bottleneck. OP was specifically concerned about reducing lock overhead when contention is very high. Semaphores don't help with contention. A lock-free queue does.
  • Admin
    Admin almost 12 years
    Note that a lock-free fixed-length (array based, in fact) circular buffer, for a single reader and single writer, where the buffer being full leads to failure to write, exists and has existed for a long time.
  • supercat
    supercat almost 12 years
    I've used such single-reader/single-writer queues often on embedded systems. No special atomic primitives are required beyond "volatile" variables, since every storage location will be "owned" by either the reader or the writer, and neither the reader nor the writer will ever allow the other to see an illegitimate state.
  • Admin
    Admin almost 12 years
    I wonder if volatile is enough? Could there be a re-ordering issue? Might memory barriers be necessary?
  • supercat
    supercat almost 12 years
    Different compilers and implementations do different things with volatile variables. What is necessary here is mainly that if thread 1 writes two volatile variables X and Y, in that order, and if thread 2 reads Y then X, then if thread 2 sees the new value of Y, it must also see the new value of X. On the embedded systems where I've done such things, volatile variables provide such guarantees (not hard when there's only one processor). On multi-core systems, I believe most compilers will generate memory barriers sufficient to guarantee the above semantics with volatile variables...
  • supercat
    supercat almost 12 years
    ...although using volatile variables would likely also generate additional unnecessary (but costly) memory barriers as well. Personally, I suspect that as processors get more and more cores, it will become necessary to have memory-synchronization primitives that can operate at a finer level than memory barriers (whose cost increases with the number of cores).
  • Admin
    Admin almost 12 years
    As I understand it, volatile won't be needed on a single core system - there are no external sources changing the values in the variables; only threads on the same single core.
  • Admin
    Admin almost 12 years
    On multi-core systems, I know recent MSVCs will emit a read memory barrier for a volatile read and a write barrier for a volatile write, but I think that is totally compiler-dependent. I'd write explicit memory barriers (hopefully the compiler will remove any duplicates!).
  • supercat
    supercat almost 12 years
    Even on a single-core system, at least in C, volatile is needed to ensure that the compiler will not attempt to reorder or eliminate reads and writes. For example, if SPI_* are not volatile, for(i=0;i<16;i++){SPI_DAT = tx[i]; temp=16000; do {} while(!SPI_RDY && --temp); if (!temp) break; rx[i] = SPI_DAT;} could be rewritten (assuming neither temp nor i was used further along in the code) as if (SPI_RDY) {memcpy(rx,tx,16); SPI_DAT = tx[15];} else SPI_DAT = tx[0];. The latter code would be much more "efficient", but wouldn't be very effective at reading or writing SPI data.
  • Admin
    Admin almost 12 years
    I may be wrong, but I think unless the volatile is performing a memory barrier, it will not have any impact on the re-ordering behaviour of the compiler or CPU. All it will do is ensure its value is read from memory rather than a register or cache.
  • LanDenLabs
    LanDenLabs over 8 years
    Dennis' site has moved to landenlabs.com/code/ring/ring.html - It has a lock free ring buffer.
  • Solomon Slow
    Solomon Slow about 7 years
    @TMS, An application that has dedicated producer threads and dedicated consumer threads (e.g., as described by the OP) can never be called "lock-free." In a lock-free application with N threads, you should be able to indefinitely suspend any combination of (N-1) or fewer of the threads, and the application should still continue to make progress. But a dedicated consumer thread can not indefinitely make progress if all of the producers are suspended, and if dropping the "product" on the floor is not allowed, then a producer can not make progress if none of the consumers is allowed to run.
  • Peter Cordes
    Peter Cordes over 6 years
    IIRC, volatile memory accesses won't be reordered with other volatile accesses. But in ISO C, that's all you get. In MSVC, volatile goes way beyond that, but these days you should just use std::atomic with memory_order_release or seq_cst or whatever you want.
  • Peter Cordes
    Peter Cordes over 6 years
    @jameslarge: "lock-free" can describe an algorithm or data structure (like a queue). en.wikipedia.org/wiki/Non-blocking_algorithm It's not usually applied to a whole application. Suspending all the producer threads simply means the queue will empty. But in a lock-free queue, suspending any one or more threads at any time must not stop the other threads from being able to enqueue & dequeue. (Even this is not easy to achieve; efficient implementations will often have a thread "claim" a slot: stackoverflow.com/questions/45907210/…)
  • Peter Cordes
    Peter Cordes over 6 years
    @Doug: Yes, a consumer/producer should sleep if they find the queue empty/full. But no, it doesn't always mean you should implement the queue with traditional locking, especially not with one big lock for the whole data structure. This queue (same as previous comment) is not "lock-free" in the technical sense, but it does allow producers to be totally independent from consumers (no contention), giving potentially better throughput than you could get locking. Good point about efficient wake on empty->non-empty though.
  • Peter Cordes
    Peter Cordes over 6 years
    liblfds.org has a well-regarded multi-producer/multi-consumer circular buffer queue. It's not technically lock-free: a producer that sleeps in the middle of adding an entry can block consumers from seeing anything from other producers. See stackoverflow.com/questions/45907210/… for an analysis of its progress guarantees. It's very low-contention (none between a producer and a consumer if it's not empty or full), and may be what you want in practice..
  • Peter Cordes
    Peter Cordes over 6 years
    Is it truly lock-free, or only if a producer / consumer doesn't sleep after claiming a slot but before finishing the enqueue/dequeue? See this analysis of the multi-producer multi-consumer queue in liblfds.org, which probably works pretty similarly. In practice it works great with low contention, but it's not technically lock-free. Anyway, upvoted because bulk/burst mode sounds like a good idea.
  • Peter Cordes
    Peter Cordes over 6 years
    Use a power of 2 for your size, so the % (modulo) is just a bitwise AND. Also, storing a sequence number in your slots would reduce contention between producer and consumer. In this design, the producer has to read the write position, and vice-versa, so the cache line containing those atomic variables ping-pongs between cores. See stackoverflow.com/questions/45907210/… for a slot sequence-number approach. (It's a multi-producer multi-consumer queue, and could be greatly simplified to a single producer/consumer queue like this.)
  • Peter Cordes
    Peter Cordes over 6 years
    I'm pretty sure many of the loads/stores only need memory_order_acquire or release, not the default seq_cst. This is a big difference on x86, where seq_cst stores need mfence (or xchg), but release stores are just plain x86 stores. StoreLoad barriers are the most expensive barrier on most other architectures, too. (preshing.com/20120710/…)
  • Peter Cordes
    Peter Cordes over 6 years
    It would probably be better to put read after buffer in the class layout, so it's in a different cache line from write. So the two threads will only be reading cache lines written by the other, rather than both writing into the same cache line. Also, they should be size_t: there's no point having 64-bit counters with 32-bit pointers. And an unsigned type makes modulo much more efficient (godbolt.org/g/HMVL5C). Even uint32_t would be reasonable for almost all uses. It would probably be best to template this on size, or dynamically allocate the buffer.
  • Peter Cordes
    Peter Cordes over 6 years
    @user1959190: that lock-free implementation uses volatile size_t m_rIndex, m_wIndx; instead of C++11 std::atomic for the indices, but it looks to me like it depends on acquire-load / release-store behaviour from it (e.g. other threads have to see the m_buffer[m_wIndex] = value store before they see m_wIndex = Next(m_wIndex)). So it happens to work on x86, but will break on ARM/PowerPC/whatever. It's also inefficient, because instead of loading from the volatile into a local variable, it keeps re-referencing the volatile value in the Get() and Put() functions.
  • Saman Barghi
    Saman Barghi over 6 years
    I agree: it does not guarantee termination-safety, and according to 1024cores (1024cores.net/home/lock-free-algorithms/introduction), it is a blocking algorithm, so the system might not make forward progress. But it becomes wait-free in SPSC mode. I edited the answer to reflect this.
  • Saman Barghi
    Saman Barghi over 6 years
    Also take a look at Dmitry's implementation of a bounded MPMC queue: 1024cores.net/home/lock-free-algorithms/queues/…. Again, this is not lock-free but it is lockless, and it is very simple and efficient. However, in terms of throughput, the DPDK queue in bulk/burst mode can reach up to hundreds of millions of operations per second, depending on the batch size. The combination of atomic operations and sequential reads/writes makes it very efficient.
  • ed9w2in6
    ed9w2in6 about 5 years
    @PeterCordes Why is modulo more efficient with a power of 2?
  • Peter Cordes
    Peter Cordes about 5 years
    @BaiyanHuang: Because computers use binary integers, so you just keep the low n bits with an AND. e.g. x % 8 = x & 7, and bitwise AND is much cheaper than div, or even tricks you can do with compile-time-constant divisors.
  • Victor Polevoy
    Victor Polevoy almost 4 years
    LDD3? Linux Device Drivers 3?
  • U. Windl
    U. Windl over 3 years
    @user82238 A circular buffer is an implementation of a queue with bounded capacity, right?
  • yyny
    yyny about 3 years
    @PeterCordes Signed integers are generally faster for most operations; if your modulo is a constant power of two, you can use value & (power - 1) to get the same optimized modulo operation as with unsigned integers, while still allowing the compiler to perform extra optimizations because of the explicitly undefined behavior of edge cases in signed operations.
  • Peter Cordes
    Peter Cordes about 3 years
    @yyny - for most operations they're the same speed. Do you have a specific ISA in mind for your performance claim? On x86-64, one of the differences is that zero-extension from 32 to 64-bit is often free in asm, happening implicitly every time you write a 32-bit register. Sign-extension from int to pointer-width requires a special instruction. I know some other ISAs have different standard calling conventions (e.g. MIPS64 keeps values sign-extended to 64-bit on pass/return, so receiving a signed arg can be cheaper for a non-inlined function.) Signed overflow UB allows some optimizations.