Recently, I began working on a primitive graphics library written in C++ (source code here) that can draw, erase, and move shapes on the screen. It's a simple project, but I really enjoyed designing both the frontend API and the backend systems. One of my earliest design decisions, how the input handler interacted with the render engine, was one that piqued my interest and led me down a rabbit hole of concurrent system design. But first, let's look at the problem at hand.
The Queue
A queue is the natural answer here: the input handler pushes events, and the render engine drains them. Previously, I built a MPMC (multi-producer, multi-consumer) work queue for my particle simulator that was protected by a mutex and condition variable. This approach was incredibly effective given that the problem involved genuine contention with multiple threads racing to acquire a shared resource. In this problem, however, with exactly one producer thread and one consumer thread, that contention is largely absent, making a mutex an unnecessarily heavy primitive. This realization naturally pointed me towards a concurrent data structure without mutexes, which sent me down a long-winded journey of learning the secret sauce to the black magic that is lock-free programming.
I found this CppCon talk by Charles Frasch, which turned out to be exactly what I was looking for and acted as a great stepping stone in this process. The discussion outlined a single-producer, single-consumer (SPSC) concurrent queue that is free of any lock-based structure, operating solely with atomic variables. At its core, the underlying data structure at hand is a ring buffer, or a fixed-capacity queue that maintains separate read and write indices into a contiguous block of memory. The read and write indices are managed using atomic variables to avoid the numerous possible race conditions that can occur otherwise with partially observed index updates.
Why Lock-Free?
Why even consider a lock-free approach? Mutexes are a tried and true way to combat data races in shared resources. However, even in the absence of heavy contention, acquiring and releasing a mutex inherently carries overhead. This is either a kernel transition or a futex operation on every push and pop, regardless of whether another thread is actually waiting. With only one producer and one consumer, we can mitigate this per-operation cost entirely by coordinating directly through atomics.
Sequential Consistency
We will begin by enforcing sequentially-consistent memory ordering, where all threads agree on a single global order of atomic operations. This is the strongest guarantee available, but only needed in extreme cases. However, it is a good starting point regardless. To do this, we can use the following structure for pushing and popping elements, assuming that the queue has a capacity of some power of two, allowing for efficient modulo calculation via a bitwise and (read more about this here). Note that pop does not return value as, in our case, the queued item represents a command that takes effect upon being dequeued.
#include <atomic>
#include <cstddef>
#include <cstring>
std::atomic_size_t writeIdx_{}, readIdx_{};
T* buff_;
std::size_t capacity_; // power of two capacity
std::size_t mask = capacity_ - 1;
bool push(const T& val) noexcept {
std::size_t writeIdx = writeIdx_.load(std::memory_order_seq_cst);
std::size_t readIdx = readIdx_.load(std::memory_order_seq_cst);
if (writeIdx - readIdx == capacity_) {
return false;
}
std::memcpy(&buff_[writeIdx & mask], &val, sizeof(T));
writeIdx_.store(writeIdx + 1, std::memory_order_seq_cst);
return true;
}
bool pop() noexcept {
std::size_t readIdx = readIdx_.load(std::memory_order_seq_cst);
std::size_t writeIdx = writeIdx_.load(std::memory_order_seq_cst);
if (readIdx == writeIdx) {
return false;
}
// destruct object if necessary
readIdx_.store(readIdx + 1, std::memory_order_seq_cst);
return true;
}This, naturally, is fairly slow and unoptimized. On my ARM machine, sequentially
consistent memory operations use the ldar and stlr assembly instructions,
which prevent the CPU from reordering any surrounding memory operations, even
those that are unrelated, to ensure that any prior writes are globally visible
before any subsequent operation proceeds. We will see shortly, however, that
this approach is overkill. To illustrate this, I created a file to test
throughput on a basic example, where two threads push and pop elements from the
queue and spin when there is no work to be done to simulate contention. I stored
8-bit unsigned intergers in the queue for simplicity. Across operations
per thread, the throughput results were as follows. Note that I compiled with
the -O3 flag.
Although this test may not be perfect, it demonstrates the queue's performance in a sustained load situation. A natural follow-up is: Can we do any better?
Acquire-Release Semantics
The answer is yes. Sequential consistency forces all threads to agree on a single global order of every atomic operation which is a much stronger guarantee than we actually need. In our SPSC queue, the only ordering that matters is between the producer and consumer threads where the consumer must not read a slot until the producer's write to that slot is visible. Acquire-release semantics guarantee exactly this.
The mental model is similar to a lock: when you release a mutex, every write you
made while holding it becomes visible to whoever acquires the mutex next. In the
context of acquire-release, a store with release semantics ensures that all
preceding writes, including non-atomic ones (e.g. our call to memcpy in the
snippet above), are visible to any thread that subsequently performs an
acquire-load of the same atomic variable and observes the stored value.
Critically, this prevents the compiler and CPU from reordering operations across
the barrier (for more, see
here).
Without release semantics, the compiler may reorder the store to writeIdx_
ahead of the memcpy, meaning that the consumer could see an updated index and
read from a slot with incomplete data. The benefit of acquire-release is it
provides the same fence-like guarantee that mutexes provide without the costly
syscalls and context switches.
Now, we can apply these semantics to our queue. It is important to note that
both the producer and consumer threads both "own" an index in the sense that
they each increment one: the producer increments the write index, and the
consumer increments the read index. When either thread increments its own index
after completing an operation, the store uses release semantics. This ensures
that the preceding read or write to the buffer slot is visible before the
updated index is published. The load of the other thread's index, however, uses
acquire semantics which ensures that we see all writes that preceded its
release-store. When a thread loads its own index, there is no other writer to
synchronize with, so we can further relax the memory constraints with which we
load the variable. In C++, relaxed memory ordering reports the variable as-is,
meaning we may obtain a stale value if the other thread increments it. However,
this does not break our implementation: since the indices only increment, a
stale read acts as a lower bound of the given index. The queue may appear full
or empty when it isn't, but we can never pop from a truly empty queue or push to
a truly full one. On my ARM machine, this distinction matters at the instruction
level: relaxed operations compile to plain register load and store instructions
(ldr and str), and, as of ARMv8.3, acquire-loads use ldapr: a
lighter-weight alternative to the ldar instruction that sequential consistency
utilizes.
Changing the memory orderings in this way leads to the following results:
The question remains: can we do better?
False Sharing
Again, the answer is yes. To see how, we must understand how our queue struct is laid out in memory. The member variables are placed contiguously, in the order they were declared. If the two atomic indices end up on the same cache line, which, at only 8 bytes each, is highly likely, then every time a thread writes to its index, the cache coherency protocol invalidates that entire cache line for the other core. The other thread's next read, even of a completely different variable on that cache line, must then reload the line from a shared level of cache or from the other core directly. This is the false sharing problem, and can be a significant bottleneck in high-throughput scenarios.
To mitigate this issue, we can add align and pad our variables so that each atomic variable lays at the beginning of one cache line, while the rest of the members (size, capacity, etc.) are placed in their own cache line. We can accomplish this like so:
#include <atomic>
#include <cstddef>
#include <new>
alignas(
std::hardware_destructive_interference_size) std::atomic_size_t readIdx_;
alignas(
std::hardware_destructive_interference_size) std::atomic_size_t writeIdx_;
// ensure that no other members are stored on this cache line
char padding[std::hardware_destructive_interference_size -
sizeof(std::atomic_size_t)];
... // rest of the membersAlthough this did not improve the throughput of our program by a ton, it's an improvement nonetheless.
Cached Index Strategy
The next optimization takes advantage of a subtle property of the read and write
indices: they only ever increment. When either thread needs to check the other
thread's index, it currently performs a costly acquire-load every time. But we
don't always need to. This strategy, used by many efficient SPSC queue
implementations including rigtorp's
SPSCQueue, is to cache the other
thread's index locally and acquire-load it as infrequently as possible. Consider
the push method: before writing to the buffer, we check if the queue is full by
comparing the write index against the read index. Since the read index is only
ever incremented by the consumer, any cached value of it acts as a lower bound
for the true value. If the cached value tells us the queue isn't full, we can
skip the acquire-load entirely. We only need to acquire-load the actual read
index when the cached value suggests the queue is full. But even then, we fail
only if the true value confirms it. Applying the same logic to pop: caching
the write index and only acquire-loading it when the queue appears empty, yields
a roughly 3x increase in throughput:
Batched Push
The last optimization came about while developing the graphics library that this
queue is a part of. I wanted to support selecting multiple items and moving them
in unison. Naively, this means one push call per item — each with its own
release-store to the write index. But if we know all the items up front, we can
memcpy them all into the buffer and increment the write index just once,
taking us from atomic stores down to one. Using this strategy with the same
pop method as before, the results speak for themselves:
From roughly 21 million ops/sec to over 218 million, I'm very happy with the results overall. The code for the graphics library along with the queue is in this github repo.