Lock-Free Design: Building a High-Performance SPSC Queue

Recently, I began working on a primitive graphics library written in C++ (source code here) that can draw, erase, and move shapes on the screen. It's a simple project, but I really enjoyed designing both the frontend API and the backend systems. One of my earliest design decisions, how the input handler interacted with the render engine, was one that piqued my interest and led me down a rabbit hole of concurrent system design. But first, let's look at the problem at hand.

The Queue

A queue is the natural answer here: the input handler pushes events, and the render engine drains them. Previously, I built a MPMC (multi-producer, multi-consumer) work queue for my particle simulator that was protected by a mutex and condition variable. This approach was incredibly effective given that the problem involved genuine contention with multiple threads racing to acquire a shared resource. In this problem, however, with exactly one producer thread and one consumer thread, that contention is largely absent, making a mutex an unnecessarily heavy primitive. This realization naturally pointed me towards a concurrent data structure without mutexes, which sent me down a long-winded journey of learning the secret sauce to the black magic that is lock-free programming.

I found this CppCon talk by Charles Frasch, which turned out to be exactly what I was looking for and acted as a great stepping stone in this process. The discussion outlined a single-producer, single-consumer (SPSC) concurrent queue that is free of any lock-based structure, operating solely with atomic variables. At its core, the underlying data structure at hand is a ring buffer, or a fixed-capacity queue that maintains separate read and write indices into a contiguous block of memory. The read and write indices are managed using atomic variables to avoid the numerous possible race conditions that can occur otherwise with partially observed index updates.

Why Lock-Free?

Why even consider a lock-free approach? Mutexes are a tried and true way to combat data races in shared resources. However, even in the absence of heavy contention, acquiring and releasing a mutex inherently carries overhead. This is either a kernel transition or a futex operation on every push and pop, regardless of whether another thread is actually waiting. With only one producer and one consumer, we can mitigate this per-operation cost entirely by coordinating directly through atomics.

Sequential Consistency

We will begin by enforcing sequentially-consistent memory ordering, where all threads agree on a single global order of atomic operations. This is the strongest guarantee available, but only needed in extreme cases. However, it is a good starting point regardless. To do this, we can use the following structure for pushing and popping elements, assuming that the queue has a capacity of some power of two, allowing for efficient modulo calculation via a bitwise and (read more about this here). Note that pop does not return value as, in our case, the queued item represents a command that takes effect upon being dequeued.

#include <atomic>
#include <cstddef>
#include <cstring>
 
std::atomic_size_t writeIdx_{}, readIdx_{};
T* buff_;
std::size_t capacity_; // power of two capacity
std::size_t mask = capacity_ - 1;
 
bool push(const T& val) noexcept {
  std::size_t writeIdx = writeIdx_.load(std::memory_order_seq_cst);
  std::size_t readIdx = readIdx_.load(std::memory_order_seq_cst);
  if (writeIdx - readIdx == capacity_) {
    return false;
  }
  std::memcpy(&buff_[writeIdx & mask], &val, sizeof(T));
  writeIdx_.store(writeIdx + 1, std::memory_order_seq_cst);
  return true;
}
 
bool pop() noexcept {
  std::size_t readIdx = readIdx_.load(std::memory_order_seq_cst);
  std::size_t writeIdx = writeIdx_.load(std::memory_order_seq_cst);
  if (readIdx == writeIdx) {
    return false;
  }
  // destruct object if necessary
  readIdx_.store(readIdx + 1, std::memory_order_seq_cst);
  return true;
}

This, naturally, is fairly slow and unoptimized. On my ARM machine, sequentially consistent memory operations use the ldar and stlr assembly instructions, which prevent the CPU from reordering any surrounding memory operations, even those that are unrelated, to ensure that any prior writes are globally visible before any subsequent operation proceeds. We will see shortly, however, that this approach is overkill. To illustrate this, I created a file to test throughput on a basic example, where two threads push and pop elements from the queue and spin when there is no work to be done to simulate contention. I stored 8-bit unsigned intergers in the queue for simplicity. Across $2^{27}$ operations per thread, the throughput results were as follows. Note that I compiled with the -O3 flag.

\begin{array}{|l|r|} \hline \textbf{Metric} & \textbf{Value} \\ \hline \text{Ops per thread} & 134{,}217{,}728 \\ \text{Time (s)} & 6.286 \\ \text{Throughput (ops/sec)} & 21{,}351{,}575 \\ \hline \end{array}

\textit{Table 1: Sequential Consistency}

Although this test may not be perfect, it demonstrates the queue's performance in a sustained load situation. A natural follow-up is: Can we do any better?

Acquire-Release Semantics

The answer is yes. Sequential consistency forces all threads to agree on a single global order of every atomic operation which is a much stronger guarantee than we actually need. In our SPSC queue, the only ordering that matters is between the producer and consumer threads where the consumer must not read a slot until the producer's write to that slot is visible. Acquire-release semantics guarantee exactly this.

The mental model is similar to a lock: when you release a mutex, every write you made while holding it becomes visible to whoever acquires the mutex next. In the context of acquire-release, a store with release semantics ensures that all preceding writes, including non-atomic ones (e.g. our call to memcpy in the snippet above), are visible to any thread that subsequently performs an acquire-load of the same atomic variable and observes the stored value. Critically, this prevents the compiler and CPU from reordering operations across the barrier (for more, see here). Without release semantics, the compiler may reorder the store to writeIdx_ ahead of the memcpy, meaning that the consumer could see an updated index and read from a slot with incomplete data. The benefit of acquire-release is it provides the same fence-like guarantee that mutexes provide without the costly syscalls and context switches.

Now, we can apply these semantics to our queue. It is important to note that both the producer and consumer threads both "own" an index in the sense that they each increment one: the producer increments the write index, and the consumer increments the read index. When either thread increments its own index after completing an operation, the store uses release semantics. This ensures that the preceding read or write to the buffer slot is visible before the updated index is published. The load of the other thread's index, however, uses acquire semantics which ensures that we see all writes that preceded its release-store. When a thread loads its own index, there is no other writer to synchronize with, so we can further relax the memory constraints with which we load the variable. In C++, relaxed memory ordering reports the variable as-is, meaning we may obtain a stale value if the other thread increments it. However, this does not break our implementation: since the indices only increment, a stale read acts as a lower bound of the given index. The queue may appear full or empty when it isn't, but we can never pop from a truly empty queue or push to a truly full one. On my ARM machine, this distinction matters at the instruction level: relaxed operations compile to plain register load and store instructions (ldr and str), and, as of ARMv8.3, acquire-loads use ldapr: a lighter-weight alternative to the ldar instruction that sequential consistency utilizes.

Changing the memory orderings in this way leads to the following results:

\begin{array}{|l|r|} \hline \textbf{Metric} & \textbf{Value} \\ \hline \text{Ops per thread} & 134{,}217{,}728 \\ \text{Time (s)} & 4.225 \\ \text{Throughput (ops/sec)} & 31{,}771{,}221.239 \\ \hline \end{array}

\textit{Table 2: Improved Memory Ordering}

The question remains: can we do better?

Again, the answer is yes. To see how, we must understand how our queue struct is laid out in memory. The member variables are placed contiguously, in the order they were declared. If the two atomic indices end up on the same cache line, which, at only 8 bytes each, is highly likely, then every time a thread writes to its index, the cache coherency protocol invalidates that entire cache line for the other core. The other thread's next read, even of a completely different variable on that cache line, must then reload the line from a shared level of cache or from the other core directly. This is the false sharing problem, and can be a significant bottleneck in high-throughput scenarios.

To mitigate this issue, we can add align and pad our variables so that each atomic variable lays at the beginning of one cache line, while the rest of the members (size, capacity, etc.) are placed in their own cache line. We can accomplish this like so:

#include <atomic>
#include <cstddef>
#include <new>
 
alignas(
    std::hardware_destructive_interference_size) std::atomic_size_t readIdx_;
alignas(
    std::hardware_destructive_interference_size) std::atomic_size_t writeIdx_;
// ensure that no other members are stored on this cache line
char padding[std::hardware_destructive_interference_size -
             sizeof(std::atomic_size_t)];
... // rest of the members

Although this did not improve the throughput of our program by a ton, it's an improvement nonetheless.

\begin{array}{|l|r|} \hline \textbf{Metric} & \textbf{Value} \\ \hline \text{Ops per thread} & 134{,}217{,}728 \\ \text{Time (s)} & 3.915 \\ \text{Throughput (ops/sec)} & 34{,}284{,}759.519 \\ \hline \end{array}

\textit{Table 3: Aligned Variables}

Cached Index Strategy

The next optimization takes advantage of a subtle property of the read and write indices: they only ever increment. When either thread needs to check the other thread's index, it currently performs a costly acquire-load every time. But we don't always need to. This strategy, used by many efficient SPSC queue implementations including rigtorp's SPSCQueue, is to cache the other thread's index locally and acquire-load it as infrequently as possible. Consider the push method: before writing to the buffer, we check if the queue is full by comparing the write index against the read index. Since the read index is only ever incremented by the consumer, any cached value of it acts as a lower bound for the true value. If the cached value tells us the queue isn't full, we can skip the acquire-load entirely. We only need to acquire-load the actual read index when the cached value suggests the queue is full. But even then, we fail only if the true value confirms it. Applying the same logic to pop: caching the write index and only acquire-loading it when the queue appears empty, yields a roughly 3x increase in throughput:

\begin{array}{|l|r|} \hline \textbf{Metric} & \textbf{Value} \\ \hline \text{Ops per thread} & 134{,}217{,}728 \\ \text{Time (s)} & 1.215 \\ \text{Throughput (ops/sec)} & 110{,}509{,}923.584 \\ \hline \end{array}

\textit{Table 4: Cached Index Strategy}

Batched Push

The last optimization came about while developing the graphics library that this queue is a part of. I wanted to support selecting multiple items and moving them in unison. Naively, this means one push call per item — each with its own release-store to the write index. But if we know all the items up front, we can memcpy them all into the buffer and increment the write index just once, taking us from $n$ atomic stores down to one. Using this strategy with the same pop method as before, the results speak for themselves:

\begin{array}{|l|r|} \hline \textbf{Metric} & \textbf{Value} \\ \hline \text{Ops per thread} & 134{,}217{,}728 \\ \text{Time (s)} & 0.614 \\ \text{Throughput (ops/sec)} & 218{,}592{,}933.598 \\ \hline \end{array}

\textit{Table 5: Batched Push}

From roughly 21 million ops/sec to over 218 million, I'm very happy with the results overall. The code for the graphics library along with the queue is in this github repo.