Single-file lock-free SPSC byte ring for embedded C++.
nc_ring.h is a C++ library for moving bytes between one producer thread and one consumer thread through a fixed buffer, without a lock. It never owns the buffer, never allocates, and never grows: the buffer is caller-provided and the cursors only advance on success, which makes it suitable for a UART or DMA pipe, an inter-core hand-off, or any single-producer/single-consumer queue on a target with only a few KB of RAM to spare.
The ring is a single nc_ring struct holding a base pointer, a capacity, and two
free-running cursors. It makes no heap allocations, no system calls, and takes no
locks. There is no read-modify-write on the hot path — each cursor has exactly one
writer, so a push or pop is a plain atomic load and store paired with acquire and
release. That means it is lock-free on cores that have no exclusive instructions at
all, including the Cortex-M0. Because the cursors run free rather than wrapping in
place, a full ring (Write - Read == Capacity) is distinct from an empty one, so
the buffer holds its whole capacity with no wasted slot.
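The full-versus-empty arithmetic can be sketched in a few lines. This is an illustration only, assuming 32-bit unsigned cursors and a power-of-two capacity; the `demo_` names are hypothetical and are not part of nc_ring's API.

```cpp
#include <stdint.h>

// Hypothetical illustration of free-running cursors; not nc_ring's internals.
struct demo_cursors {
    uint32_t Write;     // total bytes ever written, wraps modulo 2^32
    uint32_t Read;      // total bytes ever read, wraps modulo 2^32
    uint32_t Capacity;  // power of two
};

// Bytes currently stored. Unsigned subtraction stays correct even when
// Write has wrapped past 2^32 and Read has not yet.
static inline uint32_t demo_used(const demo_cursors& c) { return c.Write - c.Read; }

// Bytes that can still be written.
static inline uint32_t demo_free(const demo_cursors& c) { return c.Capacity - demo_used(c); }

// Empty and full are different cursor distances, so no slot is sacrificed.
static inline bool demo_empty(const demo_cursors& c) { return c.Write == c.Read; }
static inline bool demo_full(const demo_cursors& c)  { return demo_used(c) == c.Capacity; }

// The mask is applied only when a cursor is turned into a buffer offset.
static inline uint32_t demo_index(const demo_cursors& c, uint32_t cursor) {
    return cursor & (c.Capacity - 1u);
}
```

When the ring is full, both cursors map to the same buffer offset, yet `demo_full` and `demo_empty` still disagree because they compare the unmasked cursors.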
The producer and consumer cursors are placed on separate cache lines, sized from the detected CPU, so the two threads never share a line and the only cross-core traffic is the cursor hand-off itself. The acquire/release pairing makes the ring correct on a weakly-ordered multicore part: the producer publishes the payload before the cursor, the consumer observes the cursor before the payload, and neither can see half-written data. This is validated end-to-end on a dual-core ESP32-WROOM (see Performance below).
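The publish/observe ordering described above could look roughly like the sketch below. It is a minimal illustration built on the GCC/Clang `__atomic` builtins; the `demo_ring` layout, the fixed `DEMO_CACHE_LINE` value, and the helper names are assumptions made for this example, not nc_ring's actual code.

```cpp
#include <stdint.h>
#include <string.h>

#define DEMO_CACHE_LINE 64  // assumed line size for this sketch only

// Hypothetical ring: each cursor sits on its own cache line.
struct demo_ring {
    uint8_t* Base;
    uint32_t Capacity;                        // power of two
    alignas(DEMO_CACHE_LINE) uint32_t Write;  // stored only by the producer
    alignas(DEMO_CACHE_LINE) uint32_t Read;   // stored only by the consumer
};

// Copy n bytes into the buffer starting at cursor `at`, splitting at the end.
static void demo_copy_in(demo_ring* r, uint32_t at, const void* src, uint32_t n) {
    uint32_t off = at & (r->Capacity - 1u);
    uint32_t first = r->Capacity - off;       // bytes before the buffer end
    if (first > n) first = n;
    memcpy(r->Base + off, src, first);
    memcpy(r->Base, (const uint8_t*)src + first, n - first);
}

// Copy n bytes out of the buffer starting at cursor `at`.
static void demo_copy_out(const demo_ring* r, uint32_t at, void* dst, uint32_t n) {
    uint32_t off = at & (r->Capacity - 1u);
    uint32_t first = r->Capacity - off;
    if (first > n) first = n;
    memcpy(dst, r->Base + off, first);
    memcpy((uint8_t*)dst + first, r->Base, n - first);
}

// Producer: all-or-nothing write. Returns n, or 0 when there is no room.
static uint32_t demo_write(demo_ring* r, const void* src, uint32_t n) {
    uint32_t w  = __atomic_load_n(&r->Write, __ATOMIC_RELAXED);  // own cursor
    uint32_t rd = __atomic_load_n(&r->Read, __ATOMIC_ACQUIRE);   // consumer progress
    if (r->Capacity - (w - rd) < n) return 0;
    demo_copy_in(r, w, src, n);
    // Release: the payload becomes visible before the new cursor value.
    __atomic_store_n(&r->Write, w + n, __ATOMIC_RELEASE);
    return n;
}

// Consumer: all-or-nothing read. Returns n, or 0 when fewer bytes are stored.
static uint32_t demo_read(demo_ring* r, void* dst, uint32_t n) {
    uint32_t rd = __atomic_load_n(&r->Read, __ATOMIC_RELAXED);   // own cursor
    uint32_t w  = __atomic_load_n(&r->Write, __ATOMIC_ACQUIRE);  // producer progress
    if (w - rd < n) return 0;
    demo_copy_out(r, rd, dst, n);
    // Release: the copy-out finishes before the space is handed back.
    __atomic_store_n(&r->Read, rd + n, __ATOMIC_RELEASE);
    return n;
}
```

Each cursor is loaded relaxed by its owner and with acquire by the other side, so neither path needs a read-modify-write instruction.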
The library has no platform dependencies. With NCRING_NO_STDLIB defined it drops
every standard library include and uses internal MemSet / MemCpy, so it builds in
a freestanding environment.
Warning
Defining NCRING_NO_STDLIB currently selects the fallback MemCpy implementation,
which copies a single byte at a time. This severely limits throughput (see
Performance). If NCRING_NO_STDLIB is necessary, provide a better implementation
for your architecture through NCRING_MEMCPY.
Copy nc_ring.h into your project.
In one C++ source file, define NCRING_IMPLEMENTATION before including the
header:
#define NCRING_IMPLEMENTATION
#include "nc_ring.h"

All other files that need the API include the header without the define.
Tip
To confine all symbols to a single translation unit, NCRING_STATIC can be
defined alongside NCRING_IMPLEMENTATION.
The atomics use the GCC/Clang __atomic builtins; the header errors on any other
compiler.
The following macros can be defined before including the header to replace default dependencies:
| Macro | Default | Purpose |
|---|---|---|
| NCRING_NO_STDLIB | Undefined | Suppresses <stdint.h> / <string.h>; caller must provide u8, u16, u32, u64, i8, i16, i32, i64, b32, b8, f32, f64 typedefs |
| NCRING_STATIC | Undefined | With NCRING_IMPLEMENTATION, gives every symbol internal linkage (NCRING_DEF becomes static) |
| NCRING_MEMCPY(d, s, n) | memcpy | Payload copy. This sits on the read and write hot path, so point it at an optimised word-copy, not a byte loop (see Performance) |
NC_CPU_CACHE_LINE_SIZE is detected from the target (32 on Xtensa and 32-bit ARM,
64 on 64-bit ARM/RISC-V/x86) and sets the cursor padding; change the value
manually to override the detected size.
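For a freestanding build, a replacement for the byte-at-a-time fallback might look like the sketch below. It is one possible word-at-a-time copy, not a tuned routine: `demo_word_copy` and `demo_word` are hypothetical names, the `may_alias` attribute is GCC/Clang-specific, and the sketch assumes NCRING_MEMCPY is invoked with memcpy-style arguments as shown in the table.

```cpp
#include <stddef.h>
#include <stdint.h>

// 32-bit word type that is allowed to alias other types (GCC/Clang attribute).
typedef uint32_t __attribute__((__may_alias__)) demo_word;

// Hypothetical word-at-a-time copy. Falls back to bytes when the two
// pointers cannot be brought to a common 4-byte alignment.
static void* demo_word_copy(void* dst, const void* src, size_t n) {
    uint8_t* d = (uint8_t*)dst;
    const uint8_t* s = (const uint8_t*)src;

    // Same misalignment on both sides: a few head bytes make both aligned.
    if ((((uintptr_t)d ^ (uintptr_t)s) & 3u) == 0) {
        while (n && ((uintptr_t)d & 3u)) {  // head bytes up to alignment
            *d++ = *s++;
            --n;
        }
        for (; n >= 4; n -= 4, d += 4, s += 4)  // aligned word loop
            *(demo_word*)d = *(const demo_word*)s;
    }

    while (n--)  // tail bytes, or the whole copy when misaligned
        *d++ = *s++;
    return dst;
}

// Define before including nc_ring.h so the ring uses this copy.
#define NCRING_MEMCPY(d, s, n) demo_word_copy((d), (s), (n))
```

A vendor-optimised memcpy, where one exists for the target, is likely to beat this sketch; the point is only that the macro is the hook for it.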
Bind the ring to a buffer whose size is a power of two — the wrap is a mask, so a non-power-of-two capacity is rejected by an assert. One thread writes, one thread reads; the ring carries no protection against a second producer or consumer.
u8 buffer[4096]; // power of two
nc_ring ring = nc_ring_init(buffer, sizeof(buffer));

nc_ring_write is the producer side. It is all-or-nothing: it copies the whole
record and returns its size, or copies nothing and returns 0 when there isn't
room. It never tears a record. Spin or retry on 0.
// ring full: consumer hasn't caught up
while (!nc_ring_write(&ring, &record, sizeof(record)))
    CPUPause();
nc_ring_write_struct(&ring, &record); // sizeof(*ptr) wrapper

nc_ring_read is the consumer side, symmetric: it delivers the whole record and
returns its size, or returns 0 when fewer than Size bytes are available.
// ring empty: producer hasn't published
while (!nc_ring_read(&ring, &record, sizeof(record)))
    CPUPause();
nc_ring_read_struct(&ring, &record); // sizeof(*ptr) wrapper

Tip
A CPUPause implementation is provided that should cover most common CPU architectures; if yours is missing,
it is easily added manually in nc_ring.h.
Exactly one thread may call nc_ring_write and exactly one (other) thread may call
nc_ring_read. The two cursors are published with release and observed with
acquire, so the producer and consumer may run on different cores of a weakly-ordered
part with no further synchronisation.
Measured on a dual-core ESP32-WROOM (Xtensa LX6, 240 MHz), producer pinned to APP_CPU and consumer to PRO_CPU, one million records per run, NCRING_MEMCPY mapped to the
toolchain memcpy. Correctness held across every run (errors: 0), which is the
cross-core acquire/release doing its job — a missing barrier on this part corrupts
or stalls, it does not pass.
| Record size | Per record | End-to-end throughput |
|---|---|---|
| 8 B | 1.38 µs | ~5.6 MB/s |
| 512 B | 5.40 µs | ~92 MB/s |
| 2048 B | 19.8 µs | ~99 MB/s |
The cost is linear in record size and fits t(n) ≈ 0.87 µs + 9 ns·n, which separates into two regimes around a ~100-byte knee:
The 0.87 µs intercept is the fixed per-operation cost — the cross-core cursor hand-off — independent of payload. Below the knee the ring is sync-bound and the copy is essentially free; this is the regime small control-plane records live in, and it is where the ring is at its best.
The 9 ns/byte slope is the copy. Above the knee the ring is copy-bound and the
throughput asymptotes toward internal-RAM bandwidth (~108 MB/s). nc_ring
copies every payload byte twice — once in, once out — so end-to-end throughput tops
out near half the raw memcpy bandwidth; for bulk movement, pass a pointer through
the ring instead of the bytes.
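The two regimes follow directly from the fitted line, and the arithmetic can be checked with a short sketch. The constants are the ones quoted in this section; the `demo_` function names are hypothetical, and the measured table remains the reference where the model and the measurements differ.

```cpp
// Fitted constants quoted above (ESP32-WROOM measurements).
static const double kFixedUs   = 0.87;   // per-operation intercept, in µs
static const double kPerByteUs = 0.009;  // 9 ns per byte, expressed in µs

// Modelled time per record of n bytes, in µs.
static double demo_model_us(double n) { return kFixedUs + kPerByteUs * n; }

// Record size at which the fixed cost equals the copy cost (the knee).
// 0.87 / 0.009 is roughly 97 bytes, matching the ~100-byte knee.
static double demo_knee_bytes() { return kFixedUs / kPerByteUs; }

// Modelled end-to-end throughput. Bytes per µs equals MB/s with MB = 10^6 bytes.
static double demo_throughput_mb_s(double n) { return n / demo_model_us(n); }
```

As n grows, the modelled throughput approaches 1 / (9 ns per byte), about 111 MB/s, which is in the same range as the ~108 MB/s figure; the gap is presumably rounding in the fitted constants.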
The single biggest lever on the slope is NCRING_MEMCPY. The NCRING_NO_STDLIB
fallback copies a byte at a time and caps large-record throughput roughly 7× below an optimised word-copy. If you go freestanding, override NCRING_MEMCPY with a real memcpy.
Push a few records, then drain them. With one writer and one reader on the same thread the ordering is trivial.
struct rec {
    u32 Seq;
    u32 Val;
};

u8 buffer[64]; // power of two; holds 8 records
nc_ring ring = nc_ring_init(buffer, sizeof(buffer));

for (u32 i = 0; i < 4; ++i) {
    rec r = {
        i,
        i * 10
    };
    nc_ring_write_struct(&ring, &r);
}

rec out;
while (nc_ring_read_struct(&ring, &out))
    printf("seq=%u val=%u\n", out.Seq, out.Val);

seq=0 val=0
seq=1 val=10
seq=2 val=20
seq=3 val=30

The real workload: a producer task on one core streams self-describing records, a consumer task on the other verifies them. Each record carries a scrambled check of its sequence number, so a dropped, reordered, duplicated, or torn record fails the check. This is the test that actually exercises the cross-core barriers.
#define NCRING_IMPLEMENTATION
#include "nc_ring.h"
#include <stdio.h> // printf
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "freertos/semphr.h"

#define RING_BYTES 4096u
#define RECORDS 1000000u

struct rec { u32 Seq; u32 Check; };

// Scrambled copy of the sequence number; the consumer recomputes and compares it.
static inline u32 scramble(u32 s) {
    return (s * 2654435761u) ^ 0xA5A5A5A5u;
}

static u8 g_buf[RING_BYTES];
static nc_ring g_ring;
static SemaphoreHandle_t g_done;
static volatile u32 g_errors;

static void
producer(void*)
{
    for (u32 seq = 0; seq < RECORDS; ++seq) {
        rec r = {
            seq,
            scramble(seq)
        };
        // Spin until the consumer frees enough room.
        while (!nc_ring_write_struct(&g_ring, &r))
            CPUPause();
    }
    vTaskDelete(NULL);
}

static void
consumer(void*)
{
    u32 errors = 0;
    for (u32 expected = 0; expected < RECORDS; ++expected) {
        rec r = {};
        // Spin until the producer publishes a whole record.
        while (!nc_ring_read_struct(&g_ring, &r))
            CPUPause();
        if (r.Seq != expected || r.Check != scramble(expected))
            ++errors;
    }
    g_errors = errors;
    xSemaphoreGive(g_done);
    vTaskDelete(NULL);
}

extern "C"
void
app_main(void)
{
    g_ring = nc_ring_init(g_buf, RING_BYTES);
    g_done = xSemaphoreCreateBinary();
    // Keep app_main above the workers while both are created, so neither
    // preempts setup before the other exists, then block to hand them the cores.
    vTaskPrioritySet(NULL, configMAX_PRIORITIES - 1);
    xTaskCreatePinnedToCore(producer, "prod", 4096, NULL, 5, NULL, 1);
    xTaskCreatePinnedToCore(consumer, "cons", 4096, NULL, 5, NULL, 0);
    xSemaphoreTake(g_done, portMAX_DELAY);
    printf("%s\n", g_errors ? "FAIL" : "PASS");
}

PASS

| Constraint | Value | Defined by |
|---|---|---|
| Buffer size | < 4 GiB (the largest power of two that fits in a u32 is 2 GiB) | u32 Capacity |
| Capacity | power of two | mask-based wrap + free-running cursors |
| Concurrency | 1 producer, 1 consumer | SPSC; lock-freedom relies on one writer per cursor |
Single-producer/single-consumer only. Exactly one thread may write and one other thread may read; a second writer or reader races with no protection, because the lock-free guarantee relies on each cursor having a single writer. For many-to-one or many-to-many, put a lock in front or use a different structure.
Capacity must be a power of two. The wrap is a bitmask and the full-versus-empty distinction depends on the capacity dividing 2³², so a non-power-of-two size is rejected at init.
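The reason a non-power-of-two capacity cannot work with free-running 32-bit cursors can be shown with a small check. The helpers below are hypothetical illustrations, not nc_ring functions; they test whether stepping a cursor across the 2^32 wrap still lands on the next buffer slot.

```cpp
#include <stdint.h>

// Standard power-of-two test: exactly one bit set.
static inline bool demo_is_pow2(uint32_t n) {
    return n != 0 && (n & (n - 1u)) == 0;
}

// Buffer slot for a free-running cursor. For a power-of-two capacity this
// equals cursor & (capacity - 1).
static inline uint32_t demo_slot(uint32_t cursor, uint32_t capacity) {
    return cursor % capacity;
}

// True when incrementing the cursor across the 2^32 wrap moves to the next
// slot. This holds only when the capacity divides 2^32.
static inline bool demo_wrap_is_continuous(uint32_t capacity) {
    uint32_t last = 0xFFFFFFFFu;
    uint32_t next = last + 1u;  // wraps to 0
    return demo_slot(next, capacity) == (demo_slot(last, capacity) + 1u) % capacity;
}
```

With a capacity of 64 the slot sequence runs 63, 0 across the wrap. With a capacity of 48 it jumps from slot 15 back to slot 0, so the cursors and the buffer contents would disagree.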
Writes and reads are all-or-nothing. A call moves the whole Size or none of it;
there is no partial transfer and no occupancy query. For streaming semantics —
"give me whatever is there" — build a partial-count variant on top of the cursors.
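A partial-count read could be sketched as follows. The struct is a hypothetical stand-in with the shape described earlier (base pointer, capacity, two free-running cursors); its field names and `demo_read_some` are assumptions, so adapting this to nc_ring means mapping them onto the real struct members.

```cpp
#include <stdint.h>
#include <string.h>

// Hypothetical stand-in ring; field names are assumptions, not nc_ring's layout.
struct demo_stream_ring {
    uint8_t* Base;
    uint32_t Capacity;  // power of two
    uint32_t Write;     // stored only by the producer
    uint32_t Read;      // stored only by the consumer
};

// Consumer side: copy up to `max` bytes and return how many were delivered.
// Returns 0 when the ring is empty.
static uint32_t demo_read_some(demo_stream_ring* r, void* dst, uint32_t max) {
    uint32_t rd = __atomic_load_n(&r->Read, __ATOMIC_RELAXED);   // own cursor
    uint32_t w  = __atomic_load_n(&r->Write, __ATOMIC_ACQUIRE);  // published bytes
    uint32_t n = w - rd;                                         // occupancy
    if (n > max) n = max;
    if (n == 0) return 0;

    uint32_t off = rd & (r->Capacity - 1u);
    uint32_t first = r->Capacity - off;  // bytes before the buffer end
    if (first > n) first = n;
    memcpy(dst, r->Base + off, first);
    memcpy((uint8_t*)dst + first, r->Base, n - first);

    // Release the consumed space back to the producer.
    __atomic_store_n(&r->Read, rd + n, __ATOMIC_RELEASE);
    return n;
}
```

The occupancy `w - rd` computed here is also the missing occupancy query, valid only as a lower bound on the consumer side because the producer may publish more at any moment.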
A byte ring copies each payload byte twice. End-to-end throughput is therefore
bounded by half the memcpy bandwidth (see Performance); move pointers, not bytes,
when the payload is large.
GCC and Clang only. The atomics are the __atomic builtins and the header errors on
other compilers.