usrnatc/nc_ring

nc_ring.h

Single-file lock-free SPSC byte ring for embedded C++.

About

nc_ring.h is a C++ library for moving bytes between one producer thread and one consumer thread through a fixed buffer, without a lock. It never owns the buffer, never allocates, and never grows: the buffer is caller-provided and the cursors only advance on success. That makes it suitable for a UART or DMA pipe, an inter-core hand-off, or any single-producer/single-consumer queue on a target with only a few KB of RAM to spare.

The ring is a single nc_ring struct holding a base pointer, a capacity, and two free-running cursors. It makes no heap allocations, no system calls, and takes no locks. There is no read-modify-write on the hot path: each cursor has exactly one writer, so a write or read is a plain atomic load and store with acquire/release ordering. That keeps it lock-free even on cores that have no exclusive-access instructions at all, including the Cortex-M0. Because the cursors run free rather than wrapping in place, a full ring (Write - Read == Capacity) is distinct from an empty one (Write == Read), so the buffer holds its whole capacity with no wasted slot.
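The free-running cursor scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the nc_ring.h source: sketch_ring, sketch_init, sketch_write and sketch_read are hypothetical names, and the field layout is guessed from the description.

```cpp
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Hypothetical stand-in for illustration only; not the nc_ring.h source.
struct sketch_ring {
    uint8_t *Base;      // caller-provided storage, never owned
    uint32_t Capacity;  // power of two
    uint32_t Write;     // free-running, stored only by the producer
    uint32_t Read;      // free-running, stored only by the consumer
};

static sketch_ring sketch_init(void *buffer, uint32_t capacity) {
    assert(capacity != 0 && (capacity & (capacity - 1)) == 0);
    sketch_ring r = { static_cast<uint8_t *>(buffer), capacity, 0, 0 };
    return r;
}

// Producer side: all-or-nothing, returns n on success and 0 when there is no room.
static uint32_t sketch_write(sketch_ring *r, const void *src, uint32_t n) {
    uint32_t w  = __atomic_load_n(&r->Write, __ATOMIC_RELAXED);  // own cursor
    uint32_t rd = __atomic_load_n(&r->Read, __ATOMIC_ACQUIRE);   // peer cursor
    if (n > r->Capacity - (w - rd)) return 0;                    // not enough free space
    uint32_t at    = w & (r->Capacity - 1);                      // mask wrap
    uint32_t first = (n < r->Capacity - at) ? n : r->Capacity - at;
    memcpy(r->Base + at, src, first);                            // up to end of storage
    memcpy(r->Base, static_cast<const uint8_t *>(src) + first, n - first);
    __atomic_store_n(&r->Write, w + n, __ATOMIC_RELEASE);        // publish after payload
    return n;
}

// Consumer side: all-or-nothing, returns n on success and 0 when fewer bytes are pending.
static uint32_t sketch_read(sketch_ring *r, void *dst, uint32_t n) {
    uint32_t rd = __atomic_load_n(&r->Read, __ATOMIC_RELAXED);   // own cursor
    uint32_t w  = __atomic_load_n(&r->Write, __ATOMIC_ACQUIRE);  // peer cursor
    if (n > w - rd) return 0;                                    // not enough data
    uint32_t at    = rd & (r->Capacity - 1);
    uint32_t first = (n < r->Capacity - at) ? n : r->Capacity - at;
    memcpy(dst, r->Base + at, first);
    memcpy(static_cast<uint8_t *>(dst) + first, r->Base, n - first);
    __atomic_store_n(&r->Read, rd + n, __ATOMIC_RELEASE);        // release the space
    return n;
}
```

Each function loads both cursors but stores only its own, so neither side ever needs a compare-and-swap.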

The producer and consumer cursors are placed on separate cache lines, sized from the detected CPU, so the two threads never share a line and the only cross-core traffic is the cursor hand-off itself. The acquire/release pairing makes the ring correct on a weakly-ordered multicore part: the producer publishes the payload before the cursor, the consumer observes the cursor before the payload, and neither can see half-written data. This is validated end-to-end on a dual-core ESP32-WROOM (see Performance below).
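One way to keep two cursors on separate cache lines is alignas padding. The sketch below assumes a 64-byte line and uses hypothetical names (SKETCH_CACHE_LINE, sketch_cursors, sketch_same_line); the header's own layout and its NC_CPU_CACHE_LINE_SIZE detection may differ.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CACHE_LINE 64   // assumption for this sketch; the header detects its own value

// Hypothetical layout: each cursor starts its own cache line.
struct sketch_cursors {
    alignas(SKETCH_CACHE_LINE) uint32_t Write;  // producer's line
    alignas(SKETCH_CACHE_LINE) uint32_t Read;   // consumer's line
};

static_assert(offsetof(sketch_cursors, Read) - offsetof(sketch_cursors, Write) >= SKETCH_CACHE_LINE,
              "cursors must not share a cache line");

// Returns true when the two addresses fall on the same cache line.
static bool sketch_same_line(const void *a, const void *b) {
    return (reinterpret_cast<uintptr_t>(a) / SKETCH_CACHE_LINE) ==
           (reinterpret_cast<uintptr_t>(b) / SKETCH_CACHE_LINE);
}
```

The cost is memory: this struct occupies two full lines for eight bytes of state, which is the usual trade against false sharing.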

The library has no platform dependencies. With NCRING_NO_STDLIB it drops every standard library include and uses internal MemSet / MemCpy, so it builds in a freestanding environment.

Warning

Defining NCRING_NO_STDLIB currently selects the fallback MemCpy, which copies one byte at a time and severely limits throughput (see Performance). If NCRING_NO_STDLIB is necessary, provide a better implementation for your architecture through NCRING_MEMCPY.

Installation

Copy nc_ring.h into your project.

In one C++ source file, define NCRING_IMPLEMENTATION before including the header:

#define NCRING_IMPLEMENTATION
#include "nc_ring.h"

All other files that need the API include the header without the define.

Tip

To confine all symbols to a single translation unit, NCRING_STATIC can be defined alongside NCRING_IMPLEMENTATION.

The atomics use the GCC/Clang __atomic builtins; the header errors on any other compiler.

Overrides

The following macros can be defined before including the header to replace default dependencies:

| Macro | Default | Purpose |
| --- | --- | --- |
| NCRING_NO_STDLIB | Undefined | Suppresses <stdint.h> / <string.h>; caller must provide u8, u16, u32, u64, i8, i16, i32, i64, b32, b8, f32, f64 typedefs |
| NCRING_STATIC | Undefined | With NCRING_IMPLEMENTATION, gives every symbol internal linkage (NCRING_DEF becomes static) |
| NCRING_MEMCPY(d, s, n) | memcpy | Payload copy. This sits on the read and write hot path, so point it at an optimised word-copy, not a byte loop (see Performance) |
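For a freestanding build, a word-at-a-time copy is one possible NCRING_MEMCPY override. The sketch below is an assumption-laden illustration: sketch_word_copy and sketch_word are hypothetical names, it relies on the GCC/Clang may_alias attribute, it assumes 32-bit words are the natural unit for the target, and it uses <stdint.h> only for brevity (a NCRING_NO_STDLIB build would substitute its own typedefs).

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper, not part of nc_ring.h.
// may_alias lets the word type read and write arbitrary bytes without
// violating strict aliasing (GCC/Clang extension).
typedef uint32_t __attribute__((may_alias)) sketch_word;

static void *sketch_word_copy(void *dst, const void *src, size_t n) {
    uint8_t *d = static_cast<uint8_t *>(dst);
    const uint8_t *s = static_cast<const uint8_t *>(src);

    // Word access is only possible when both pointers share the same
    // offset within a 4-byte word.
    if (((reinterpret_cast<uintptr_t>(d) ^ reinterpret_cast<uintptr_t>(s)) & 3u) == 0) {
        // Byte-copy the unaligned head.
        while (n != 0 && (reinterpret_cast<uintptr_t>(d) & 3u) != 0) {
            *d++ = *s++;
            --n;
        }
        // Bulk of the payload, one 32-bit word at a time.
        while (n >= 4) {
            *reinterpret_cast<sketch_word *>(d) = *reinterpret_cast<const sketch_word *>(s);
            d += 4;
            s += 4;
            n -= 4;
        }
    }
    // Tail bytes, or the whole copy when the pointers cannot be co-aligned.
    while (n != 0) {
        *d++ = *s++;
        --n;
    }
    return dst;
}

// Must be defined before nc_ring.h is included.
#define NCRING_MEMCPY(d, s, n) sketch_word_copy((d), (s), (n))
```

A vendor ROM routine or a hand-tuned assembly copy would normally beat this; the point is only that the hot path should not fall back to a byte loop.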

NC_CPU_CACHE_LINE_SIZE is detected from the target (32 on Xtensa and 32-bit ARM, 64 on 64-bit ARM/RISC-V/x86) and sets the cursor padding. To override it, change the value by hand in the header.

Usage

Initialisation

Bind the ring to a buffer whose size is a power of two — the wrap is a mask, so a non-power-of-two capacity is rejected by an assert. One thread writes, one thread reads; the ring carries no protection against a second producer or consumer.

u8 buffer[4096];   // power of two

nc_ring ring = nc_ring_init(buffer, sizeof(buffer));
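The power-of-two requirement and the mask wrap can be illustrated with two small helpers. Both names (sketch_is_pow2, sketch_index) are hypothetical and shown only to make the rule concrete; the header's own assert may be written differently.

```cpp
#include <assert.h>
#include <stdint.h>

// True for 1, 2, 4, 8, ...; false for 0 and for anything with more than one bit set.
static bool sketch_is_pow2(uint32_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}

// Maps a free-running cursor to a byte offset in the buffer.
// Valid only when capacity is a power of two.
static uint32_t sketch_index(uint32_t cursor, uint32_t capacity) {
    assert(sketch_is_pow2(capacity));
    return cursor & (capacity - 1);
}
```

With a 4096-byte buffer the mask is 0xFFF, so a cursor value of 4101 lands at offset 5.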

Writing

nc_ring_write is the producer side. It is all-or-nothing: it copies the whole record and returns its size, or copies nothing and returns 0 when there isn't room. It never tears a record. Spin or retry on 0.

// ring full — consumer hasn't caught up
while (!nc_ring_write(&ring, &record, sizeof(record)))
    CPUPause();

nc_ring_write_struct(&ring, &record);   // sizeof(*ptr) wrapper

Reading

nc_ring_read is the consumer side, symmetric: it delivers the whole record and returns its size, or returns 0 when fewer than Size bytes are available.

// ring empty — producer hasn't published
while (!nc_ring_read(&ring, &record, sizeof(record)))
    CPUPause();

nc_ring_read_struct(&ring, &record);    // sizeof(*ptr) wrapper
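Where an unbounded spin is not acceptable, the retry can be capped. The helper below is a hypothetical sketch (sketch_retry is not part of the library); it takes any callable that returns true on success, so it can wrap either a write or a read attempt.

```cpp
#include <assert.h>
#include <stdint.h>

// Calls try_once up to max_spins times and reports whether any attempt succeeded.
template <typename TryOnce>
static bool sketch_retry(TryOnce try_once, uint32_t max_spins) {
    for (uint32_t i = 0; i < max_spins; ++i) {
        if (try_once())
            return true;
        // In a real loop, CPUPause() would go here between attempts.
    }
    return false;   // peer did not make progress in time
}
```

A call site might look like `sketch_retry([&] { return nc_ring_write_struct(&ring, &record) != 0; }, 1000)`, with the caller deciding what to do with the record when it returns false.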

Tip

A CPUPause implementation is provided that covers most common CPU architectures. If yours is missing, add it by hand in nc_ring.h.

Threading contract

Exactly one thread may call nc_ring_write and exactly one (other) thread may call nc_ring_read. The two cursors are published with release and observed with acquire, so the producer and consumer may run on different cores of a weakly-ordered part with no further synchronisation.
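The release/acquire pairing in the contract can be seen in isolation in the hosted sketch below. It is not the library's code: the names are hypothetical, it uses std::thread rather than RTOS tasks, and the "cursor" is just a count of filled slots with no wrap.

```cpp
#include <assert.h>
#include <stdint.h>
#include <thread>

static uint32_t g_slots[1024];   // payload storage
static uint32_t g_published;     // cursor: how many slots the producer has filled

static void sketch_producer() {
    for (uint32_t i = 0; i < 1024; ++i) {
        g_slots[i] = i * 3u + 1u;                                 // payload first
        __atomic_store_n(&g_published, i + 1, __ATOMIC_RELEASE);  // then the cursor
    }
}

static uint32_t sketch_consumer() {
    uint32_t errors = 0;
    for (uint32_t i = 0; i < 1024; ++i) {
        // Cursor first: wait until slot i has been published.
        while (__atomic_load_n(&g_published, __ATOMIC_ACQUIRE) <= i) { }
        // Then the payload: the acquire load above makes the write visible.
        if (g_slots[i] != i * 3u + 1u)
            ++errors;
    }
    return errors;
}

// Runs one producer and one consumer to completion and returns the error count.
static uint32_t sketch_run() {
    __atomic_store_n(&g_published, 0u, __ATOMIC_RELAXED);
    uint32_t errors = 0;
    std::thread prod(sketch_producer);
    std::thread cons([&] { errors = sketch_consumer(); });
    prod.join();
    cons.join();
    return errors;
}
```

Weakening either side to a relaxed access would make the payload read a data race: the consumer could see the new cursor value before the slot contents.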

Performance

Measured on a dual-core ESP32-WROOM (Xtensa LX6, 240 MHz) with the producer pinned to APP_CPU and the consumer to PRO_CPU, one million records per run, and NCRING_MEMCPY mapped to the toolchain memcpy. Every run finished with errors: 0, which is the cross-core acquire/release pairing doing its job: on this part a missing barrier corrupts records or stalls the run, it does not pass.

| Record size | Per record | End-to-end throughput |
| --- | --- | --- |
| 8 B | 1.38 µs | ~5.6 MB/s |
| 512 B | 5.40 µs | ~92 MB/s |
| 2048 B | 19.8 µs | ~99 MB/s |

The cost is linear in record size and fits t(n) ≈ 0.87 µs + 9 ns·n, which separates into two regimes around a ~100-byte knee:

The 0.87 µs intercept is the fixed per-operation cost — the cross-core cursor hand-off — independent of payload. Below the knee the ring is sync-bound and the copy is essentially free; this is the regime small control-plane records live in, and it is where the ring is at its best.

The 9 ns/byte slope is the copy. Above the knee the ring is copy-bound and the throughput asymptotes toward internal-RAM bandwidth (~108 MB/s). nc_ring copies every payload byte twice — once in, once out — so end-to-end throughput tops out near half the raw memcpy bandwidth; for bulk movement, pass a pointer through the ring instead of the bytes.
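As a worked check of the two regimes, using only the fitted model: the knee is the record size at which the copy term equals the fixed term, and the large-record limit is the reciprocal of the slope. The symbols t_0, c and n^* are introduced here for the intercept, the slope and the knee.

```latex
\begin{align*}
t(n) &\approx t_0 + c\,n, \qquad t_0 = 0.87\ \mu\text{s}, \quad c = 9\ \text{ns/B} \\
n^{*} &= \frac{t_0}{c} = \frac{0.87\ \mu\text{s}}{9\ \text{ns/B}} \approx 97\ \text{B} \\
\lim_{n \to \infty} \frac{n}{t(n)} &= \frac{1}{c} \approx 111\ \text{MB/s}
\end{align*}
```

With the slope rounded to 9 ns/B the limit comes out near 111 MB/s, which agrees with the ~108 MB/s figure above to within the rounding of the fit.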

The single biggest lever on the slope is NCRING_MEMCPY. The NCRING_NO_STDLIB fallback copies a byte at a time and caps large-record throughput roughly 7× below an optimised word-copy. If you go freestanding, override NCRING_MEMCPY with a real memcpy.

Examples

Single-thread round-trip

Push a few records, then drain them. With one writer and one reader on the same thread the ordering is trivial.

struct rec { 
    u32 Seq; 
    u32 Val; 
};

u8 buffer[64];   // power of two; holds 8 records
nc_ring ring = nc_ring_init(buffer, sizeof(buffer));

for (u32 i = 0; i < 4; ++i) {
    rec r = { 
        i, 
        i * 10 
    };

    nc_ring_write_struct(&ring, &r);
}

rec out;

while (nc_ring_read_struct(&ring, &out))
    printf("seq=%u val=%u\n", (unsigned)out.Seq, (unsigned)out.Val);   // cast: u32 is not unsigned int on every target

Output:

seq=0 val=0
seq=1 val=10
seq=2 val=20
seq=3 val=30

Dual-core ESP32 hand-off

The real workload: a producer task on one core streams self-describing records, a consumer task on the other verifies them. Each record carries a scrambled check of its sequence number, so a dropped, reordered, duplicated, or torn record fails the check — the test that actually exercises the cross-core barriers.

#define NCRING_IMPLEMENTATION
#include "nc_ring.h"

#include <stdio.h>   // printf

#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "freertos/semphr.h"

#define RING_BYTES 4096u
#define RECORDS    1000000u

struct rec { u32 Seq; u32 Check; };

static inline u32 scramble(u32 s) { 
    return (s * 2654435761u) ^ 0xA5A5A5A5u; 
}

static u8                g_buf[RING_BYTES];
static nc_ring           g_ring;
static SemaphoreHandle_t g_done;
static volatile u32      g_errors;

static void 
producer(void*)
{
    for (u32 seq = 0; seq < RECORDS; ++seq) {
        rec r = { 
            seq, 
            scramble(seq) 
        };

        while (!nc_ring_write_struct(&g_ring, &r))
            CPUPause();
    }

    vTaskDelete(NULL);
}

static void 
consumer(void*)
{
    u32 errors = 0;

    for (u32 expected = 0; expected < RECORDS; ++expected) {
        rec r = {};

        while (!nc_ring_read_struct(&g_ring, &r))
            CPUPause();

        if (r.Seq != expected || r.Check != scramble(expected))
            ++errors;
    }

    g_errors = errors;
    xSemaphoreGive(g_done);
    vTaskDelete(NULL);
}

extern "C" 
void 
app_main(void)
{
    g_ring = nc_ring_init(g_buf, RING_BYTES);
    g_done = xSemaphoreCreateBinary();

    // Keep app_main above the workers while both are created, so neither
    // preempts setup before the other exists, then block to hand them the cores.
    vTaskPrioritySet(NULL, configMAX_PRIORITIES - 1);
    xTaskCreatePinnedToCore(producer, "prod", 4096, NULL, 5, NULL, 1);
    xTaskCreatePinnedToCore(consumer, "cons", 4096, NULL, 5, NULL, 0);
    xSemaphoreTake(g_done, portMAX_DELAY);
    printf("%s\n", g_errors ? "FAIL" : "PASS");
}

Output:

PASS

Limitations

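| Constraint | Value | Defined by |
| --- | --- | --- |
| Buffer size | under 4 GiB (largest power of two that fits is 2 GiB) | u32 Capacity |
| Capacity | power of two | mask-based wrap + free-running cursors |
| Concurrency | 1 producer, 1 consumer | SPSC; the lock-free guarantee relies on one writer per cursor |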

Single-producer/single-consumer only. Exactly one thread may write and one other thread may read; a second writer or reader races with no protection, because the lock-free guarantee relies on each cursor having a single writer. For many-to-one or many-to-many, put a lock in front or use a different structure.

Capacity must be a power of two. The wrap is a bitmask and the full-versus-empty distinction depends on the capacity dividing 2³², so a non-power-of-two size is rejected at init.
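The dependence on the capacity dividing 2³² shows up when a 32-bit cursor overflows. The sketch below uses hypothetical helpers and a general modulo (instead of the mask) so that a non-power-of-two capacity can be tried for comparison.

```cpp
#include <assert.h>
#include <stdint.h>

// Slot for a free-running 32-bit cursor. Modulo is used here, not a mask,
// so that non-power-of-two capacities can be evaluated too.
static uint32_t sketch_slot(uint32_t cursor, uint32_t capacity) {
    return cursor % capacity;
}

// True when stepping the cursor across the 2^32 overflow lands on the slot
// that directly follows the previous one.
static bool sketch_wrap_is_continuous(uint32_t capacity) {
    uint32_t before = 0xFFFFFFFFu;
    uint32_t after  = before + 1u;   // wraps to 0
    return sketch_slot(after, capacity) == (sketch_slot(before, capacity) + 1u) % capacity;
}

// Bytes pending between two free-running cursors; unsigned subtraction
// stays correct even after the write cursor has overflowed.
static uint32_t sketch_used(uint32_t write, uint32_t read) {
    return write - read;
}
```

For a capacity of 8 the slot sequence runs 7, 0 across the overflow. For a capacity of 6 it runs 3, 0, skipping slots 4 and 5, so the data would land in the wrong place.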

Writes and reads are all-or-nothing. A call moves the whole Size or none of it; there is no partial transfer and no occupancy query. For streaming semantics — "give me whatever is there" — build a partial-count variant on top of the cursors.
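A partial-count read could look roughly like the sketch below. It runs against a hypothetical stand-in struct, since the real field names are not shown here; sketch_ring, sketch_available and sketch_read_some are illustrative names, and both functions are meant for the consumer thread only.

```cpp
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Hypothetical stand-in with the same shape as a free-running-cursor ring.
struct sketch_ring {
    uint8_t *Base;
    uint32_t Capacity;  // power of two
    uint32_t Write;     // stored only by the producer
    uint32_t Read;      // stored only by the consumer
};

// Occupancy query: bytes currently readable.
static uint32_t sketch_available(sketch_ring *r) {
    uint32_t rd = __atomic_load_n(&r->Read, __ATOMIC_RELAXED);
    return __atomic_load_n(&r->Write, __ATOMIC_ACQUIRE) - rd;
}

// Reads up to max bytes and returns how many were delivered (possibly 0).
static uint32_t sketch_read_some(sketch_ring *r, void *dst, uint32_t max) {
    uint32_t rd    = __atomic_load_n(&r->Read, __ATOMIC_RELAXED);
    uint32_t avail = __atomic_load_n(&r->Write, __ATOMIC_ACQUIRE) - rd;
    uint32_t n     = (avail < max) ? avail : max;                 // take what is there
    uint32_t at    = rd & (r->Capacity - 1);
    uint32_t first = (n < r->Capacity - at) ? n : r->Capacity - at;
    memcpy(dst, r->Base + at, first);                             // up to end of storage
    memcpy(static_cast<uint8_t *>(dst) + first, r->Base, n - first);
    __atomic_store_n(&r->Read, rd + n, __ATOMIC_RELEASE);         // hand the space back
    return n;
}
```

Mixing this with fixed-size records needs care: a partial read can stop in the middle of a record, so it suits byte streams better than framed data.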

A byte ring copies each payload byte twice. End-to-end throughput is therefore bounded by half the memcpy bandwidth (see Performance); move pointers, not bytes, when the payload is large.

GCC and Clang only. The atomics are the __atomic builtins and the header errors on other compilers.
