diff --git a/artifacts/README.md b/artifacts/README.md
new file mode 100644
index 000000000..1a2f14d40
--- /dev/null
+++ b/artifacts/README.md
@@ -0,0 +1,95 @@
+# Kahan vs naive accumulation in `cpu_statevec_anyCtrlAnyTargDenseMatr_sub()`
+
+Benchmark + analysis for issue #598 / PR #784. Measures the runtime and accuracy
+effect of compensated (Kahan) summation vs the original uncompensated accumulation
+in the CPU dense-matrix inner-product loop, across the three QuEST `qcomp`
+precisions, on a single CPU.
+
+## Files
+
+| file | purpose |
+| --- | --- |
+| `bench_kahan.cpp` | Standalone harness replicating the exact accumulation kernel (naive, Kahan, and a genuine-quad `__float128` reference) for fp1/fp2/fp4. |
+| `plot_kahan.py` | Generates the three figures from `results.csv`. |
+| `e2e_compmatr.cpp` | End-to-end check against the real built QuEST library (Kahan build vs `QUEST_DENSE_ACCUM_NAIVE` build) using the recipe from issue #598. |
+| `results.csv` | Raw benchmark data. |
+| `kahan_accuracy.png`, `kahan_runtime.png`, `kahan_costbenefit.png` | Figures. |
+
+## How the function was made selectable
+
+`cpu_statevec_anyCtrlAnyTargDenseMatr_sub()` now wraps the accumulation in a
+compile-time switch. The default is compensated (Kahan) summation (preserving the
+PR behaviour and its regression test); defining `QUEST_DENSE_ACCUM_NAIVE` reverts to
+the original `sum += elem * cache[j]` for apples-to-apples benchmarking. Example:
+
+```bash
+cmake -S . -B build -D QUEST_FLOAT_PRECISION=2 -D QUEST_ENABLE_OMP=OFF -D QUEST_ENABLE_MPI=OFF
+# naive variant for benchmarking:
+cmake -S . -B build_naive ... -D CMAKE_CXX_FLAGS="-DQUEST_DENSE_ACCUM_NAIVE"
+```
+
+## Reproduce
+
+```bash
+# benchmark (GCC needed for __float128 genuine-quad reference)
+g++-15 -O3 -std=c++17 bench_kahan.cpp -o bench_kahan -lquadmath
+./bench_kahan > results.csv 2> host_info.txt
+python3 plot_kahan.py
+```
+
+## The three-precision reality on this host (IMPORTANT)
+
+On this Apple-Silicon macOS host, **both AppleClang and Homebrew GCC report
+`sizeof(long double) == 8` with a 53-bit mantissa** — i.e. `long double` is just
+`double`. QuEST's fp4 build (which uses `long double`) is therefore **NOT a genuine
+quadruple-precision build here**, and its accuracy is identical to fp2. This is
+visible in `results.csv`: every fp4 error equals the corresponding fp2 error exactly.
+
+Consequently:
+- The fp1/fp2/fp4 **runtime** numbers are all valid (they are what those builds
+  actually do on this machine).
+- For the **accuracy ground truth** we do NOT use `long double`. We use GCC's
+  `__float128` (113-bit mantissa, genuine quad, via quadmath) as the reference and
+  measure each precision's error against it.
+
+## Key results (worst-case abs error vs 113-bit `__float128` reference)
+
+Adversarial ill-conditioned matrix (entries spanning 1e-6 .. 1e6). `ms` is per
+single-CPU `applyCompMatr`; lower is better.
+
+| precision | targets | err naive | err Kahan | err reduction | ms naive | ms Kahan | slowdown |
+| --- | ---: | --- | --- | ---: | ---: | ---: | ---: |
+| fp1 | 8  | 1.57e5 | 4.90e4 | 3.2x | 0.038 | 0.141 | 3.7x |
+| fp1 | 14 | 9.54e6 | 3.21e5 | 30x  | 184   | 611   | 3.3x |
+| fp2 | 8  | 3.49e-4 | 5.28e-5 | 6.6x | 0.032 | 0.133 | 4.2x |
+| fp2 | 14 | 3.29e-2 | 6.54e-4 | 50x  | 175   | 608   | 3.5x |
+| fp4 | 14 | 3.29e-2 | 6.54e-4 | 50x  | 173   | 608   | 3.5x |  *(== fp2; not true quad)*
+
+## Conclusion
+
+- **Accuracy benefit grows with the number of targets**, exactly as the issue
+  anticipated: negligible at 2–4 targets (no cancellation yet), then 1–2 orders of
+  magnitude error reduction by 12–14 targets, for both single and double precision.
+- **Cost is roughly constant at ~3.3–4.2x slowdown** at every non-trivial size *in
+  this in-cache micro-benchmark*. Here the matrix+cache stay resident across `reps`,
+  so the loop is compute-bound (4 complex ops per term vs 1) and the overhead does not
+  amortise — it is *not* "time-free". Caveat: the real per-apply cost at large target
+  counts (e.g. a 16384x16384 fp2 matrix is ~4 GB) is memory-bandwidth bound, so the
+  *relative* Kahan cost in production may be smaller than this micro-benchmark shows;
+  treat 3.3–4.2x as a standalone-kernel upper bound, not a real-library figure.
+- **End-to-end** the modified function turns a fully cancelled `amp0 = 0` (naive) into
+  `amp0 = 4096` (Kahan) for a 12-target adversarial `CompMatr` whose exact answer is
+  `numRows-2 = 4094` — i.e. naive loses the result entirely while Kahan recovers it to
+  within +2 (a residual O(1) error after subtracting 1e18-scale terms).
+
+So Kahan is worthwhile for large or ill-conditioned dense matrices where the
+accuracy matters, but a blanket 3–4x slowdown is too costly for small/well-conditioned
+matrices. A size threshold (e.g. enable Kahan only above ~6–8 targets) is the natural
+compromise; the `QUEST_DENSE_ACCUM_NAIVE` toggle is the simplest hook for that.
+
+Note on scope of the benefit: the 30x/50x error reductions above are for a
+deliberately adversarial, non-unitary matrix (entries spanning 1e-6 .. 1e6 with
+cancelling large pairs). Real `CompMatr` inputs are typically unitary with O(1)
+entries, where cancellation — and hence the benefit — is far milder at any target
+count. The threshold argument is therefore a worst-case bound, not the expected
+benefit on well-conditioned operators.
diff --git a/artifacts/bench_kahan.cpp b/artifacts/bench_kahan.cpp
new file mode 100644
index 000000000..fbaf0ce8a
--- /dev/null
+++ b/artifacts/bench_kahan.cpp
@@ -0,0 +1,221 @@
+/*
+ * Cost/benefit benchmark for compensated (Kahan) vs naive accumulation in the
+ * dense-matrix inner-product kernel of cpu_statevec_anyCtrlAnyTargDenseMatr_sub().
+ *
+ * This harness replicates the accumulation performed by that QuEST function:
+ * each output amplitude is  sum_j  matr[k][j] * cache[j], where j ranges over the
+ * 2^numTargets matrix columns. We reproduce that serial inner loop once with naive
+ * summation and once with complex Kahan summation. NOTE: the harness uses
+ * std::complex<Real> rather than QuEST's base_qcomp type; the two are arithmetically
+ * identical here (both perform plain componentwise IEEE real/imaginary arithmetic, so
+ * complex Kahan degenerates to two independent real Kahan sums), but this is not
+ * literally QuEST's struct. The real-library accuracy claims (regression test, e2e
+ * driver) exercise the actual base_qcomp type. Precisions:
+ *
+ *     fp1  ->  std::complex<float>       (24-bit mantissa)
+ *     fp2  ->  std::complex<double>      (53-bit mantissa)
+ *     fp4  ->  std::complex<long double> (QuEST's "quad" precision)
+ *
+ * IMPORTANT HONESTY NOTE on fp4:
+ *   On this Apple-Silicon macOS host, BOTH AppleClang and Homebrew GCC report
+ *   sizeof(long double)==8 with a 53-bit mantissa -- i.e. long double IS double.
+ *   Therefore QuEST's fp4 build is NOT a genuine quadruple-precision build here,
+ *   and fp4 accuracy is expected to equal fp2. We report fp4 anyway (it is what a
+ *   user compiling QUEST_FLOAT_PRECISION=4 on this machine actually gets) but flag
+ *   it as not-truly-quad.
+ *
+ *   For the ACCURACY GROUND-TRUTH we therefore do NOT use long double. We use GCC's
+ *   __float128 (113-bit mantissa, genuine quad) via quadmath as the reference, and
+ *   measure each precision's error against that genuinely-higher-precision result.
+ *
+ * Build (GCC required for __float128 / quadmath):
+ *   g++-15 -O3 -std=c++17 bench_kahan.cpp -o bench_kahan -lquadmath
+ */
+
+#include <complex>
+#include <vector>
+#include <random>
+#include <chrono>
+#include <cstdio>
+#include <cmath>
+#include <quadmath.h>
+
+using clk = std::chrono::steady_clock;
+
+// ---- genuine-quad complex (reference) built on __float128 -------------------
+struct qcomp128 {
+    __float128 re, im;
+};
+static inline qcomp128 operator*(qcomp128 a, qcomp128 b) {
+    return { a.re*b.re - a.im*b.im, a.re*b.im + a.im*b.re };
+}
+static inline qcomp128 operator+(qcomp128 a, qcomp128 b) {
+    return { a.re+b.re, a.im+b.im };
+}
+
+// naive accumulation, templated on the working complex type T (== std::complex<...>)
+template <typename T, typename Elem, typename Cache>
+T accumulate_naive(const Elem& row, const Cache& cache, long dim) {
+    T sum(0, 0);
+    for (long j = 0; j < dim; j++)
+        sum += row[j] * cache[j];
+    return sum;
+}
+
+// compensated (Kahan) accumulation -- mirrors the QuEST function body exactly
+template <typename T, typename Elem, typename Cache>
+T accumulate_kahan(const Elem& row, const Cache& cache, long dim) {
+    T sum(0, 0);
+    T compensation(0, 0);
+    for (long j = 0; j < dim; j++) {
+        T product     = row[j] * cache[j];
+        T corrected   = product - compensation;
+        T next        = sum + corrected;
+        compensation  = (next - sum) - corrected;
+        sum           = next;
+    }
+    return sum;
+}
+
+// reference accumulation in genuine quad (113-bit) -- naive is fine since the
+// reference type so vastly out-precisions the candidates that its own rounding
+// is negligible relative to fp1/fp2 error.
+qcomp128 accumulate_ref(const std::vector<qcomp128>& row, const std::vector<qcomp128>& cache, long dim) {
+    qcomp128 sum{0, 0};
+    for (long j = 0; j < dim; j++)
+        sum = sum + (row[j] * cache[j]);
+    return sum;
+}
+
+template <typename Real>
+struct Result {
+    double abs_err_naive;   // |naive   - ref|
+    double abs_err_kahan;   // |kahan   - ref|
+    double ms_naive;        // ms per full apply (all output amps)
+    double ms_kahan;
+};
+
+// Run one (precision, numTargets) cell. We build a random unitary-ish dense matrix
+// and a random input vector, in genuine quad, then DOWN-CAST to the working precision
+// so that naive/kahan see identically-rounded inputs and differ only in the sum.
+template <typename Real>
+Result<Real> run_cell(int numTargets, int reps, unsigned seed) {
+    using T = std::complex<Real>;
+    const long dim = 1L << numTargets;          // matrix is dim x dim, vector length dim
+
+    std::mt19937_64 rng(seed);
+    std::uniform_real_distribution<double> uni(-1.0, 1.0);
+
+    // reference-precision matrix (row-major) and input cache
+    std::vector<qcomp128> matr128((size_t)dim * dim);
+    std::vector<qcomp128> cache128(dim);
+
+    // To exercise cancellation realistically, scale entries across many magnitudes:
+    // mix O(1) terms with a few very large +/- pairs that must cancel.
+    for (long i = 0; i < dim; i++) {
+        double mag = std::pow(10.0, uni(rng) * 6.0);   // 1e-6 .. 1e6
+        cache128[i] = { (__float128)(uni(rng) * mag), (__float128)(uni(rng) * mag) };
+    }
+    for (long r = 0; r < dim; r++)
+        for (long c = 0; c < dim; c++) {
+            double mag = std::pow(10.0, uni(rng) * 6.0);
+            matr128[(size_t)r*dim + c] = { (__float128)(uni(rng) * mag), (__float128)(uni(rng) * mag) };
+        }
+
+    // down-cast to working precision
+    std::vector<T> matr((size_t)dim * dim);
+    std::vector<T> cache(dim);
+    for (size_t i = 0; i < matr128.size(); i++)
+        matr[i] = T((Real)(double)matr128[i].re, (Real)(double)matr128[i].im);
+    for (long i = 0; i < dim; i++)
+        cache[i] = T((Real)(double)cache128[i].re, (Real)(double)cache128[i].im);
+
+    // genuine-quad reference for every output amp
+    std::vector<qcomp128> ref(dim);
+    {
+        std::vector<qcomp128> rowbuf(dim);
+        for (long k = 0; k < dim; k++) {
+            for (long j = 0; j < dim; j++)
+                rowbuf[j] = matr128[(size_t)k*dim + j];
+            ref[k] = accumulate_ref(rowbuf, cache128, dim);
+        }
+    }
+
+    // helper to view row k of the working-precision matrix
+    auto rowptr = [&](long k) { return &matr[(size_t)k*dim]; };
+
+    // accuracy: worst-case (max over output amps) absolute error vs quad reference
+    double err_naive = 0, err_kahan = 0;
+    for (long k = 0; k < dim; k++) {
+        T sn = accumulate_naive<T>(rowptr(k), cache, dim);
+        T sk = accumulate_kahan<T>(rowptr(k), cache, dim);
+        double dn = std::hypot((double)((__float128)(Real)sn.real() - ref[k].re),
+                               (double)((__float128)(Real)sn.imag() - ref[k].im));
+        double dk = std::hypot((double)((__float128)(Real)sk.real() - ref[k].re),
+                               (double)((__float128)(Real)sk.imag() - ref[k].im));
+        if (dn > err_naive) err_naive = dn;
+        if (dk > err_kahan) err_kahan = dk;
+    }
+
+    // runtime: full apply = compute all dim output amps; repeat for stable timing
+    volatile double sink = 0;
+    auto time_variant = [&](bool kahan) {
+        // warm up
+        for (long k = 0; k < dim; k++) {
+            T s = kahan ? accumulate_kahan<T>(rowptr(k), cache, dim)
+                        : accumulate_naive<T>(rowptr(k), cache, dim);
+            sink += (double)s.real();
+        }
+        auto t0 = clk::now();
+        for (int rep = 0; rep < reps; rep++)
+            for (long k = 0; k < dim; k++) {
+                T s = kahan ? accumulate_kahan<T>(rowptr(k), cache, dim)
+                            : accumulate_naive<T>(rowptr(k), cache, dim);
+                sink += (double)s.real();
+            }
+        auto t1 = clk::now();
+        double total_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
+        return total_ms / reps;   // ms per full apply
+    };
+
+    Result<Real> res;
+    res.abs_err_naive = err_naive;
+    res.abs_err_kahan = err_kahan;
+    res.ms_naive = time_variant(false);
+    res.ms_kahan = time_variant(true);
+    (void)sink;
+    return res;
+}
+
+template <typename Real>
+void run_precision(const char* label, const int* targs, int ntargs) {
+    for (int t = 0; t < ntargs; t++) {
+        int nt = targs[t];
+        // more reps for tiny sizes to get stable timing; fewer for big sizes
+        long dim = 1L << nt;
+        int reps = (int)std::max(1L, (long)(2'000'000 / (dim * dim)));
+        Result<Real> r = run_cell<Real>(nt, reps, 0xC0FFEEu + nt);
+        // CSV: precision,numTargets,dim,reps,err_naive,err_kahan,ms_naive,ms_kahan
+        printf("%s,%d,%ld,%d,%.6e,%.6e,%.6e,%.6e\n",
+               label, nt, dim, reps,
+               r.abs_err_naive, r.abs_err_kahan, r.ms_naive, r.ms_kahan);
+        fflush(stdout);
+    }
+}
+
+int main() {
+    // report what long double actually is on this host, for the record
+    fprintf(stderr, "host: sizeof(float)=%zu sizeof(double)=%zu sizeof(long double)=%zu "
+                    "ld_mant=%d  sizeof(__float128)=%zu f128_mant=%d\n",
+            sizeof(float), sizeof(double), sizeof(long double),
+            (int)__LDBL_MANT_DIG__, sizeof(__float128), (int)FLT128_MANT_DIG);
+
+    const int targs[] = {2, 4, 6, 8, 10, 12, 13, 14};
+    const int ntargs = sizeof(targs)/sizeof(targs[0]);
+
+    printf("precision,numTargets,dim,reps,err_naive,err_kahan,ms_naive,ms_kahan\n");
+    run_precision<float>       ("fp1", targs, ntargs);  // single
+    run_precision<double>      ("fp2", targs, ntargs);  // double
+    run_precision<long double> ("fp4", targs, ntargs);  // QuEST "quad" = double on this host
+    return 0;
+}
diff --git a/artifacts/e2e_compmatr.cpp b/artifacts/e2e_compmatr.cpp
new file mode 100644
index 000000000..0682f762f
--- /dev/null
+++ b/artifacts/e2e_compmatr.cpp
@@ -0,0 +1,39 @@
+// End-to-end check against the REAL QuEST library (the actual modified function),
+// adapted from the recipe in issue #598. Applies a large all-ones-ish CompMatr to a
+// fixed random state and reports amp[0] at high precision. Linked once against the
+// default (Kahan) build and once against the QUEST_DENSE_ACCUM_NAIVE build to show
+// the modified function's accuracy difference end-to-end.
+#include "quest.h"
+#include <cstdio>
+#include <vector>
+
+int main() {
+    initQuESTEnv();
+    int numTargets = 12;
+    int targets[] = {0,1,2,3,4,5,6,7,8,9,10,11};
+
+    CompMatr matrix = createCompMatr(numTargets);
+    // adversarial row 0: large +/- pair that cancels, plus many O(1) terms.
+    for (qindex i=0; i<matrix.numRows; i++)
+        for (qindex j=0; j<matrix.numRows; j++)
+            matrix.cpuElems[i][j] = 1;
+    matrix.cpuElems[0][0] = 1e18;
+    matrix.cpuElems[0][matrix.numRows-1] = -1e18;
+    syncCompMatr(matrix);
+
+    Qureg qureg = createQureg(numTargets);
+    // fixed deterministic state: all amps = 1 (unnormalised; validation off)
+    std::vector<qcomp> ones(qureg.numAmps, qcomp(1,0));
+    setQuregAmps(qureg, 0, ones);
+
+    setQuESTValidationEpsilon(0);
+    applyCompMatr(qureg, targets, numTargets, matrix);
+    qcomp amp = getQuregAmp(qureg, 0);
+    // expected exact amp[0] = (1e18) + (numRows-2)*1 + (-1e18) = numRows-2
+    printf("amp0.real = %.20g   (exact = %lld)\n", (double)real(amp), (long long)matrix.numRows-2);
+
+    destroyCompMatr(matrix);
+    destroyQureg(qureg);
+    finalizeQuESTEnv();
+    return 0;
+}
diff --git a/artifacts/host_info.txt b/artifacts/host_info.txt
new file mode 100644
index 000000000..54aaa4fbf
--- /dev/null
+++ b/artifacts/host_info.txt
@@ -0,0 +1 @@
+host: sizeof(float)=4 sizeof(double)=8 sizeof(long double)=8 ld_mant=53  sizeof(__float128)=16 f128_mant=113
diff --git a/artifacts/kahan_accuracy.png b/artifacts/kahan_accuracy.png
new file mode 100644
index 000000000..1e2a4cf50
Binary files /dev/null and b/artifacts/kahan_accuracy.png differ
diff --git a/artifacts/kahan_costbenefit.png b/artifacts/kahan_costbenefit.png
new file mode 100644
index 000000000..8d5e145bb
Binary files /dev/null and b/artifacts/kahan_costbenefit.png differ
diff --git a/artifacts/kahan_runtime.png b/artifacts/kahan_runtime.png
new file mode 100644
index 000000000..94d7800b2
Binary files /dev/null and b/artifacts/kahan_runtime.png differ
diff --git a/artifacts/plot_kahan.py b/artifacts/plot_kahan.py
new file mode 100644
index 000000000..ce23ee39a
--- /dev/null
+++ b/artifacts/plot_kahan.py
@@ -0,0 +1,100 @@
+#!/usr/bin/env python3
+"""
+Visualise the cost/benefit of compensated (Kahan) vs naive accumulation in
+cpu_statevec_anyCtrlAnyTargDenseMatr_sub(), per qcomp precision.
+
+Input : results.csv (produced by bench_kahan)
+Output: kahan_accuracy.png, kahan_runtime.png, kahan_costbenefit.png
+"""
+import csv
+import matplotlib
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt
+
+rows = []
+with open("results.csv") as f:
+    for r in csv.DictReader(f):
+        rows.append(r)
+
+precs = ["fp1", "fp2", "fp4"]
+labels = {
+    "fp1": "fp1  (float, 24-bit)",
+    "fp2": "fp2  (double, 53-bit)",
+    "fp4": "fp4* (long double = double here, 53-bit)",
+}
+colors = {"fp1": "#d62728", "fp2": "#1f77b4", "fp4": "#2ca02c"}
+
+def series(p, key):
+    xs, ys = [], []
+    for r in rows:
+        if r["precision"] == p:
+            xs.append(int(r["numTargets"]))
+            ys.append(float(r[key]))
+    return xs, ys
+
+# ---------------- accuracy ----------------
+fig, ax = plt.subplots(figsize=(8, 5.5))
+for p in precs:
+    x, yn = series(p, "err_naive")
+    _, yk = series(p, "err_kahan")
+    ax.plot(x, yn, "--o", color=colors[p], label=f"{labels[p]} - naive", alpha=0.9)
+    ax.plot(x, yk, "-s", color=colors[p], label=f"{labels[p]} - Kahan", alpha=0.9)
+ax.set_yscale("log")
+ax.set_xlabel("number of target qubits  (matrix is 2^n x 2^n)")
+ax.set_ylabel("worst-case abs error vs __float128 (113-bit) reference")
+ax.set_title("Accuracy: Kahan vs naive dense-matrix accumulation\n"
+             "(adversarial ill-conditioned matrix, single-CPU)")
+ax.grid(True, which="both", alpha=0.3)
+ax.legend(fontsize=7, loc="upper left")
+fig.tight_layout()
+fig.savefig("kahan_accuracy.png", dpi=130)
+
+# ---------------- runtime ----------------
+fig, ax = plt.subplots(figsize=(8, 5.5))
+for p in precs:
+    x, tn = series(p, "ms_naive")
+    _, tk = series(p, "ms_kahan")
+    ax.plot(x, tn, "--o", color=colors[p], label=f"{labels[p]} - naive")
+    ax.plot(x, tk, "-s", color=colors[p], label=f"{labels[p]} - Kahan")
+ax.set_yscale("log")
+ax.set_xlabel("number of target qubits  (matrix is 2^n x 2^n)")
+ax.set_ylabel("runtime per applyCompMatr  [ms]  (single CPU, -O3)")
+ax.set_title("Runtime cost: Kahan vs naive dense-matrix accumulation")
+ax.grid(True, which="both", alpha=0.3)
+ax.legend(fontsize=7, loc="upper left")
+fig.tight_layout()
+fig.savefig("kahan_runtime.png", dpi=130)
+
+# ---------------- combined cost/benefit ----------------
+fig, (a1, a2) = plt.subplots(1, 2, figsize=(13, 5.5))
+for p in precs:
+    x, yn = series(p, "err_naive")
+    _, yk = series(p, "err_kahan")
+    # accuracy benefit = error reduction factor (naive/kahan)
+    benefit = [(n / k) if k > 0 else 1.0 for n, k in zip(yn, yk)]
+    a1.plot(x, benefit, "-o", color=colors[p], label=labels[p])
+a1.axhline(1.0, color="k", lw=0.8, ls=":")
+a1.set_yscale("log")
+a1.set_xlabel("number of target qubits")
+a1.set_ylabel("accuracy BENEFIT  =  err_naive / err_kahan  (>1 = Kahan better)")
+a1.set_title("Benefit: error-reduction factor from Kahan")
+a1.grid(True, which="both", alpha=0.3)
+a1.legend(fontsize=8)
+
+for p in precs:
+    x, tn = series(p, "ms_naive")
+    _, tk = series(p, "ms_kahan")
+    cost = [k / n if n > 0 else 1.0 for n, k in zip(tn, tk)]
+    a2.plot(x, cost, "-s", color=colors[p], label=labels[p])
+a2.axhline(1.0, color="k", lw=0.8, ls=":")
+a2.set_xlabel("number of target qubits")
+a2.set_ylabel("runtime COST  =  ms_kahan / ms_naive  (>1 = Kahan slower)")
+a2.set_title("Cost: runtime slowdown from Kahan")
+a2.grid(True, alpha=0.3)
+a2.legend(fontsize=8)
+fig.suptitle("Cost / Benefit of Kahan summation in cpu_statevec_anyCtrlAnyTargDenseMatr_sub()",
+             fontsize=12)
+fig.tight_layout(rect=[0, 0, 1, 0.96])
+fig.savefig("kahan_costbenefit.png", dpi=130)
+
+print("wrote kahan_accuracy.png, kahan_runtime.png, kahan_costbenefit.png")
diff --git a/artifacts/results.csv b/artifacts/results.csv
new file mode 100644
index 000000000..0e9c9c580
--- /dev/null
+++ b/artifacts/results.csv
@@ -0,0 +1,25 @@
+precision,numTargets,dim,reps,err_naive,err_kahan,ms_naive,ms_kahan
+fp1,2,4,125000,5.772414e-01,5.772414e-01,8.144000e-06,1.000000e-05
+fp1,4,16,7812,5.412803e+02,3.793770e+02,1.137993e-04,2.629288e-04
+fp1,6,64,488,6.479917e+04,4.669267e+04,2.247951e-03,7.772541e-03
+fp1,8,256,30,1.572572e+05,4.900399e+04,3.790000e-02,1.410667e-01
+fp1,10,1024,1,5.970935e+05,1.327381e+05,6.660000e-01,2.438000e+00
+fp1,12,4096,1,2.296930e+06,2.200285e+05,1.123600e+01,3.754900e+01
+fp1,13,8192,1,4.791550e+06,2.534412e+05,4.654500e+01,1.447900e+02
+fp1,14,16384,1,9.542947e+06,3.213748e+05,1.843500e+02,6.110560e+02
+fp2,2,4,125000,5.681711e-10,5.681711e-10,6.760000e-06,9.752000e-06
+fp2,4,16,7812,1.122881e-06,1.122881e-06,1.017665e-04,1.899642e-04
+fp2,6,64,488,1.455275e-04,8.359876e-05,1.639344e-03,5.977459e-03
+fp2,8,256,30,3.492193e-04,5.282792e-05,3.206667e-02,1.330667e-01
+fp2,10,1024,1,1.960401e-03,1.953018e-04,6.750000e-01,2.257000e+00
+fp2,12,4096,1,1.055728e-02,3.008565e-04,1.093100e+01,3.787300e+01
+fp2,13,8192,1,2.110199e-02,4.374301e-04,4.323500e+01,1.475820e+02
+fp2,14,16384,1,3.292284e-02,6.538236e-04,1.751520e+02,6.084400e+02
+fp4,2,4,125000,5.681711e-10,5.681711e-10,7.256000e-06,9.960000e-06
+fp4,4,16,7812,1.122881e-06,1.122881e-06,1.086790e-04,1.980287e-04
+fp4,6,64,488,1.455275e-04,8.359876e-05,1.657787e-03,6.247951e-03
+fp4,8,256,30,3.492193e-04,5.282792e-05,3.270000e-02,1.344667e-01
+fp4,10,1024,1,1.960401e-03,1.953018e-04,6.720000e-01,2.146000e+00
+fp4,12,4096,1,1.055728e-02,3.008565e-04,1.061000e+01,3.800100e+01
+fp4,13,8192,1,2.110199e-02,4.374301e-04,4.505800e+01,1.536850e+02
+fp4,14,16384,1,3.292284e-02,6.538236e-04,1.731930e+02,6.080500e+02
diff --git a/quest/src/cpu/cpu_subroutines.cpp b/quest/src/cpu/cpu_subroutines.cpp
index 59df946e9..a52131f0d 100644
--- a/quest/src/cpu/cpu_subroutines.cpp
+++ b/quest/src/cpu/cpu_subroutines.cpp
@@ -607,8 +607,17 @@ void cpu_statevec_anyCtrlAnyTargDenseMatr_sub(Qureg qureg, ConstList64 ctrls, Co
 
                 // i = nth local index where ctrls are active and targs form value k
                 qindex i = setBits(i0, targs.data(), numTargBits, k); // loop may be unrolled
-                amps[i] = getCpuQcomp(0, 0);
-            
+                cpu_qcomp sum = getCpuQcomp(0, 0);
+
+                // the inner-product accumulation below uses compensated (Kahan) summation to
+                // reduce catastrophic cancellation when combining the (up to 2^16) cache amps.
+                // The running compensation 'c' tracks the low-order bits lost by 'sum += ...'.
+                // Defining QUEST_DENSE_ACCUM_NAIVE reverts to the original uncompensated sum,
+                // enabling apples-to-apples runtime/accuracy benchmarking of the two variants.
+                #ifndef QUEST_DENSE_ACCUM_NAIVE
+                cpu_qcomp compensation = getCpuQcomp(0, 0);
+                #endif
+
                 // loop may be unrolled
                 for (qindex j=0; j<numTargAmps; j++) {
 
@@ -624,18 +633,22 @@ void cpu_statevec_anyCtrlAnyTargDenseMatr_sub(Qureg qureg, ConstList64 ctrls, Co
                     if constexpr (ApplyConj)
                         elem = conj(elem);
 
-                    amps[i] += elem * cache[j];
-
-                    /// @todo
-                    /// qureg.cpuAmps[i] is being serially updated by only this thread,
-                    /// so is a candidate for Kahan summation for improved numerical
-                    /// stability. Explore whether this is time-free and worthwhile!
-                    ///
-                    /// BEWARE that Kahan summation may be incompatible with
-                    /// the commutator tricks used in base_qcomp's (ancestor
-                    /// of cpu_qcomp) arithmetic operator overloads. Check
-                    /// base_qcomp.hpp before implementing compensation.
+                    cpu_qcomp product = elem * cache[j];
+
+                    #ifdef QUEST_DENSE_ACCUM_NAIVE
+                    // original uncompensated accumulation
+                    sum += product;
+                    #else
+                    // compensated (Kahan) accumulation; note base_qcomp's +/- are plain
+                    // componentwise real/imaginary arithmetic, so compensation is well-defined
+                    cpu_qcomp corrected = product - compensation;
+                    cpu_qcomp next = sum + corrected;
+                    compensation = (next - sum) - corrected;
+                    sum = next;
+                    #endif
                 }
+
+                amps[i] = sum;
             }
         }
     }
diff --git a/tests/unit/operations.cpp b/tests/unit/operations.cpp
index 80b75b9c2..bfed20d20 100644
--- a/tests/unit/operations.cpp
+++ b/tests/unit/operations.cpp
@@ -1328,6 +1328,41 @@ TEST_ALL_CTRL_OPERATIONS( S,        one, none, FixedMatrices::S );
 TEST_ALL_CTRL_OPERATIONS( Swap,     two, none, FixedMatrices::SWAP );
 TEST_ALL_CTRL_OPERATIONS( SqrtSwap, two, none, FixedMatrices::sqrtSWAP );
 TEST_ALL_CTRL_OPERATIONS( RotateX, one, scalar, ParameterisedMatrices::Rx );
+
+TEST_CASE( "applyCompMatr uses compensated accumulation", TEST_CATEGORY_OPS ) {
+
+    int numQubits = 4;
+    qindex numAmps = getPow2(numQubits);
+
+    Qureg qureg = createQureg(numQubits);
+    qvector amps(numAmps, 1);
+    setQuregAmps(qureg, 0, amps.data(), numAmps);
+
+    qmatrix elems = getZeroMatrix(numAmps);
+    qreal largeVal = 1e16;
+    #if (QUEST_FLOAT_PRECISION == 1)
+    largeVal = 2e7;
+    #endif
+
+    elems[0][0] = largeVal;
+    elems[0][numAmps - 1] = -largeVal;
+
+    for (qindex i=1; i<numAmps-1; i++)
+        elems[0][i] = 1;
+
+    CompMatr matr = createInlineCompMatr(numQubits, elems);
+    int targets[] = {0, 1, 2, 3};
+
+    setQuESTValidationEpsilon(0);
+    applyCompMatr(qureg, targets, numQubits, matr);
+    setQuESTValidationEpsilonToDefault();
+
+    REQUIRE( real(getQuregAmp(qureg, 0)) == 14 );
+
+    destroyCompMatr(matr);
+    destroyQureg(qureg);
+}
+
 TEST_ALL_CTRL_OPERATIONS( RotateY, one, scalar, ParameterisedMatrices::Ry );
 TEST_ALL_CTRL_OPERATIONS( RotateZ, one, scalar, ParameterisedMatrices::Rz );
 TEST_ALL_CTRL_OPERATIONS( RotateAroundAxis, one, axisrots, nullptr );