QuEST-Kit · agentlaunchops-ai · Jun 7, 2026 · Jun 7, 2026 · Jun 9, 2026 · Jun 9, 2026
diff --git a/artifacts/README.md b/artifacts/README.md
@@ -0,0 +1,95 @@
+# Kahan vs naive accumulation in `cpu_statevec_anyCtrlAnyTargDenseMatr_sub()`
+
+Benchmark + analysis for issue #598 / PR #784. Measures the runtime and accuracy
+effect of compensated (Kahan) summation vs the original uncompensated accumulation
+in the CPU dense-matrix inner-product loop, across the three QuEST `qcomp`
+precisions, on a single CPU.
+
+## Files
+
+| file | purpose |
+| --- | --- |
+| `bench_kahan.cpp` | Standalone harness replicating the exact accumulation kernel (naive, Kahan, and a genuine-quad `__float128` reference) for fp1/fp2/fp4. |
+| `plot_kahan.py` | Generates the three figures from `results.csv`. |
+| `e2e_compmatr.cpp` | End-to-end check against the real built QuEST library (Kahan build vs `QUEST_DENSE_ACCUM_NAIVE` build) using the recipe from issue #598. |
+| `results.csv` | Raw benchmark data. |
+| `kahan_accuracy.png`, `kahan_runtime.png`, `kahan_costbenefit.png` | Figures. |
+
+## How the function was made selectable
+
+`cpu_statevec_anyCtrlAnyTargDenseMatr_sub()` now wraps the accumulation in a
+compile-time switch. The default is compensated (Kahan) summation (preserving the
+PR behaviour and its regression test); defining `QUEST_DENSE_ACCUM_NAIVE` reverts to
+the original `sum += elem * cache[j]` for apples-to-apples benchmarking. Example:
+
+```bash
+cmake -S . -B build -D QUEST_FLOAT_PRECISION=2 -D QUEST_ENABLE_OMP=OFF -D QUEST_ENABLE_MPI=OFF
+# naive variant for benchmarking:
+cmake -S . -B build_naive ... -D CMAKE_CXX_FLAGS="-DQUEST_DENSE_ACCUM_NAIVE"
+```
+
+## Reproduce
+
+```bash
+# benchmark (GCC needed for __float128 genuine-quad reference)
+g++-15 -O3 -std=c++17 bench_kahan.cpp -o bench_kahan -lquadmath
+./bench_kahan > results.csv 2> host_info.txt
+python3 plot_kahan.py
+```
+
+## The three-precision reality on this host (IMPORTANT)
+
+On this Apple-Silicon macOS host, **both AppleClang and Homebrew GCC report
+`sizeof(long double) == 8` with a 53-bit mantissa** — i.e. `long double` is just
+`double`. QuEST's fp4 build (which uses `long double`) is therefore **NOT a genuine
+quadruple-precision build here**, and its accuracy is identical to fp2. This is
+visible in `results.csv`: every fp4 error equals the corresponding fp2 error exactly.
+
+Consequently:
+- The fp1/fp2/fp4 **runtime** numbers are all valid (they are what those builds
+  actually do on this machine).
+- For the **accuracy ground truth** we do NOT use `long double`. We use GCC's
+  `__float128` (113-bit mantissa, genuine quad, via quadmath) as the reference and
+  measure each precision's error against it.
+
+## Key results (worst-case abs error vs 113-bit `__float128` reference)
+
+Adversarial ill-conditioned matrix (entries spanning 1e-6 .. 1e6). `ms` is per
+single-CPU `applyCompMatr`; lower is better.
+
+| precision | targets | err naive | err Kahan | err reduction | ms naive | ms Kahan | slowdown |
+| --- | ---: | --- | --- | ---: | ---: | ---: | ---: |
+| fp1 | 8  | 1.57e5 | 4.90e4 | 3.2x | 0.038 | 0.141 | 3.7x |
+| fp1 | 14 | 9.54e6 | 3.21e5 | 30x  | 184   | 611   | 3.3x |
+| fp2 | 8  | 3.49e-4 | 5.28e-5 | 6.6x | 0.032 | 0.133 | 4.2x |
+| fp2 | 14 | 3.29e-2 | 6.54e-4 | 50x  | 175   | 608   | 3.5x |
+| fp4 | 14 | 3.29e-2 | 6.54e-4 | 50x  | 173   | 608   | 3.5x |  *(== fp2; not true quad)*
+
+## Conclusion
+
+- **Accuracy benefit grows with the number of targets**, exactly as the issue
+  anticipated: negligible at 2–4 targets (no cancellation yet), then 1–2 orders of
+  magnitude error reduction by 12–14 targets, for both single and double precision.
+- **Cost is roughly constant at ~3.3–4.2x slowdown** at every non-trivial size *in
+  this in-cache micro-benchmark*. Here the matrix+cache stay resident across `reps`,
+  so the loop is compute-bound (4 complex ops per term vs 1) and the overhead does not
+  amortise — it is *not* "time-free". Caveat: the real per-apply cost at large target
+  counts (e.g. a 16384x16384 fp2 matrix is ~4 GB) is memory-bandwidth bound, so the
+  *relative* Kahan cost in production may be smaller than this micro-benchmark shows;
+  treat 3.3–4.2x as a standalone-kernel upper bound, not a real-library figure.
+- **End-to-end** the modified function turns a fully cancelled `amp0 = 0` (naive) into
+  `amp0 = 4096` (Kahan) for a 12-target adversarial `CompMatr` whose exact answer is
+  `numRows-2 = 4094` — i.e. naive loses the result entirely while Kahan recovers it to
+  within +2 (a residual O(1) error after subtracting 1e18-scale terms).
+
+So Kahan is worthwhile for large or ill-conditioned dense matrices where the
+accuracy matters, but a blanket 3–4x slowdown is too costly for small/well-conditioned
+matrices. A size threshold (e.g. enable Kahan only above ~6–8 targets) is the natural
+compromise; the `QUEST_DENSE_ACCUM_NAIVE` toggle is the simplest hook for that.
+
+Note on scope of the benefit: the 30x/50x error reductions above are for a
+deliberately adversarial, non-unitary matrix (entries spanning 1e-6 .. 1e6 with
+cancelling large pairs). Real `CompMatr` inputs are typically unitary with O(1)
+entries, where cancellation — and hence the benefit — is far milder at any target
+count. The threshold argument is therefore a worst-case bound, not the expected
+benefit on well-conditioned operators.
diff --git a/artifacts/bench_kahan.cpp b/artifacts/bench_kahan.cpp
@@ -0,0 +1,221 @@
+/*
+ * Cost/benefit benchmark for compensated (Kahan) vs naive accumulation in the
+ * dense-matrix inner-product kernel of cpu_statevec_anyCtrlAnyTargDenseMatr_sub().
+ *
+ * This harness replicates the accumulation performed by that QuEST function:
+ * each output amplitude is  sum_j  matr[k][j] * cache[j], where j ranges over the
+ * 2^numTargets matrix columns. We reproduce that serial inner loop once with naive
+ * summation and once with complex Kahan summation. NOTE: the harness uses
+ * std::complex<Real> rather than QuEST's base_qcomp type; the two are arithmetically
+ * identical here (both perform plain componentwise IEEE real/imaginary arithmetic, so
+ * complex Kahan degenerates to two independent real Kahan sums), but this is not
+ * literally QuEST's struct. The real-library accuracy claims (regression test, e2e
+ * driver) exercise the actual base_qcomp type. Precisions:
+ *
+ *     fp1  ->  std::complex<float>       (24-bit mantissa)
+ *     fp2  ->  std::complex<double>      (53-bit mantissa)
+ *     fp4  ->  std::complex<long double> (QuEST's "quad" precision)
+ *
+ * IMPORTANT HONESTY NOTE on fp4:
+ *   On this Apple-Silicon macOS host, BOTH AppleClang and Homebrew GCC report
+ *   sizeof(long double)==8 with a 53-bit mantissa -- i.e. long double IS double.
+ *   Therefore QuEST's fp4 build is NOT a genuine quadruple-precision build here,
+ *   and fp4 accuracy is expected to equal fp2. We report fp4 anyway (it is what a
+ *   user compiling QUEST_FLOAT_PRECISION=4 on this machine actually gets) but flag
+ *   it as not-truly-quad.
+ *
+ *   For the ACCURACY GROUND-TRUTH we therefore do NOT use long double. We use GCC's
+ *   __float128 (113-bit mantissa, genuine quad) via quadmath as the reference, and
+ *   measure each precision's error against that genuinely-higher-precision result.
+ *
+ * Build (GCC required for __float128 / quadmath):
+ *   g++-15 -O3 -std=c++17 bench_kahan.cpp -o bench_kahan -lquadmath
+ */
+
+#include <complex>
+#include <vector>
+#include <random>
+#include <chrono>
+#include <cstdio>
+#include <cmath>
+#include <quadmath.h>
+
+using clk = std::chrono::steady_clock;
+
+// ---- genuine-quad complex (reference) built on __float128 -------------------
+struct qcomp128 {
+    __float128 re, im;
+};
+static inline qcomp128 operator*(qcomp128 a, qcomp128 b) {
+    return { a.re*b.re - a.im*b.im, a.re*b.im + a.im*b.re };
+}
+static inline qcomp128 operator+(qcomp128 a, qcomp128 b) {
+    return { a.re+b.re, a.im+b.im };
+}
+
+// naive accumulation, templated on the working complex type T (== std::complex<...>)
+template <typename T, typename Elem, typename Cache>
+T accumulate_naive(const Elem& row, const Cache& cache, long dim) {
+    T sum(0, 0);
+    for (long j = 0; j < dim; j++)
+        sum += row[j] * cache[j];
+    return sum;
+}
+
+// compensated (Kahan) accumulation -- mirrors the QuEST function body exactly
+template <typename T, typename Elem, typename Cache>
+T accumulate_kahan(const Elem& row, const Cache& cache, long dim) {
+    T sum(0, 0);
+    T compensation(0, 0);
+    for (long j = 0; j < dim; j++) {
+        T product     = row[j] * cache[j];
+        T corrected   = product - compensation;
+        T next        = sum + corrected;
+        compensation  = (next - sum) - corrected;
+        sum           = next;
+    }
+    return sum;
+}
+
+// reference accumulation in genuine quad (113-bit) -- naive is fine since the
+// reference type so vastly out-precisions the candidates that its own rounding
+// is negligible relative to fp1/fp2 error.
+qcomp128 accumulate_ref(const std::vector<qcomp128>& row, const std::vector<qcomp128>& cache, long dim) {
+    qcomp128 sum{0, 0};
+    for (long j = 0; j < dim; j++)
+        sum = sum + (row[j] * cache[j]);
+    return sum;
+}
+
+template <typename Real>
+struct Result {
+    double abs_err_naive;   // |naive   - ref|
+    double abs_err_kahan;   // |kahan   - ref|
+    double ms_naive;        // ms per full apply (all output amps)
+    double ms_kahan;
+};
+
+// Run one (precision, numTargets) cell. We build a random unitary-ish dense matrix
+// and a random input vector, in genuine quad, then DOWN-CAST to the working precision
+// so that naive/kahan see identically-rounded inputs and differ only in the sum.
+template <typename Real>
+Result<Real> run_cell(int numTargets, int reps, unsigned seed) {
+    using T = std::complex<Real>;
+    const long dim = 1L << numTargets;          // matrix is dim x dim, vector length dim
+
+    std::mt19937_64 rng(seed);
+    std::uniform_real_distribution<double> uni(-1.0, 1.0);
+
+    // reference-precision matrix (row-major) and input cache
+    std::vector<qcomp128> matr128((size_t)dim * dim);
+    std::vector<qcomp128> cache128(dim);
+
+    // To exercise cancellation realistically, scale entries across many magnitudes:
+    // mix O(1) terms with a few very large +/- pairs that must cancel.
+    for (long i = 0; i < dim; i++) {
+        double mag = std::pow(10.0, uni(rng) * 6.0);   // 1e-6 .. 1e6
+        cache128[i] = { (__float128)(uni(rng) * mag), (__float128)(uni(rng) * mag) };
+    }
+    for (long r = 0; r < dim; r++)
+        for (long c = 0; c < dim; c++) {
+            double mag = std::pow(10.0, uni(rng) * 6.0);
+            matr128[(size_t)r*dim + c] = { (__float128)(uni(rng) * mag), (__float128)(uni(rng) * mag) };
+        }
+
+    // down-cast to working precision
+    std::vector<T> matr((size_t)dim * dim);
+    std::vector<T> cache(dim);
+    for (size_t i = 0; i < matr128.size(); i++)
+        matr[i] = T((Real)(double)matr128[i].re, (Real)(double)matr128[i].im);
+    for (long i = 0; i < dim; i++)
+        cache[i] = T((Real)(double)cache128[i].re, (Real)(double)cache128[i].im);
+
+    // genuine-quad reference for every output amp
+    std::vector<qcomp128> ref(dim);
+    {
+        std::vector<qcomp128> rowbuf(dim);
+        for (long k = 0; k < dim; k++) {
+            for (long j = 0; j < dim; j++)
+                rowbuf[j] = matr128[(size_t)k*dim + j];
+            ref[k] = accumulate_ref(rowbuf, cache128, dim);
+        }
+    }
+
+    // helper to view row k of the working-precision matrix
+    auto rowptr = [&](long k) { return &matr[(size_t)k*dim]; };
+
+    // accuracy: worst-case (max over output amps) absolute error vs quad reference
+    double err_naive = 0, err_kahan = 0;
+    for (long k = 0; k < dim; k++) {
+        T sn = accumulate_naive<T>(rowptr(k), cache, dim);
+        T sk = accumulate_kahan<T>(rowptr(k), cache, dim);
+        double dn = std::hypot((double)((__float128)(Real)sn.real() - ref[k].re),
+                               (double)((__float128)(Real)sn.imag() - ref[k].im));
+        double dk = std::hypot((double)((__float128)(Real)sk.real() - ref[k].re),
+                               (double)((__float128)(Real)sk.imag() - ref[k].im));
+        if (dn > err_naive) err_naive = dn;
+        if (dk > err_kahan) err_kahan = dk;
+    }
+
+    // runtime: full apply = compute all dim output amps; repeat for stable timing
+    volatile double sink = 0;
+    auto time_variant = [&](bool kahan) {
+        // warm up
+        for (long k = 0; k < dim; k++) {
+            T s = kahan ? accumulate_kahan<T>(rowptr(k), cache, dim)
+                        : accumulate_naive<T>(rowptr(k), cache, dim);
+            sink += (double)s.real();
+        }
+        auto t0 = clk::now();
+        for (int rep = 0; rep < reps; rep++)
+            for (long k = 0; k < dim; k++) {
+                T s = kahan ? accumulate_kahan<T>(rowptr(k), cache, dim)
+                            : accumulate_naive<T>(rowptr(k), cache, dim);
+                sink += (double)s.real();
+            }
+        auto t1 = clk::now();
+        double total_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
+        return total_ms / reps;   // ms per full apply
+    };
+
+    Result<Real> res;
+    res.abs_err_naive = err_naive;
+    res.abs_err_kahan = err_kahan;
+    res.ms_naive = time_variant(false);
+    res.ms_kahan = time_variant(true);
+    (void)sink;
+    return res;
+}
+
+template <typename Real>
+void run_precision(const char* label, const int* targs, int ntargs) {
+    for (int t = 0; t < ntargs; t++) {
+        int nt = targs[t];
+        // more reps for tiny sizes to get stable timing; fewer for big sizes
+        long dim = 1L << nt;
+        int reps = (int)std::max(1L, (long)(2'000'000 / (dim * dim)));
+        Result<Real> r = run_cell<Real>(nt, reps, 0xC0FFEEu + nt);
+        // CSV: precision,numTargets,dim,reps,err_naive,err_kahan,ms_naive,ms_kahan
+        printf("%s,%d,%ld,%d,%.6e,%.6e,%.6e,%.6e\n",
+               label, nt, dim, reps,
+               r.abs_err_naive, r.abs_err_kahan, r.ms_naive, r.ms_kahan);
+        fflush(stdout);
+    }
+}
+
+int main() {
+    // report what long double actually is on this host, for the record
+    fprintf(stderr, "host: sizeof(float)=%zu sizeof(double)=%zu sizeof(long double)=%zu "
+                    "ld_mant=%d  sizeof(__float128)=%zu f128_mant=%d\n",
+            sizeof(float), sizeof(double), sizeof(long double),
+            (int)__LDBL_MANT_DIG__, sizeof(__float128), (int)FLT128_MANT_DIG);
+
+    const int targs[] = {2, 4, 6, 8, 10, 12, 13, 14};
+    const int ntargs = sizeof(targs)/sizeof(targs[0]);
+
+    printf("precision,numTargets,dim,reps,err_naive,err_kahan,ms_naive,ms_kahan\n");
+    run_precision<float>       ("fp1", targs, ntargs);  // single
+    run_precision<double>      ("fp2", targs, ntargs);  // double
+    run_precision<long double> ("fp4", targs, ntargs);  // QuEST "quad" = double on this host
+    return 0;
+}
diff --git a/artifacts/e2e_compmatr.cpp b/artifacts/e2e_compmatr.cpp
@@ -0,0 +1,39 @@
+// End-to-end check against the REAL QuEST library (the actual modified function),
+// adapted from the recipe in issue #598. Applies a large all-ones-ish CompMatr to a
+// fixed random state and reports amp[0] at high precision. Linked once against the
+// default (Kahan) build and once against the QUEST_DENSE_ACCUM_NAIVE build to show
+// the modified function's accuracy difference end-to-end.
+#include "quest.h"
+#include <cstdio>
+#include <vector>
+
+int main() {
+    initQuESTEnv();
+    int numTargets = 12;
+    int targets[] = {0,1,2,3,4,5,6,7,8,9,10,11};
+
+    CompMatr matrix = createCompMatr(numTargets);
+    // adversarial row 0: large +/- pair that cancels, plus many O(1) terms.
+    for (qindex i=0; i<matrix.numRows; i++)
+        for (qindex j=0; j<matrix.numRows; j++)
+            matrix.cpuElems[i][j] = 1;
+    matrix.cpuElems[0][0] = 1e18;
+    matrix.cpuElems[0][matrix.numRows-1] = -1e18;
+    syncCompMatr(matrix);
+
+    Qureg qureg = createQureg(numTargets);
+    // fixed deterministic state: all amps = 1 (unnormalised; validation off)
+    std::vector<qcomp> ones(qureg.numAmps, qcomp(1,0));
+    setQuregAmps(qureg, 0, ones);
+
+    setQuESTValidationEpsilon(0);
+    applyCompMatr(qureg, targets, numTargets, matrix);
+    qcomp amp = getQuregAmp(qureg, 0);
+    // expected exact amp[0] = (1e18) + (numRows-2)*1 + (-1e18) = numRows-2
+    printf("amp0.real = %.20g   (exact = %lld)\n", (double)real(amp), (long long)matrix.numRows-2);
+
+    destroyCompMatr(matrix);
+    destroyQureg(qureg);
+    finalizeQuESTEnv();
+    return 0;
+}
diff --git a/artifacts/host_info.txt b/artifacts/host_info.txt
@@ -0,0 +1 @@
+host: sizeof(float)=4 sizeof(double)=8 sizeof(long double)=8 ld_mant=53  sizeof(__float128)=16 f128_mant=113
diff --git a/artifacts/kahan_accuracy.png b/artifacts/kahan_accuracy.png
diff --git a/artifacts/kahan_costbenefit.png b/artifacts/kahan_costbenefit.png
diff --git a/artifacts/kahan_runtime.png b/artifacts/kahan_runtime.png
Original file line number	Diff line number	Diff line change
		@@ -0,0 +1 @@
		host: sizeof(float)=4 sizeof(double)=8 sizeof(long double)=8 ld_mant=53 sizeof(__float128)=16 f128_mant=113