Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
95 changes: 95 additions & 0 deletions artifacts/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,95 @@
# Kahan vs naive accumulation in `cpu_statevec_anyCtrlAnyTargDenseMatr_sub()`

Benchmark + analysis for issue #598 / PR #784. Measures the runtime and accuracy
effect of compensated (Kahan) summation vs the original uncompensated accumulation
in the CPU dense-matrix inner-product loop, across the three QuEST `qcomp`
precisions, on a single CPU.

## Files

| file | purpose |
| --- | --- |
| `bench_kahan.cpp` | Standalone harness replicating the exact accumulation kernel (naive, Kahan, and a genuine-quad `__float128` reference) for fp1/fp2/fp4. |
| `plot_kahan.py` | Generates the three figures from `results.csv`. |
| `e2e_compmatr.cpp` | End-to-end check against the real built QuEST library (Kahan build vs `QUEST_DENSE_ACCUM_NAIVE` build) using the recipe from issue #598. |
| `results.csv` | Raw benchmark data. |
| `kahan_accuracy.png`, `kahan_runtime.png`, `kahan_costbenefit.png` | Figures. |

## How the function was made selectable

`cpu_statevec_anyCtrlAnyTargDenseMatr_sub()` now wraps the accumulation in a
compile-time switch. The default is compensated (Kahan) summation (preserving the
PR behaviour and its regression test); defining `QUEST_DENSE_ACCUM_NAIVE` reverts to
the original `sum += elem * cache[j]` for apples-to-apples benchmarking. Example:

```bash
cmake -S . -B build -D QUEST_FLOAT_PRECISION=2 -D QUEST_ENABLE_OMP=OFF -D QUEST_ENABLE_MPI=OFF
# naive variant for benchmarking:
cmake -S . -B build_naive ... -D CMAKE_CXX_FLAGS="-DQUEST_DENSE_ACCUM_NAIVE"
```

## Reproduce

```bash
# benchmark (GCC needed for __float128 genuine-quad reference)
g++-15 -O3 -std=c++17 bench_kahan.cpp -o bench_kahan -lquadmath
./bench_kahan > results.csv 2> host_info.txt
python3 plot_kahan.py
```

## The three-precision reality on this host (IMPORTANT)

On this Apple-Silicon macOS host, **both AppleClang and Homebrew GCC report
`sizeof(long double) == 8` with a 53-bit mantissa** — i.e. `long double` is just
`double`. QuEST's fp4 build (which uses `long double`) is therefore **NOT a genuine
quadruple-precision build here**, and its accuracy is identical to fp2. This is
visible in `results.csv`: every fp4 error equals the corresponding fp2 error exactly.

Consequently:
- The fp1/fp2/fp4 **runtime** numbers are all valid (they are what those builds
actually do on this machine).
- For the **accuracy ground truth** we do NOT use `long double`. We use GCC's
`__float128` (113-bit mantissa, genuine quad, via quadmath) as the reference and
measure each precision's error against it.

## Key results (worst-case abs error vs 113-bit `__float128` reference)

Adversarial ill-conditioned matrix (entries spanning 1e-6 .. 1e6). `ms` is per
single-CPU `applyCompMatr`; lower is better.

| precision | targets | err naive | err Kahan | err reduction | ms naive | ms Kahan | slowdown |
| --- | ---: | --- | --- | ---: | ---: | ---: | ---: |
| fp1 | 8 | 1.57e5 | 4.90e4 | 3.2x | 0.038 | 0.141 | 3.7x |
| fp1 | 14 | 9.54e6 | 3.21e5 | 30x | 184 | 611 | 3.3x |
| fp2 | 8 | 3.49e-4 | 5.28e-5 | 6.6x | 0.032 | 0.133 | 4.2x |
| fp2 | 14 | 3.29e-2 | 6.54e-4 | 50x | 175 | 608 | 3.5x |
| fp4 | 14 | 3.29e-2 | 6.54e-4 | 50x | 173 | 608 | 3.5x | *(== fp2; not true quad)*

## Conclusion

- **Accuracy benefit grows with the number of targets**, exactly as the issue
anticipated: negligible at 2–4 targets (no cancellation yet), then 1–2 orders of
magnitude error reduction by 12–14 targets, for both single and double precision.
- **Cost is roughly constant at ~3.3–4.2x slowdown** at every non-trivial size *in
this in-cache micro-benchmark*. Here the matrix+cache stay resident across `reps`,
so the loop is compute-bound (4 complex ops per term vs 1) and the overhead does not
amortise — it is *not* "time-free". Caveat: the real per-apply cost at large target
counts (e.g. a 16384x16384 fp2 matrix is ~4 GB) is memory-bandwidth bound, so the
*relative* Kahan cost in production may be smaller than this micro-benchmark shows;
treat 3.3–4.2x as a standalone-kernel upper bound, not a real-library figure.
- **End-to-end** the modified function turns a fully cancelled `amp0 = 0` (naive) into
`amp0 = 4096` (Kahan) for a 12-target adversarial `CompMatr` whose exact answer is
`numRows-2 = 4094` — i.e. naive loses the result entirely while Kahan recovers it to
within +2 (a residual O(1) error after subtracting 1e18-scale terms).

So Kahan is worthwhile for large or ill-conditioned dense matrices where the
accuracy matters, but a blanket 3–4x slowdown is too costly for small/well-conditioned
matrices. A size threshold (e.g. enable Kahan only above ~6–8 targets) is the natural
compromise; the `QUEST_DENSE_ACCUM_NAIVE` toggle is the simplest hook for that.

Note on scope of the benefit: the 30x/50x error reductions above are for a
deliberately adversarial, non-unitary matrix (entries spanning 1e-6 .. 1e6 with
cancelling large pairs). Real `CompMatr` inputs are typically unitary with O(1)
entries, where cancellation — and hence the benefit — is far milder at any target
count. The threshold argument is therefore a worst-case bound, not the expected
benefit on well-conditioned operators.
221 changes: 221 additions & 0 deletions artifacts/bench_kahan.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,221 @@
/*
* Cost/benefit benchmark for compensated (Kahan) vs naive accumulation in the
* dense-matrix inner-product kernel of cpu_statevec_anyCtrlAnyTargDenseMatr_sub().
*
* This harness replicates the accumulation performed by that QuEST function:
* each output amplitude is sum_j matr[k][j] * cache[j], where j ranges over the
* 2^numTargets matrix columns. We reproduce that serial inner loop once with naive
* summation and once with complex Kahan summation. NOTE: the harness uses
* std::complex<Real> rather than QuEST's base_qcomp type; the two are arithmetically
* identical here (both perform plain componentwise IEEE real/imaginary arithmetic, so
* complex Kahan degenerates to two independent real Kahan sums), but this is not
* literally QuEST's struct. The real-library accuracy claims (regression test, e2e
* driver) exercise the actual base_qcomp type. Precisions:
*
* fp1 -> std::complex<float> (24-bit mantissa)
* fp2 -> std::complex<double> (53-bit mantissa)
* fp4 -> std::complex<long double> (QuEST's "quad" precision)
*
* IMPORTANT HONESTY NOTE on fp4:
* On this Apple-Silicon macOS host, BOTH AppleClang and Homebrew GCC report
* sizeof(long double)==8 with a 53-bit mantissa -- i.e. long double IS double.
* Therefore QuEST's fp4 build is NOT a genuine quadruple-precision build here,
* and fp4 accuracy is expected to equal fp2. We report fp4 anyway (it is what a
* user compiling QUEST_FLOAT_PRECISION=4 on this machine actually gets) but flag
* it as not-truly-quad.
*
* For the ACCURACY GROUND-TRUTH we therefore do NOT use long double. We use GCC's
* __float128 (113-bit mantissa, genuine quad) via quadmath as the reference, and
* measure each precision's error against that genuinely-higher-precision result.
*
* Build (GCC required for __float128 / quadmath):
* g++-15 -O3 -std=c++17 bench_kahan.cpp -o bench_kahan -lquadmath
*/

#include <complex>
#include <vector>
#include <random>
#include <chrono>
#include <cstdio>
#include <cmath>
#include <quadmath.h>

using clk = std::chrono::steady_clock;

// ---- genuine-quad complex (reference) built on __float128 -------------------
struct qcomp128 {
__float128 re, im;
};
static inline qcomp128 operator*(qcomp128 a, qcomp128 b) {
return { a.re*b.re - a.im*b.im, a.re*b.im + a.im*b.re };
}
static inline qcomp128 operator+(qcomp128 a, qcomp128 b) {
return { a.re+b.re, a.im+b.im };
}

// naive accumulation, templated on the working complex type T (== std::complex<...>)
template <typename T, typename Elem, typename Cache>
T accumulate_naive(const Elem& row, const Cache& cache, long dim) {
T sum(0, 0);
for (long j = 0; j < dim; j++)
sum += row[j] * cache[j];
return sum;
}

// compensated (Kahan) accumulation -- mirrors the QuEST function body exactly
template <typename T, typename Elem, typename Cache>
T accumulate_kahan(const Elem& row, const Cache& cache, long dim) {
T sum(0, 0);
T compensation(0, 0);
for (long j = 0; j < dim; j++) {
T product = row[j] * cache[j];
T corrected = product - compensation;
T next = sum + corrected;
compensation = (next - sum) - corrected;
sum = next;
}
return sum;
}

// reference accumulation in genuine quad (113-bit) -- naive is fine since the
// reference type so vastly out-precisions the candidates that its own rounding
// is negligible relative to fp1/fp2 error.
qcomp128 accumulate_ref(const std::vector<qcomp128>& row, const std::vector<qcomp128>& cache, long dim) {
qcomp128 sum{0, 0};
for (long j = 0; j < dim; j++)
sum = sum + (row[j] * cache[j]);
return sum;
}

template <typename Real>
struct Result {
double abs_err_naive; // |naive - ref|
double abs_err_kahan; // |kahan - ref|
double ms_naive; // ms per full apply (all output amps)
double ms_kahan;
};

// Run one (precision, numTargets) cell. We build a random unitary-ish dense matrix
// and a random input vector, in genuine quad, then DOWN-CAST to the working precision
// so that naive/kahan see identically-rounded inputs and differ only in the sum.
template <typename Real>
Result<Real> run_cell(int numTargets, int reps, unsigned seed) {
using T = std::complex<Real>;
const long dim = 1L << numTargets; // matrix is dim x dim, vector length dim

std::mt19937_64 rng(seed);
std::uniform_real_distribution<double> uni(-1.0, 1.0);

// reference-precision matrix (row-major) and input cache
std::vector<qcomp128> matr128((size_t)dim * dim);
std::vector<qcomp128> cache128(dim);

// To exercise cancellation realistically, scale entries across many magnitudes:
// mix O(1) terms with a few very large +/- pairs that must cancel.
for (long i = 0; i < dim; i++) {
double mag = std::pow(10.0, uni(rng) * 6.0); // 1e-6 .. 1e6
cache128[i] = { (__float128)(uni(rng) * mag), (__float128)(uni(rng) * mag) };
}
for (long r = 0; r < dim; r++)
for (long c = 0; c < dim; c++) {
double mag = std::pow(10.0, uni(rng) * 6.0);
matr128[(size_t)r*dim + c] = { (__float128)(uni(rng) * mag), (__float128)(uni(rng) * mag) };
}

// down-cast to working precision
std::vector<T> matr((size_t)dim * dim);
std::vector<T> cache(dim);
for (size_t i = 0; i < matr128.size(); i++)
matr[i] = T((Real)(double)matr128[i].re, (Real)(double)matr128[i].im);
for (long i = 0; i < dim; i++)
cache[i] = T((Real)(double)cache128[i].re, (Real)(double)cache128[i].im);

// genuine-quad reference for every output amp
std::vector<qcomp128> ref(dim);
{
std::vector<qcomp128> rowbuf(dim);
for (long k = 0; k < dim; k++) {
for (long j = 0; j < dim; j++)
rowbuf[j] = matr128[(size_t)k*dim + j];
ref[k] = accumulate_ref(rowbuf, cache128, dim);
}
}

// helper to view row k of the working-precision matrix
auto rowptr = [&](long k) { return &matr[(size_t)k*dim]; };

// accuracy: worst-case (max over output amps) absolute error vs quad reference
double err_naive = 0, err_kahan = 0;
for (long k = 0; k < dim; k++) {
T sn = accumulate_naive<T>(rowptr(k), cache, dim);
T sk = accumulate_kahan<T>(rowptr(k), cache, dim);
double dn = std::hypot((double)((__float128)(Real)sn.real() - ref[k].re),
(double)((__float128)(Real)sn.imag() - ref[k].im));
double dk = std::hypot((double)((__float128)(Real)sk.real() - ref[k].re),
(double)((__float128)(Real)sk.imag() - ref[k].im));
if (dn > err_naive) err_naive = dn;
if (dk > err_kahan) err_kahan = dk;
}

// runtime: full apply = compute all dim output amps; repeat for stable timing
volatile double sink = 0;
auto time_variant = [&](bool kahan) {
// warm up
for (long k = 0; k < dim; k++) {
T s = kahan ? accumulate_kahan<T>(rowptr(k), cache, dim)
: accumulate_naive<T>(rowptr(k), cache, dim);
sink += (double)s.real();
}
auto t0 = clk::now();
for (int rep = 0; rep < reps; rep++)
for (long k = 0; k < dim; k++) {
T s = kahan ? accumulate_kahan<T>(rowptr(k), cache, dim)
: accumulate_naive<T>(rowptr(k), cache, dim);
sink += (double)s.real();
}
auto t1 = clk::now();
double total_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
return total_ms / reps; // ms per full apply
};

Result<Real> res;
res.abs_err_naive = err_naive;
res.abs_err_kahan = err_kahan;
res.ms_naive = time_variant(false);
res.ms_kahan = time_variant(true);
(void)sink;
return res;
}

template <typename Real>
void run_precision(const char* label, const int* targs, int ntargs) {
for (int t = 0; t < ntargs; t++) {
int nt = targs[t];
// more reps for tiny sizes to get stable timing; fewer for big sizes
long dim = 1L << nt;
int reps = (int)std::max(1L, (long)(2'000'000 / (dim * dim)));
Result<Real> r = run_cell<Real>(nt, reps, 0xC0FFEEu + nt);
// CSV: precision,numTargets,dim,reps,err_naive,err_kahan,ms_naive,ms_kahan
printf("%s,%d,%ld,%d,%.6e,%.6e,%.6e,%.6e\n",
label, nt, dim, reps,
r.abs_err_naive, r.abs_err_kahan, r.ms_naive, r.ms_kahan);
fflush(stdout);
}
}

int main() {
// report what long double actually is on this host, for the record
fprintf(stderr, "host: sizeof(float)=%zu sizeof(double)=%zu sizeof(long double)=%zu "
"ld_mant=%d sizeof(__float128)=%zu f128_mant=%d\n",
sizeof(float), sizeof(double), sizeof(long double),
(int)__LDBL_MANT_DIG__, sizeof(__float128), (int)FLT128_MANT_DIG);

const int targs[] = {2, 4, 6, 8, 10, 12, 13, 14};
const int ntargs = sizeof(targs)/sizeof(targs[0]);

printf("precision,numTargets,dim,reps,err_naive,err_kahan,ms_naive,ms_kahan\n");
run_precision<float> ("fp1", targs, ntargs); // single
run_precision<double> ("fp2", targs, ntargs); // double
run_precision<long double> ("fp4", targs, ntargs); // QuEST "quad" = double on this host
return 0;
}
39 changes: 39 additions & 0 deletions artifacts/e2e_compmatr.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
// End-to-end check against the REAL QuEST library (the actual modified function),
// adapted from the recipe in issue #598. Applies a large all-ones-ish CompMatr to a
// fixed random state and reports amp[0] at high precision. Linked once against the
// default (Kahan) build and once against the QUEST_DENSE_ACCUM_NAIVE build to show
// the modified function's accuracy difference end-to-end.
#include "quest.h"
#include <cstdio>
#include <vector>

int main() {
initQuESTEnv();
int numTargets = 12;
int targets[] = {0,1,2,3,4,5,6,7,8,9,10,11};

CompMatr matrix = createCompMatr(numTargets);
// adversarial row 0: large +/- pair that cancels, plus many O(1) terms.
for (qindex i=0; i<matrix.numRows; i++)
for (qindex j=0; j<matrix.numRows; j++)
matrix.cpuElems[i][j] = 1;
matrix.cpuElems[0][0] = 1e18;
matrix.cpuElems[0][matrix.numRows-1] = -1e18;
syncCompMatr(matrix);

Qureg qureg = createQureg(numTargets);
// fixed deterministic state: all amps = 1 (unnormalised; validation off)
std::vector<qcomp> ones(qureg.numAmps, qcomp(1,0));
setQuregAmps(qureg, 0, ones);

setQuESTValidationEpsilon(0);
applyCompMatr(qureg, targets, numTargets, matrix);
qcomp amp = getQuregAmp(qureg, 0);
// expected exact amp[0] = (1e18) + (numRows-2)*1 + (-1e18) = numRows-2
printf("amp0.real = %.20g (exact = %lld)\n", (double)real(amp), (long long)matrix.numRows-2);

destroyCompMatr(matrix);
destroyQureg(qureg);
finalizeQuESTEnv();
return 0;
}
1 change: 1 addition & 0 deletions artifacts/host_info.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
host: sizeof(float)=4 sizeof(double)=8 sizeof(long double)=8 ld_mant=53 sizeof(__float128)=16 f128_mant=113
Binary file added artifacts/kahan_accuracy.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added artifacts/kahan_costbenefit.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added artifacts/kahan_runtime.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Loading