diff --git a/artifacts/README.md b/artifacts/README.md new file mode 100644 index 000000000..1a2f14d40 --- /dev/null +++ b/artifacts/README.md @@ -0,0 +1,95 @@ +# Kahan vs naive accumulation in `cpu_statevec_anyCtrlAnyTargDenseMatr_sub()` + +Benchmark + analysis for issue #598 / PR #784. Measures the runtime and accuracy +effect of compensated (Kahan) summation vs the original uncompensated accumulation +in the CPU dense-matrix inner-product loop, across the three QuEST `qcomp` +precisions, on a single CPU. + +## Files + +| file | purpose | +| --- | --- | +| `bench_kahan.cpp` | Standalone harness replicating the exact accumulation kernel (naive, Kahan, and a genuine-quad `__float128` reference) for fp1/fp2/fp4. | +| `plot_kahan.py` | Generates the three figures from `results.csv`. | +| `e2e_compmatr.cpp` | End-to-end check against the real built QuEST library (Kahan build vs `QUEST_DENSE_ACCUM_NAIVE` build) using the recipe from issue #598. | +| `results.csv` | Raw benchmark data. | +| `kahan_accuracy.png`, `kahan_runtime.png`, `kahan_costbenefit.png` | Figures. | + +## How the function was made selectable + +`cpu_statevec_anyCtrlAnyTargDenseMatr_sub()` now wraps the accumulation in a +compile-time switch. The default is compensated (Kahan) summation (preserving the +PR behaviour and its regression test); defining `QUEST_DENSE_ACCUM_NAIVE` reverts to +the original `sum += elem * cache[j]` for apples-to-apples benchmarking. Example: + +```bash +cmake -S . -B build -D QUEST_FLOAT_PRECISION=2 -D QUEST_ENABLE_OMP=OFF -D QUEST_ENABLE_MPI=OFF +# naive variant for benchmarking: +cmake -S . -B build_naive ... -D CMAKE_CXX_FLAGS="-DQUEST_DENSE_ACCUM_NAIVE" +``` + +## Reproduce + +```bash +# benchmark (GCC needed for __float128 genuine-quad reference) +g++-15 -O3 -std=c++17 bench_kahan.cpp -o bench_kahan -lquadmath +./bench_kahan > results.csv 2> host_info.txt +python3 plot_kahan.py +``` + +## The three-precision reality on this host (IMPORTANT) + +On this Apple-Silicon macOS host, **both AppleClang and Homebrew GCC report +`sizeof(long double) == 8` with a 53-bit mantissa** — i.e. `long double` is just +`double`. QuEST's fp4 build (which uses `long double`) is therefore **NOT a genuine +quadruple-precision build here**, and its accuracy is identical to fp2. This is +visible in `results.csv`: every fp4 error equals the corresponding fp2 error exactly. + +Consequently: +- The fp1/fp2/fp4 **runtime** numbers are all valid (they are what those builds + actually do on this machine). +- For the **accuracy ground truth** we do NOT use `long double`. We use GCC's + `__float128` (113-bit mantissa, genuine quad, via quadmath) as the reference and + measure each precision's error against it. + +## Key results (worst-case abs error vs 113-bit `__float128` reference) + +Adversarial ill-conditioned matrix (entries spanning 1e-6 .. 1e6). `ms` is per +single-CPU `applyCompMatr`; lower is better. + +| precision | targets | err naive | err Kahan | err reduction | ms naive | ms Kahan | slowdown | +| --- | ---: | --- | --- | ---: | ---: | ---: | ---: | +| fp1 | 8 | 1.57e5 | 4.90e4 | 3.2x | 0.038 | 0.141 | 3.7x | +| fp1 | 14 | 9.54e6 | 3.21e5 | 30x | 184 | 611 | 3.3x | +| fp2 | 8 | 3.49e-4 | 5.28e-5 | 6.6x | 0.032 | 0.133 | 4.2x | +| fp2 | 14 | 3.29e-2 | 6.54e-4 | 50x | 175 | 608 | 3.5x | +| fp4 | 14 | 3.29e-2 | 6.54e-4 | 50x | 173 | 608 | 3.5x | *(== fp2; not true quad)* + +## Conclusion + +- **Accuracy benefit grows with the number of targets**, exactly as the issue + anticipated: negligible at 2–4 targets (no cancellation yet), then 1–2 orders of + magnitude error reduction by 12–14 targets, for both single and double precision. +- **Cost is roughly constant at ~3.3–4.2x slowdown** at every non-trivial size *in + this in-cache micro-benchmark*. Here the matrix+cache stay resident across `reps`, + so the loop is compute-bound (4 complex ops per term vs 1) and the overhead does not + amortise — it is *not* "time-free". Caveat: the real per-apply cost at large target + counts (e.g. a 16384x16384 fp2 matrix is ~4 GB) is memory-bandwidth bound, so the + *relative* Kahan cost in production may be smaller than this micro-benchmark shows; + treat 3.3–4.2x as a standalone-kernel upper bound, not a real-library figure. +- **End-to-end** the modified function turns a fully cancelled `amp0 = 0` (naive) into + `amp0 = 4096` (Kahan) for a 12-target adversarial `CompMatr` whose exact answer is + `numRows-2 = 4094` — i.e. naive loses the result entirely while Kahan recovers it to + within +2 (a residual O(1) error after subtracting 1e18-scale terms). + +So Kahan is worthwhile for large or ill-conditioned dense matrices where the +accuracy matters, but a blanket 3–4x slowdown is too costly for small/well-conditioned +matrices. A size threshold (e.g. enable Kahan only above ~6–8 targets) is the natural +compromise; the `QUEST_DENSE_ACCUM_NAIVE` toggle is the simplest hook for that. + +Note on scope of the benefit: the 30x/50x error reductions above are for a +deliberately adversarial, non-unitary matrix (entries spanning 1e-6 .. 1e6 with +cancelling large pairs). Real `CompMatr` inputs are typically unitary with O(1) +entries, where cancellation — and hence the benefit — is far milder at any target +count. The threshold argument is therefore a worst-case bound, not the expected +benefit on well-conditioned operators. diff --git a/artifacts/bench_kahan.cpp b/artifacts/bench_kahan.cpp new file mode 100644 index 000000000..fbaf0ce8a --- /dev/null +++ b/artifacts/bench_kahan.cpp @@ -0,0 +1,221 @@ +/* + * Cost/benefit benchmark for compensated (Kahan) vs naive accumulation in the + * dense-matrix inner-product kernel of cpu_statevec_anyCtrlAnyTargDenseMatr_sub(). + * + * This harness replicates the accumulation performed by that QuEST function: + * each output amplitude is sum_j matr[k][j] * cache[j], where j ranges over the + * 2^numTargets matrix columns. We reproduce that serial inner loop once with naive + * summation and once with complex Kahan summation. NOTE: the harness uses + * std::complex rather than QuEST's base_qcomp type; the two are arithmetically + * identical here (both perform plain componentwise IEEE real/imaginary arithmetic, so + * complex Kahan degenerates to two independent real Kahan sums), but this is not + * literally QuEST's struct. The real-library accuracy claims (regression test, e2e + * driver) exercise the actual base_qcomp type. Precisions: + * + * fp1 -> std::complex (24-bit mantissa) + * fp2 -> std::complex (53-bit mantissa) + * fp4 -> std::complex (QuEST's "quad" precision) + * + * IMPORTANT HONESTY NOTE on fp4: + * On this Apple-Silicon macOS host, BOTH AppleClang and Homebrew GCC report + * sizeof(long double)==8 with a 53-bit mantissa -- i.e. long double IS double. + * Therefore QuEST's fp4 build is NOT a genuine quadruple-precision build here, + * and fp4 accuracy is expected to equal fp2. We report fp4 anyway (it is what a + * user compiling QUEST_FLOAT_PRECISION=4 on this machine actually gets) but flag + * it as not-truly-quad. + * + * For the ACCURACY GROUND-TRUTH we therefore do NOT use long double. We use GCC's + * __float128 (113-bit mantissa, genuine quad) via quadmath as the reference, and + * measure each precision's error against that genuinely-higher-precision result. + * + * Build (GCC required for __float128 / quadmath): + * g++-15 -O3 -std=c++17 bench_kahan.cpp -o bench_kahan -lquadmath + */ + +#include +#include +#include +#include +#include +#include +#include + +using clk = std::chrono::steady_clock; + +// ---- genuine-quad complex (reference) built on __float128 ------------------- +struct qcomp128 { + __float128 re, im; +}; +static inline qcomp128 operator*(qcomp128 a, qcomp128 b) { + return { a.re*b.re - a.im*b.im, a.re*b.im + a.im*b.re }; +} +static inline qcomp128 operator+(qcomp128 a, qcomp128 b) { + return { a.re+b.re, a.im+b.im }; +} + +// naive accumulation, templated on the working complex type T (== std::complex<...>) +template +T accumulate_naive(const Elem& row, const Cache& cache, long dim) { + T sum(0, 0); + for (long j = 0; j < dim; j++) + sum += row[j] * cache[j]; + return sum; +} + +// compensated (Kahan) accumulation -- mirrors the QuEST function body exactly +template +T accumulate_kahan(const Elem& row, const Cache& cache, long dim) { + T sum(0, 0); + T compensation(0, 0); + for (long j = 0; j < dim; j++) { + T product = row[j] * cache[j]; + T corrected = product - compensation; + T next = sum + corrected; + compensation = (next - sum) - corrected; + sum = next; + } + return sum; +} + +// reference accumulation in genuine quad (113-bit) -- naive is fine since the +// reference type so vastly out-precisions the candidates that its own rounding +// is negligible relative to fp1/fp2 error. +qcomp128 accumulate_ref(const std::vector& row, const std::vector& cache, long dim) { + qcomp128 sum{0, 0}; + for (long j = 0; j < dim; j++) + sum = sum + (row[j] * cache[j]); + return sum; +} + +template +struct Result { + double abs_err_naive; // |naive - ref| + double abs_err_kahan; // |kahan - ref| + double ms_naive; // ms per full apply (all output amps) + double ms_kahan; +}; + +// Run one (precision, numTargets) cell. We build a random unitary-ish dense matrix +// and a random input vector, in genuine quad, then DOWN-CAST to the working precision +// so that naive/kahan see identically-rounded inputs and differ only in the sum. +template +Result run_cell(int numTargets, int reps, unsigned seed) { + using T = std::complex; + const long dim = 1L << numTargets; // matrix is dim x dim, vector length dim + + std::mt19937_64 rng(seed); + std::uniform_real_distribution uni(-1.0, 1.0); + + // reference-precision matrix (row-major) and input cache + std::vector matr128((size_t)dim * dim); + std::vector cache128(dim); + + // To exercise cancellation realistically, scale entries across many magnitudes: + // mix O(1) terms with a few very large +/- pairs that must cancel. + for (long i = 0; i < dim; i++) { + double mag = std::pow(10.0, uni(rng) * 6.0); // 1e-6 .. 1e6 + cache128[i] = { (__float128)(uni(rng) * mag), (__float128)(uni(rng) * mag) }; + } + for (long r = 0; r < dim; r++) + for (long c = 0; c < dim; c++) { + double mag = std::pow(10.0, uni(rng) * 6.0); + matr128[(size_t)r*dim + c] = { (__float128)(uni(rng) * mag), (__float128)(uni(rng) * mag) }; + } + + // down-cast to working precision + std::vector matr((size_t)dim * dim); + std::vector cache(dim); + for (size_t i = 0; i < matr128.size(); i++) + matr[i] = T((Real)(double)matr128[i].re, (Real)(double)matr128[i].im); + for (long i = 0; i < dim; i++) + cache[i] = T((Real)(double)cache128[i].re, (Real)(double)cache128[i].im); + + // genuine-quad reference for every output amp + std::vector ref(dim); + { + std::vector rowbuf(dim); + for (long k = 0; k < dim; k++) { + for (long j = 0; j < dim; j++) + rowbuf[j] = matr128[(size_t)k*dim + j]; + ref[k] = accumulate_ref(rowbuf, cache128, dim); + } + } + + // helper to view row k of the working-precision matrix + auto rowptr = [&](long k) { return &matr[(size_t)k*dim]; }; + + // accuracy: worst-case (max over output amps) absolute error vs quad reference + double err_naive = 0, err_kahan = 0; + for (long k = 0; k < dim; k++) { + T sn = accumulate_naive(rowptr(k), cache, dim); + T sk = accumulate_kahan(rowptr(k), cache, dim); + double dn = std::hypot((double)((__float128)(Real)sn.real() - ref[k].re), + (double)((__float128)(Real)sn.imag() - ref[k].im)); + double dk = std::hypot((double)((__float128)(Real)sk.real() - ref[k].re), + (double)((__float128)(Real)sk.imag() - ref[k].im)); + if (dn > err_naive) err_naive = dn; + if (dk > err_kahan) err_kahan = dk; + } + + // runtime: full apply = compute all dim output amps; repeat for stable timing + volatile double sink = 0; + auto time_variant = [&](bool kahan) { + // warm up + for (long k = 0; k < dim; k++) { + T s = kahan ? accumulate_kahan(rowptr(k), cache, dim) + : accumulate_naive(rowptr(k), cache, dim); + sink += (double)s.real(); + } + auto t0 = clk::now(); + for (int rep = 0; rep < reps; rep++) + for (long k = 0; k < dim; k++) { + T s = kahan ? accumulate_kahan(rowptr(k), cache, dim) + : accumulate_naive(rowptr(k), cache, dim); + sink += (double)s.real(); + } + auto t1 = clk::now(); + double total_ms = std::chrono::duration(t1 - t0).count(); + return total_ms / reps; // ms per full apply + }; + + Result res; + res.abs_err_naive = err_naive; + res.abs_err_kahan = err_kahan; + res.ms_naive = time_variant(false); + res.ms_kahan = time_variant(true); + (void)sink; + return res; +} + +template +void run_precision(const char* label, const int* targs, int ntargs) { + for (int t = 0; t < ntargs; t++) { + int nt = targs[t]; + // more reps for tiny sizes to get stable timing; fewer for big sizes + long dim = 1L << nt; + int reps = (int)std::max(1L, (long)(2'000'000 / (dim * dim))); + Result r = run_cell(nt, reps, 0xC0FFEEu + nt); + // CSV: precision,numTargets,dim,reps,err_naive,err_kahan,ms_naive,ms_kahan + printf("%s,%d,%ld,%d,%.6e,%.6e,%.6e,%.6e\n", + label, nt, dim, reps, + r.abs_err_naive, r.abs_err_kahan, r.ms_naive, r.ms_kahan); + fflush(stdout); + } +} + +int main() { + // report what long double actually is on this host, for the record + fprintf(stderr, "host: sizeof(float)=%zu sizeof(double)=%zu sizeof(long double)=%zu " + "ld_mant=%d sizeof(__float128)=%zu f128_mant=%d\n", + sizeof(float), sizeof(double), sizeof(long double), + (int)__LDBL_MANT_DIG__, sizeof(__float128), (int)FLT128_MANT_DIG); + + const int targs[] = {2, 4, 6, 8, 10, 12, 13, 14}; + const int ntargs = sizeof(targs)/sizeof(targs[0]); + + printf("precision,numTargets,dim,reps,err_naive,err_kahan,ms_naive,ms_kahan\n"); + run_precision ("fp1", targs, ntargs); // single + run_precision ("fp2", targs, ntargs); // double + run_precision ("fp4", targs, ntargs); // QuEST "quad" = double on this host + return 0; +} diff --git a/artifacts/e2e_compmatr.cpp b/artifacts/e2e_compmatr.cpp new file mode 100644 index 000000000..0682f762f --- /dev/null +++ b/artifacts/e2e_compmatr.cpp @@ -0,0 +1,39 @@ +// End-to-end check against the REAL QuEST library (the actual modified function), +// adapted from the recipe in issue #598. Applies a large all-ones-ish CompMatr to a +// fixed random state and reports amp[0] at high precision. Linked once against the +// default (Kahan) build and once against the QUEST_DENSE_ACCUM_NAIVE build to show +// the modified function's accuracy difference end-to-end. +#include "quest.h" +#include +#include + +int main() { + initQuESTEnv(); + int numTargets = 12; + int targets[] = {0,1,2,3,4,5,6,7,8,9,10,11}; + + CompMatr matrix = createCompMatr(numTargets); + // adversarial row 0: large +/- pair that cancels, plus many O(1) terms. + for (qindex i=0; i ones(qureg.numAmps, qcomp(1,0)); + setQuregAmps(qureg, 0, ones); + + setQuESTValidationEpsilon(0); + applyCompMatr(qureg, targets, numTargets, matrix); + qcomp amp = getQuregAmp(qureg, 0); + // expected exact amp[0] = (1e18) + (numRows-2)*1 + (-1e18) = numRows-2 + printf("amp0.real = %.20g (exact = %lld)\n", (double)real(amp), (long long)matrix.numRows-2); + + destroyCompMatr(matrix); + destroyQureg(qureg); + finalizeQuESTEnv(); + return 0; +} diff --git a/artifacts/host_info.txt b/artifacts/host_info.txt new file mode 100644 index 000000000..54aaa4fbf --- /dev/null +++ b/artifacts/host_info.txt @@ -0,0 +1 @@ +host: sizeof(float)=4 sizeof(double)=8 sizeof(long double)=8 ld_mant=53 sizeof(__float128)=16 f128_mant=113 diff --git a/artifacts/kahan_accuracy.png b/artifacts/kahan_accuracy.png new file mode 100644 index 000000000..1e2a4cf50 Binary files /dev/null and b/artifacts/kahan_accuracy.png differ diff --git a/artifacts/kahan_costbenefit.png b/artifacts/kahan_costbenefit.png new file mode 100644 index 000000000..8d5e145bb Binary files /dev/null and b/artifacts/kahan_costbenefit.png differ diff --git a/artifacts/kahan_runtime.png b/artifacts/kahan_runtime.png new file mode 100644 index 000000000..94d7800b2 Binary files /dev/null and b/artifacts/kahan_runtime.png differ diff --git a/artifacts/plot_kahan.py b/artifacts/plot_kahan.py new file mode 100644 index 000000000..ce23ee39a --- /dev/null +++ b/artifacts/plot_kahan.py @@ -0,0 +1,100 @@ +#!/usr/bin/env python3 +""" +Visualise the cost/benefit of compensated (Kahan) vs naive accumulation in +cpu_statevec_anyCtrlAnyTargDenseMatr_sub(), per qcomp precision. + +Input : results.csv (produced by bench_kahan) +Output: kahan_accuracy.png, kahan_runtime.png, kahan_costbenefit.png +""" +import csv +import matplotlib +matplotlib.use("Agg") +import matplotlib.pyplot as plt + +rows = [] +with open("results.csv") as f: + for r in csv.DictReader(f): + rows.append(r) + +precs = ["fp1", "fp2", "fp4"] +labels = { + "fp1": "fp1 (float, 24-bit)", + "fp2": "fp2 (double, 53-bit)", + "fp4": "fp4* (long double = double here, 53-bit)", +} +colors = {"fp1": "#d62728", "fp2": "#1f77b4", "fp4": "#2ca02c"} + +def series(p, key): + xs, ys = [], [] + for r in rows: + if r["precision"] == p: + xs.append(int(r["numTargets"])) + ys.append(float(r[key])) + return xs, ys + +# ---------------- accuracy ---------------- +fig, ax = plt.subplots(figsize=(8, 5.5)) +for p in precs: + x, yn = series(p, "err_naive") + _, yk = series(p, "err_kahan") + ax.plot(x, yn, "--o", color=colors[p], label=f"{labels[p]} - naive", alpha=0.9) + ax.plot(x, yk, "-s", color=colors[p], label=f"{labels[p]} - Kahan", alpha=0.9) +ax.set_yscale("log") +ax.set_xlabel("number of target qubits (matrix is 2^n x 2^n)") +ax.set_ylabel("worst-case abs error vs __float128 (113-bit) reference") +ax.set_title("Accuracy: Kahan vs naive dense-matrix accumulation\n" + "(adversarial ill-conditioned matrix, single-CPU)") +ax.grid(True, which="both", alpha=0.3) +ax.legend(fontsize=7, loc="upper left") +fig.tight_layout() +fig.savefig("kahan_accuracy.png", dpi=130) + +# ---------------- runtime ---------------- +fig, ax = plt.subplots(figsize=(8, 5.5)) +for p in precs: + x, tn = series(p, "ms_naive") + _, tk = series(p, "ms_kahan") + ax.plot(x, tn, "--o", color=colors[p], label=f"{labels[p]} - naive") + ax.plot(x, tk, "-s", color=colors[p], label=f"{labels[p]} - Kahan") +ax.set_yscale("log") +ax.set_xlabel("number of target qubits (matrix is 2^n x 2^n)") +ax.set_ylabel("runtime per applyCompMatr [ms] (single CPU, -O3)") +ax.set_title("Runtime cost: Kahan vs naive dense-matrix accumulation") +ax.grid(True, which="both", alpha=0.3) +ax.legend(fontsize=7, loc="upper left") +fig.tight_layout() +fig.savefig("kahan_runtime.png", dpi=130) + +# ---------------- combined cost/benefit ---------------- +fig, (a1, a2) = plt.subplots(1, 2, figsize=(13, 5.5)) +for p in precs: + x, yn = series(p, "err_naive") + _, yk = series(p, "err_kahan") + # accuracy benefit = error reduction factor (naive/kahan) + benefit = [(n / k) if k > 0 else 1.0 for n, k in zip(yn, yk)] + a1.plot(x, benefit, "-o", color=colors[p], label=labels[p]) +a1.axhline(1.0, color="k", lw=0.8, ls=":") +a1.set_yscale("log") +a1.set_xlabel("number of target qubits") +a1.set_ylabel("accuracy BENEFIT = err_naive / err_kahan (>1 = Kahan better)") +a1.set_title("Benefit: error-reduction factor from Kahan") +a1.grid(True, which="both", alpha=0.3) +a1.legend(fontsize=8) + +for p in precs: + x, tn = series(p, "ms_naive") + _, tk = series(p, "ms_kahan") + cost = [k / n if n > 0 else 1.0 for n, k in zip(tn, tk)] + a2.plot(x, cost, "-s", color=colors[p], label=labels[p]) +a2.axhline(1.0, color="k", lw=0.8, ls=":") +a2.set_xlabel("number of target qubits") +a2.set_ylabel("runtime COST = ms_kahan / ms_naive (>1 = Kahan slower)") +a2.set_title("Cost: runtime slowdown from Kahan") +a2.grid(True, alpha=0.3) +a2.legend(fontsize=8) +fig.suptitle("Cost / Benefit of Kahan summation in cpu_statevec_anyCtrlAnyTargDenseMatr_sub()", + fontsize=12) +fig.tight_layout(rect=[0, 0, 1, 0.96]) +fig.savefig("kahan_costbenefit.png", dpi=130) + +print("wrote kahan_accuracy.png, kahan_runtime.png, kahan_costbenefit.png") diff --git a/artifacts/results.csv b/artifacts/results.csv new file mode 100644 index 000000000..0e9c9c580 --- /dev/null +++ b/artifacts/results.csv @@ -0,0 +1,25 @@ +precision,numTargets,dim,reps,err_naive,err_kahan,ms_naive,ms_kahan +fp1,2,4,125000,5.772414e-01,5.772414e-01,8.144000e-06,1.000000e-05 +fp1,4,16,7812,5.412803e+02,3.793770e+02,1.137993e-04,2.629288e-04 +fp1,6,64,488,6.479917e+04,4.669267e+04,2.247951e-03,7.772541e-03 +fp1,8,256,30,1.572572e+05,4.900399e+04,3.790000e-02,1.410667e-01 +fp1,10,1024,1,5.970935e+05,1.327381e+05,6.660000e-01,2.438000e+00 +fp1,12,4096,1,2.296930e+06,2.200285e+05,1.123600e+01,3.754900e+01 +fp1,13,8192,1,4.791550e+06,2.534412e+05,4.654500e+01,1.447900e+02 +fp1,14,16384,1,9.542947e+06,3.213748e+05,1.843500e+02,6.110560e+02 +fp2,2,4,125000,5.681711e-10,5.681711e-10,6.760000e-06,9.752000e-06 +fp2,4,16,7812,1.122881e-06,1.122881e-06,1.017665e-04,1.899642e-04 +fp2,6,64,488,1.455275e-04,8.359876e-05,1.639344e-03,5.977459e-03 +fp2,8,256,30,3.492193e-04,5.282792e-05,3.206667e-02,1.330667e-01 +fp2,10,1024,1,1.960401e-03,1.953018e-04,6.750000e-01,2.257000e+00 +fp2,12,4096,1,1.055728e-02,3.008565e-04,1.093100e+01,3.787300e+01 +fp2,13,8192,1,2.110199e-02,4.374301e-04,4.323500e+01,1.475820e+02 +fp2,14,16384,1,3.292284e-02,6.538236e-04,1.751520e+02,6.084400e+02 +fp4,2,4,125000,5.681711e-10,5.681711e-10,7.256000e-06,9.960000e-06 +fp4,4,16,7812,1.122881e-06,1.122881e-06,1.086790e-04,1.980287e-04 +fp4,6,64,488,1.455275e-04,8.359876e-05,1.657787e-03,6.247951e-03 +fp4,8,256,30,3.492193e-04,5.282792e-05,3.270000e-02,1.344667e-01 +fp4,10,1024,1,1.960401e-03,1.953018e-04,6.720000e-01,2.146000e+00 +fp4,12,4096,1,1.055728e-02,3.008565e-04,1.061000e+01,3.800100e+01 +fp4,13,8192,1,2.110199e-02,4.374301e-04,4.505800e+01,1.536850e+02 +fp4,14,16384,1,3.292284e-02,6.538236e-04,1.731930e+02,6.080500e+02 diff --git a/quest/src/cpu/cpu_subroutines.cpp b/quest/src/cpu/cpu_subroutines.cpp index 59df946e9..a52131f0d 100644 --- a/quest/src/cpu/cpu_subroutines.cpp +++ b/quest/src/cpu/cpu_subroutines.cpp @@ -607,8 +607,17 @@ void cpu_statevec_anyCtrlAnyTargDenseMatr_sub(Qureg qureg, ConstList64 ctrls, Co // i = nth local index where ctrls are active and targs form value k qindex i = setBits(i0, targs.data(), numTargBits, k); // loop may be unrolled - amps[i] = getCpuQcomp(0, 0); - + cpu_qcomp sum = getCpuQcomp(0, 0); + + // the inner-product accumulation below uses compensated (Kahan) summation to + // reduce catastrophic cancellation when combining the (up to 2^16) cache amps. + // The running compensation 'c' tracks the low-order bits lost by 'sum += ...'. + // Defining QUEST_DENSE_ACCUM_NAIVE reverts to the original uncompensated sum, + // enabling apples-to-apples runtime/accuracy benchmarking of the two variants. + #ifndef QUEST_DENSE_ACCUM_NAIVE + cpu_qcomp compensation = getCpuQcomp(0, 0); + #endif + // loop may be unrolled for (qindex j=0; j