Fuse distributed prefix-suffix multi-SWAPs into one all-to-all round (GPU + controlled) by thedaemon-wizard · Pull Request #790 · QuEST-Kit/QuEST

thedaemon-wizard · 2026-06-11T15:16:28Z

Fuse distributed prefix↔suffix multi-SWAPs into a single all-to-all round (GPU + controlled)

Closes #595

Summary

QuEST relocates prefix (global / inter-node) qubits into the suffix (local)
region before applying a many-target dense matrix (applyCompMatr,
applyMultiQubitMatr) or a partial trace on a distributed register. This
"cache blocking" is performed by the internal
anyCtrlMultiSwapBetweenPrefixAndSuffix, whose own @todo asks for the
k disjoint prefix↔suffix SWAPs to be fused. Historically they ran one at a
time, so an amplitude relocated to another node by one SWAP could be relocated
again by the next, costing k communication rounds and moving up to (k/2)·N
amplitudes in total.

This PR replaces that loop with a single personalized all-to-all exchange
over the 2ᵏ-rank subcube, sending each amplitude directly to its final node
in one round (cf. cuStateVec distributed index-bit swap, mpiQulacs
[arXiv:2203.16044], [arXiv:2311.01512] §IV C–D, [arXiv:quant-ph/0608239]).

It is deliberately scoped to complement, not duplicate, the two excellent
CPU-only PRs already open against this issue:

- Fuse uncontrolled distributed CPU prefix-suffix multi-SWAPs #786 (@nathandelcid): fuses the uncontrolled CPU multi-swap into ≤2
  waves within the existing cpuCommBuffer. (Maintainer: "This is a beautiful
  diff! 🎉")
- Fuse distributed prefix-suffix multi-SWAP (closes #595) #785 (@zkasuran): single-round-per-subset CPU/OMP fusion; its author
  explicitly defers the async single round and GPU kernels as follow-ups
  "because I have no CUDA hardware to compile/test them."

This PR contributes exactly the pieces those two leave open, fully compiled
and tested on real hardware:

- GPU (CUDA) pack/unpack kernels for the fused swap: compiled and tested
  on an NVIDIA Blackwell sm_120 GPU (CUDA 13.0), the precise piece that #785
  could not build.
- The controlled path: fusion is applied with arbitrary control qubits /
  states intact (both #785 and #786 keep the controlled case on the slow
  per-swap path).
- A true single all-to-all round (batched Isend/Irecv + one MPI_Waitall),
  with a persistent, lazily-grown staging workspace mirroring QuEST's existing
  gpuCache and the workspaces of cuStateVec / mpiQulacs.
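
As a rough illustration of the "persistent, lazily-grown staging workspace" idea in the last item, the sketch below shows a buffer that only reallocates when a request exceeds its current capacity and is released once at teardown. The class name `StagingWorkspace` and all of its members are hypothetical; this is not the type used in the PR or in QuEST.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a persistent staging buffer (not QuEST's actual type).
class StagingWorkspace {
public:
    // Return storage for at least numAmps amplitudes, growing only when needed.
    std::complex<double>* acquire(std::size_t numAmps) {
        assert(numAmps > 0);
        if (numAmps > buffer_.size()) {
            buffer_.resize(numAmps);   // grow lazily; never shrink between calls
            ++numGrowths_;
        }
        return buffer_.data();
    }

    // Release the memory; intended to be called once at environment teardown.
    void release() {
        buffer_.clear();
        buffer_.shrink_to_fit();
    }

    std::size_t capacityInAmps() const { return buffer_.size(); }
    int numGrowths() const { return numGrowths_; }

private:
    std::vector<std::complex<double>> buffer_;
    int numGrowths_ = 0;
};
```

Repeated calls with the same or a smaller size reuse the existing allocation, so the allocation cost is paid once per high-water mark rather than once per swap.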

If the maintainers prefer #786 for the CPU core, the GPU kernels and the
controlled-path handling here can be rebased on top of it.

What changed

anyCtrlMultiSwapBetweenPrefixAndSuffix now collects the disjoint
prefix↔suffix pairs and:

- k = 0 → returns;
- k = 1 → delegates to the existing single-swap routine
  (zero behavioural change, no staging buffer);
- k ≥ 2 → calls the new fusedMultiSwapBetweenPrefixAndSuffix.
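
The dispatch above can be sketched as follows. This is a minimal illustration only: `collectPrefixSuffixPairs`, `chooseSwapPath`, and `SwapPath` are hypothetical names, and the sketch assumes that `targsA[i]` is swapped with `targsB[i]` and that qubits with index `>= numSuffixQubits` are prefix qubits. Pairs that do not cross the prefix/suffix boundary are simply skipped here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

enum class SwapPath { Nothing, SingleSwap, FusedMultiSwap };

// Collect the (prefix, suffix) pairs that require inter-node communication.
// Assumption: qubits >= numSuffixQubits are prefix (global) qubits.
std::vector<std::pair<int, int>> collectPrefixSuffixPairs(
    const std::vector<int>& targsA,
    const std::vector<int>& targsB,
    int numSuffixQubits)
{
    assert(targsA.size() == targsB.size());
    std::vector<std::pair<int, int>> pairs;
    for (std::size_t i = 0; i < targsA.size(); ++i) {
        const int hi = std::max(targsA[i], targsB[i]);
        const int lo = std::min(targsA[i], targsB[i]);
        if (hi >= numSuffixQubits && lo < numSuffixQubits)
            pairs.emplace_back(hi, lo);   // (prefix qubit, suffix qubit)
    }
    return pairs;
}

// Choose the code path from the number k of prefix-suffix pairs.
SwapPath chooseSwapPath(std::size_t k) {
    if (k == 0) return SwapPath::Nothing;         // nothing to relocate
    if (k == 1) return SwapPath::SingleSwap;      // existing single-swap routine
    return SwapPath::FusedMultiSwap;              // one all-to-all round
}
```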

The fused algorithm (sub-block transpose over a `2ᵏ` subcube)

The ranks that differ from ours only in the rank-bits of the swapped prefix
qubits form a 2ᵏ-rank subcube. Label each rank in it by its k-bit address a
(the values of those rank-bits). A local amplitude j on rank a whose k swapped
suffix bits form the pattern v moves to rank v, where its swapped suffix bits
become a:

- the block with v == a (own address) stays put, untouched: no work;
- for each of the 2ᵏ−1 partner ranks, the local amplitudes whose suffix
  pattern equals the partner's address are packed, sent, and the received block
  is written back into the same local indices;
- the 2ᵏ−1 index sets are disjoint, so all exchanges run concurrently in
  one round.
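
To make the index bookkeeping in the list above concrete, here is a small host-side sketch of how a rank might enumerate its partners and the local indices it packs for each one. All function names are hypothetical, and the loop over every local index is for clarity only; it is not how the PR's kernels are written.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Index = std::size_t;

// Read the bits of `value` at the given positions into a compact k-bit pattern.
Index extractPattern(Index value, const std::vector<int>& bitPositions) {
    Index pattern = 0;
    for (std::size_t i = 0; i < bitPositions.size(); ++i)
        pattern |= ((value >> bitPositions[i]) & Index(1)) << i;
    return pattern;
}

// Overwrite the bits of `value` at the given positions with a k-bit pattern.
Index insertPattern(Index value, const std::vector<int>& bitPositions, Index pattern) {
    for (std::size_t i = 0; i < bitPositions.size(); ++i) {
        value &= ~(Index(1) << bitPositions[i]);
        value |= ((pattern >> i) & Index(1)) << bitPositions[i];
    }
    return value;
}

// Rank of the subcube member with the given k-bit address; all other
// rank bits are copied from myRank.
Index partnerRank(Index myRank, const std::vector<int>& rankBits, Index address) {
    return insertPattern(myRank, rankBits, address);
}

// Local indices whose swapped-suffix pattern equals `address`. These are packed
// for the partner with that address, and also receive that partner's block.
std::vector<Index> indicesToPack(Index numLocalAmps,
                                 const std::vector<int>& suffixBits,
                                 Index address) {
    assert(address < (Index(1) << suffixBits.size()));
    std::vector<Index> indices;
    for (Index j = 0; j < numLocalAmps; ++j)
        if (extractPattern(j, suffixBits) == address)
            indices.push_back(j);
    return indices;
}
```

With 8 local amplitudes and swapped suffix bits {0, 2}, each of the four patterns selects two indices, and the four sets together cover every local index exactly once, which is the disjointness property the single round relies on.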

Total moved = (1 − 2⁻ᵏ)·N (vs sequential k/2·N); peak extra memory is one
staging buffer ≤ N per node, reused across calls and freed at
finalizeQuESTEnv.
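
The two data-movement figures can be derived with a short counting argument, sketched here for the uncontrolled case. A single prefix↔suffix SWAP relocates an amplitude only when its prefix bit and suffix bit differ, which holds for half of the N amplitudes. In the fused scheme an amplitude stays put only when v = a, which holds for a fraction 2⁻ᵏ of them. The symbols M_seq and M_fused are introduced here purely for illustration.

```latex
\begin{align*}
M_{\text{seq}}(k)   &= \sum_{i=1}^{k} \frac{N}{2} = \frac{k}{2}\,N, \\
M_{\text{fused}}(k) &= \bigl(1 - 2^{-k}\bigr)\,N, \\
\frac{M_{\text{fused}}(k)}{M_{\text{seq}}(k)} &= \frac{2\,\bigl(1 - 2^{-k}\bigr)}{k}.
\end{align*}
```

For k = 2, 3, 4 this gives 0.75 N vs 1.0 N, 0.875 N vs 1.5 N, and 0.9375 N vs 2.0 N. The two counts coincide at k = 1, where both equal N/2.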

Files

File	Change
`core/localiser.cpp`	rewrite entry point; add `fusedMultiSwapBetweenPrefixAndSuffix`; retain `multiSwapSequentially` as reference + `QUEST_DISABLE_SWAP_FUSION` benchmark toggle
`comm/comm_routines.{cpp,hpp}`	`comm_exchangeAmpsToBuffersForFusedSwap` (CPU / GPU-direct / GPU-staged) + `exchangeArraysWithMultiplePartners` (single `Isend/Irecv` batch + one `MPI_Waitall`)
`cpu/cpu_subroutines.{cpp,hpp}`	OpenMP pack/unpack (templated on #targs)
`gpu/gpu_subroutines.{cpp,hpp}`	CUDA pack/unpack reusing the existing, tested `kernel_statevec_packAmpsIntoBuffer` / `kernel_statevec_anyCtrlSwap_subB`
`core/accelerator.{cpp,hpp}`	CPU/GPU dispatch + persistent staging-workspace cache
`api/environment.cpp`	free the staging workspace at teardown

No public API change; the four existing call sites are untouched.

Correctness

Disjoint single-pair swaps commute (already relied upon here), so the fused
result is identical to the sequential one. The QUEST_DISABLE_SWAP_FUSION=1
toggle runs the old sequential path for direct comparison. Density-matrix
callers (partial trace) pass ket+bra targets exactly as before; controls pass
straight through.
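
The equivalence claim can be checked on a toy model. The sketch below simulates a distributed register as one flat array (rank index in the high bits, local index in the low bits), uses integer labels in place of complex amplitudes, and compares k sequential bit-swaps against a single fused pass in which every rank fills each non-own block directly from the matching partner. It is an illustrative host-side model with hypothetical function names, not the PR's implementation, and it ignores control qubits.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Index = std::size_t;

// Label every amplitude of an n-qubit register with its own global index.
std::vector<int> makeLabelledAmps(int numQubits) {
    std::vector<int> amps(Index(1) << numQubits);
    for (Index g = 0; g < amps.size(); ++g) amps[g] = static_cast<int>(g);
    return amps;
}

// Swap bits b1 and b2 of index g.
Index swapBits(Index g, int b1, int b2) {
    const Index differ = ((g >> b1) ^ (g >> b2)) & Index(1);
    return g ^ ((differ << b1) | (differ << b2));
}

// Reference: apply the k SWAPs one at a time on the full vector.
std::vector<int> sequentialSwaps(std::vector<int> amps,
                                 const std::vector<int>& prefixQubits,
                                 const std::vector<int>& suffixQubits) {
    assert(prefixQubits.size() == suffixQubits.size());
    for (std::size_t i = 0; i < prefixQubits.size(); ++i) {
        std::vector<int> out(amps.size());
        for (Index g = 0; g < amps.size(); ++g)
            out[swapBits(g, prefixQubits[i], suffixQubits[i])] = amps[g];
        amps = out;
    }
    return amps;
}

Index getPattern(Index value, const std::vector<int>& bits) {
    Index p = 0;
    for (std::size_t i = 0; i < bits.size(); ++i)
        p |= ((value >> bits[i]) & Index(1)) << i;
    return p;
}

Index setPattern(Index value, const std::vector<int>& bits, Index pattern) {
    for (std::size_t i = 0; i < bits.size(); ++i) {
        value &= ~(Index(1) << bits[i]);
        value |= ((pattern >> i) & Index(1)) << bits[i];
    }
    return value;
}

// Fused: each rank exchanges one block with each of its 2^k - 1 partners.
std::vector<int> fusedSwaps(const std::vector<int>& amps, int numLocalQubits,
                            const std::vector<int>& prefixQubits,
                            const std::vector<int>& suffixQubits) {
    assert(prefixQubits.size() == suffixQubits.size());
    const Index numLocal = Index(1) << numLocalQubits;
    const Index numRanks = amps.size() / numLocal;
    const Index numPatterns = Index(1) << prefixQubits.size();
    std::vector<int> rankBits;
    for (int q : prefixQubits) rankBits.push_back(q - numLocalQubits);

    std::vector<int> out = amps;  // blocks with pattern == own address stay put
    for (Index rank = 0; rank < numRanks; ++rank) {
        const Index a = getPattern(rank, rankBits);
        for (Index v = 0; v < numPatterns; ++v) {
            if (v == a) continue;
            const Index partner = setPattern(rank, rankBits, v);
            // The partner sends its amplitudes with suffix pattern a;
            // they land in our local indices with suffix pattern v.
            for (Index j = 0; j < numLocal; ++j) {
                if (getPattern(j, suffixQubits) != v) continue;
                const Index jPartner = setPattern(j, suffixQubits, a);
                out[rank * numLocal + j] = amps[partner * numLocal + jPartner];
            }
        }
    }
    return out;
}
```

Because the labels are distinct, equality of the two output arrays shows that the fused pass realises exactly the same permutation as the sequential swaps for the chosen qubits.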

Verified against QuEST's deployment-independent brute-force reference
(tests/utils), which QUEST_TEST_TRY_ALL_DEPLOYMENTS runs across
serial / OMP / GPU / MPI:

Deployment	Filter	Result
Serial + OpenMP	`PartialTrace,CompMatr`	pass
CUDA sm_120	`calcPartialTrace,applyCompMatr2,applyCompMatr`	All passed (20 780 assertions)
CPU + MPI, np = 2	`calcPartialTrace,applyCompMatr2,applyCompMatr,applySwap`	All passed (7 647)
CPU + MPI, np = 4 (fuses k = 2)	same	All passed (3 333)
CPU + MPI, np = 8 (fuses k = 3)	same	All passed (1 179)
CPU + MPI + OMP, np = 4, 3 threads	same	All passed (6 637)
fused vs sequential toggle, np = 4	`calcPartialTrace,applyCompMatr2`	both All passed (1 148)
GPU + MPI	—	compiles cleanly

Performance

Micro-benchmark (benchmarks/): applyCompMatr (identity, to isolate
movement) on a forced-distributed 24-qubit statevector; k = log2(np) global
qubits are swapped in each call. Intel i5-13600K, OpenMPI 4.1.1, shared-memory
transport:

np	k	fused (µs/call)	sequential (µs/call)	Δ (fused vs sequential)	data moved (fused / sequential)
4	2	349 204	323 770	+7.9 %	0.75 N / 1.0 N
8	3	377 673	377 410	~0 %	0.875 N / 1.5 N
16	4	305 346	343 795	−11.2 %	0.94 N / 2.0 N

The sequential data volume grows linearly in k (k/2·N), whereas the fused path
stays at one round and its data volume saturates at N. Even on a single node,
where the round-count (latency) advantage is entirely absent because a "round"
is a local memcpy, fused is 7.9 % slower at k = 2, breaks even at k = 3, and is
11.2 % faster at k = 4. On real multi-node interconnects, the per-round
network-latency savings are expected to make fusion advantageous at every k
(consistent with the cited literature), although that was not measured here.
benchmarks/README.md analyses this in full.

Notes for reviewers

- k ≤ 1 is byte-for-byte the old path; the staging buffer and all new comm
  only engage for k ≥ 2 distributed registers.
- The GPU fused comm-staging path is compiled and the CUDA kernels are
  numerically tested on sm_120; multi-rank GPU runtime is limited here to one
  physical GPU, so the multi-node GPU path is compile-verified while the
  multi-node CPU path is runtime-verified at np = 1/2/4/8.
- The disabled meta-control optimisation at the old call site is left disabled.

AI-assistance disclosure

Per the unitaryHACK AI policy: I used an AI
coding assistant (Anthropic Claude) to help draft the fused-swap algorithm
derivation, scaffold the pack/unpack kernels by mirroring QuEST's existing
kernel_statevec_packAmpsIntoBuffer / kernel_statevec_anyCtrlSwap_subB, and
to help write this description and the benchmark harness. I then manually
reviewed, verified, and tested every change myself:

- I can explain each modified routine and the sub-block-transpose derivation in a
  live review.
- All changes were compiled and run on real hardware: CPU, OpenMP, CUDA
  sm_120 (Blackwell, CUDA 13.0), and CPU+MPI at np = 1/2/4/8. They were validated
  against QuEST's deployment-independent brute-force reference (the pass counts
  in the Correctness table above are my own runs).
- The fused-vs-sequential equivalence and the performance numbers were produced
  by me on the machine described in the benchmark section, not generated text.

No part of this PR was submitted unverified straight from an LLM.

…(GPU + controlled)

Replace the sequential per-swap loop in anyCtrlMultiSwapBetweenPrefixAndSuffix with a single personalized all-to-all exchange over the 2^k-rank subcube, sending each amplitude directly to its final node in one communication round instead of relocating it once per swap.

- localiser: fusedMultiSwapBetweenPrefixAndSuffix (k>=2); k<=1 unchanged; retain multiSwapSequentially + QUEST_DISABLE_SWAP_FUSION benchmark toggle
- comm: comm_exchangeAmpsToBuffersForFusedSwap (CPU / GPU-direct / GPU-staged) + exchangeArraysWithMultiplePartners (batched Isend/Irecv + one Waitall)
- cpu/gpu: OpenMP and CUDA pack/unpack reusing existing tested kernels
- accelerator: dispatch + persistent lazily-grown staging workspace
- environment: free the staging workspace at finalizeQuESTEnv

Controlled path and GPU kernels included; complements CPU-only PRs QuEST-Kit#785/QuEST-Kit#786. Verified on CPU, OpenMP, CUDA sm_120, and MPI (np 1/2/4/8); GPU+MPI compiles.

Closes QuEST-Kit#595

thedaemon-wizard wants to merge 1 commit into QuEST-Kit:devel from thedaemon-wizard:swap-fusion-595
