Fuse distributed prefix-suffix multi-SWAPs into one all-to-all round (GPU + controlled) #790
Open
thedaemon-wizard wants to merge 1 commit into
…(GPU + controlled)
Replace the sequential per-swap loop in anyCtrlMultiSwapBetweenPrefixAndSuffix with a single personalized all-to-all exchange over the 2^k-rank subcube, sending each amplitude directly to its final node in one communication round instead of relocating it once per swap.
- localiser: fusedMultiSwapBetweenPrefixAndSuffix (k>=2); k<=1 unchanged; retain multiSwapSequentially + QUEST_DISABLE_SWAP_FUSION benchmark toggle
- comm: comm_exchangeAmpsToBuffersForFusedSwap (CPU / GPU-direct / GPU-staged) + exchangeArraysWithMultiplePartners (batched Isend/Irecv + one Waitall)
- cpu/gpu: OpenMP and CUDA pack/unpack reusing existing tested kernels
- accelerator: dispatch + persistent lazily-grown staging workspace
- environment: free the staging workspace at finalizeQuESTEnv
Controlled path and GPU kernels included; complements CPU-only PRs QuEST-Kit#785/QuEST-Kit#786. Verified on CPU, OpenMP, CUDA sm_120, and MPI (np 1/2/4/8); GPU+MPI compiles.
Closes QuEST-Kit#595
Fuse distributed prefix↔suffix multi-SWAPs into a single all-to-all round (GPU + controlled)
Closes #595
Summary
QuEST relocates prefix (global / inter-node) qubits into the suffix (local) region before applying a many-target dense matrix (applyCompMatr, applyMultiQubitMatr) or a partial trace on a distributed register. This "cache blocking" is performed by the internal anyCtrlMultiSwapBetweenPrefixAndSuffix, whose own @todo asks for the k disjoint prefix↔suffix SWAPs to be fused: historically they ran one at a time, so an amplitude relocated to another node by one SWAP is relocated again by the next, costing k communication rounds and up to k/2·N amplitudes moved.
This PR replaces that loop with a single personalized all-to-all exchange over the 2ᵏ-rank subcube, sending each amplitude directly to its final node in one round (cf. cuStateVec distributed index-bit swap, mpiQulacs [arXiv:2203.16044], [arXiv:2311.01512] §IV C–D, [arXiv:quant-ph/0608239]).
It is deliberately scoped to complement, not duplicate, the two excellent
CPU-only PRs already open against this issue:
- … waves within the existing cpuCommBuffer. (Maintainer: "This is a beautiful diff! 🎉")
- … explicitly defers the async single round and GPU kernels as follow-ups "because I have no CUDA hardware to compile/test them."
This PR contributes exactly the pieces those two leave open, fully compiled
and tested on real hardware:
- … on an NVIDIA Blackwell sm_120 GPU (CUDA 13.0), the precise piece #785 ("Fuse distributed prefix-suffix multi-SWAP (closes #595)") could not build.
- … states intact (both competitors keep the controlled case on the slow per-swap path).
- … Isend/Irecv + one MPI_Waitall), with a persistent, lazily-grown staging workspace mirroring QuEST's existing gpuCache and the workspaces of cuStateVec / mpiQulacs.
If the maintainers prefer #786 for the CPU core, the GPU kernels and the controlled-path handling here can be rebased on top of it.
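As an illustration of what "persistent, lazily-grown" means for the staging workspace mentioned in this summary, here is a minimal Python model. It is only a sketch of the allocation policy; the class and method names (StagingWorkspace, acquire, free) are hypothetical and do not correspond to QuEST's actual accelerator code.

```python
class StagingWorkspace:
    """Toy model of a grow-only scratch buffer reused across calls.

    Hypothetical names; this models the policy only, not QuEST's implementation.
    """

    def __init__(self):
        self._buffer = None      # nothing is allocated until first use (lazy)
        self.allocations = 0     # counts how often we had to (re)allocate

    @property
    def capacity(self):
        return 0 if self._buffer is None else len(self._buffer)

    def acquire(self, num_amps):
        """Return a buffer holding at least num_amps amplitudes."""
        if self.capacity < num_amps:
            # Grow only when the request exceeds the current capacity.
            self._buffer = [0j] * num_amps
            self.allocations += 1
        return self._buffer

    def free(self):
        """Release the buffer, as would happen once at environment teardown."""
        self._buffer = None
```

Repeated calls with the same or a smaller size reuse the existing buffer, so the allocation cost is paid only when a larger exchange is first requested.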
What changed
anyCtrlMultiSwapBetweenPrefixAndSuffix now collects the disjoint prefix↔suffix pairs and:
- k = 0 → returns;
- k = 1 → delegates to the existing single-swap routine (zero behavioural change, no staging buffer);
- k ≥ 2 → calls the new fusedMultiSwapBetweenPrefixAndSuffix.
The fused algorithm (sub-block transpose over a 2ᵏ subcube)
Ranks differing from ours only in the swapped prefix qubits' rank-bits form a 2ᵏ-rank subcube. Label a rank by its k-bit address a. A local amplitude j with swapped-suffix pattern v maps to rank v, new suffix pattern a:
- v == a (own address) stays put, untouched: no work;
- for each of the 2ᵏ−1 partner ranks, the local amplitudes whose suffix pattern equals the partner's address are packed, sent, and the received block is written back into the same local indices;
- the 2ᵏ−1 index sets are disjoint, so all exchanges run concurrently in one round.
Total moved = (1 − 2⁻ᵏ)·N (vs sequential k/2·N); peak extra memory is one staging buffer ≤ N per node, reused across calls and freed at finalizeQuESTEnv.
Files
- core/localiser.cpp: fusedMultiSwapBetweenPrefixAndSuffix; retain multiSwapSequentially as reference + QUEST_DISABLE_SWAP_FUSION benchmark toggle
- comm/comm_routines.{cpp,hpp}: comm_exchangeAmpsToBuffersForFusedSwap (CPU / GPU-direct / GPU-staged) + exchangeArraysWithMultiplePartners (single Isend/Irecv batch + one MPI_Waitall)
- cpu/cpu_subroutines.{cpp,hpp}
- gpu/gpu_subroutines.{cpp,hpp}: kernel_statevec_packAmpsIntoBuffer / kernel_statevec_anyCtrlSwap_subB
- core/accelerator.{cpp,hpp}
- api/environment.cpp
No public API change; the four existing call sites are untouched.
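The sub-block transpose described under "The fused algorithm" can be checked on a toy model. The sketch below is plain Python, not QuEST code: each MPI rank is modelled as a list, "sending" is a list copy, and the helper names (get_bits, set_bits, sequential_swaps, fused_swap) are hypothetical. It assumes global index = (rank << num_suffix) | local index, and that each pair is (prefix qubit, suffix qubit).

```python
def get_bits(x, positions):
    """Collect the bits of x at the given positions into a small integer."""
    return sum(((x >> p) & 1) << i for i, p in enumerate(positions))


def set_bits(x, positions, value):
    """Return x with the bits at the given positions overwritten by value."""
    for i, p in enumerate(positions):
        x = (x & ~(1 << p)) | (((value >> i) & 1) << p)
    return x


def sequential_swaps(nodes, num_suffix, pairs):
    """Reference: apply each prefix<->suffix SWAP in turn on the global state."""
    flat = [amp for node in nodes for amp in node]
    for pre, suf in pairs:
        new = list(flat)
        for g in range(len(flat)):
            b_pre, b_suf = (g >> pre) & 1, (g >> suf) & 1
            # Exchange the two bit values to find the destination index.
            new[set_bits(g, [pre, suf], b_suf | (b_pre << 1))] = flat[g]
        flat = new
    size = 1 << num_suffix
    return [flat[r * size:(r + 1) * size] for r in range(len(nodes))]


def fused_swap(nodes, num_suffix, pairs):
    """One-round model: every rank exchanges one block with each partner."""
    rank_bits = [pre - num_suffix for pre, _ in pairs]
    suf_bits = [suf for _, suf in pairs]
    out = [list(node) for node in nodes]
    moved = 0
    for rank, node in enumerate(nodes):
        a = get_bits(rank, rank_bits)          # this rank's k-bit address
        for b in range(1 << len(pairs)):
            if b == a:
                continue                       # own-address block stays put
            partner = set_bits(rank, rank_bits, b)
            # Local indices whose suffix pattern equals the partner's address:
            # these are sent, and the received block overwrites them.
            mine = [j for j in range(len(node)) if get_bits(j, suf_bits) == b]
            # The partner packs its amplitudes whose pattern equals our address.
            theirs = [j for j in range(len(node)) if get_bits(j, suf_bits) == a]
            for j, src in zip(mine, theirs):
                out[rank][j] = nodes[partner][src]
            moved += len(mine)
    return out, moved
```

For example, with 4 ranks of 8 amplitudes each and pairs [(3, 0), (4, 2)], fused_swap returns the same layout as sequential_swaps and reports 24 moved amplitudes, which matches (1 − 2⁻ᵏ)·N for k = 2 and N = 32.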
Correctness
Disjoint single-pair swaps commute (already relied upon here), so the fused result is identical to the sequential one. The QUEST_DISABLE_SWAP_FUSION=1 toggle runs the old sequential path for direct comparison. Density-matrix callers (partial trace) pass ket+bra targets exactly as before; controls pass straight through.
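The commutation argument can be illustrated at the level of index bits. This is a small sketch with hypothetical helper names (swap_bits, apply_swaps, order_independent), checking by brute force that swaps on disjoint qubit pairs give the same destination index in every order.

```python
from itertools import permutations


def swap_bits(index, q1, q2):
    """Exchange bits q1 and q2 of index."""
    if ((index >> q1) & 1) != ((index >> q2) & 1):
        index ^= (1 << q1) | (1 << q2)
    return index


def apply_swaps(index, pairs):
    """Apply the bit swaps in the given order."""
    for q1, q2 in pairs:
        index = swap_bits(index, q1, q2)
    return index


def order_independent(num_qubits, pairs):
    """True if every ordering of the swaps maps every index identically."""
    return all(
        len({apply_swaps(i, order) for order in permutations(pairs)}) == 1
        for i in range(1 << num_qubits)
    )
```

With disjoint pairs such as [(5, 0), (4, 2), (3, 1)] on 6 qubits the check passes; with overlapping pairs such as [(2, 0), (2, 1)] on 3 qubits it fails, which is why the fusion relies on the pairs being disjoint.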
Verified against QuEST's deployment-independent brute-force reference (tests/utils), which QUEST_TEST_TRY_ALL_DEPLOYMENTS runs across serial / OMP / GPU / MPI:
- *PartialTrace*, *CompMatr*
- calcPartialTrace, applyCompMatr2, applyCompMatr
- calcPartialTrace, applyCompMatr2, applyCompMatr, applySwap
- calcPartialTrace, applyCompMatr2
Performance
Micro-benchmark (benchmarks/): applyCompMatr (identity, to isolate movement) on a forced-distributed 24-qubit statevector; k = log2(np) global qubits are swapped in each call. Intel i5-13600K, OpenMPI 4.1.1, shared-memory transport:
The sequential cost grows linearly in k; fused stays at one round and its moved-data saturates at N. Even on a single node, where the round-count (latency) advantage is entirely absent because a "round" is a local memcpy, fused already wins once k ≥ 3–4. On real multi-node interconnects, the per-round network-latency savings make fusion advantageous at every k (consistent with the cited literature). benchmarks/README.md analyses this in full.
Notes for reviewers
- k ≤ 1 is byte-for-byte the old path; the staging buffer and all new comm only engage for k ≥ 2 distributed registers.
- … numerically tested on sm_120; multi-rank GPU runtime is limited here to one physical GPU, so the multi-node GPU path is compile-verified while the multi-node CPU path is runtime-verified at np = 1/2/4/8.
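The dispatch rule stated in this PR (nothing to do for k = 0, the existing single-swap routine for k = 1, the fused exchange for k ≥ 2, and the sequential reference path when QUEST_DISABLE_SWAP_FUSION is set) can be summarised as a small decision function. The function name and the returned labels are hypothetical, and how the toggle is actually read (environment variable or build option) is not modelled here.

```python
def choose_swap_path(num_pairs, fusion_disabled=False):
    """Return a label for the path taken, given k = num_pairs disjoint pairs."""
    if num_pairs == 0:
        return "none"            # nothing to swap
    if num_pairs == 1:
        return "single-swap"     # unchanged legacy routine, no staging buffer
    if fusion_disabled:
        return "sequential"      # reference path kept for benchmarking
    return "fused"               # one all-to-all round over the 2^k subcube
```

Under this rule the toggle only changes behaviour for k ≥ 2, which is consistent with the statement that k ≤ 1 follows the old path unchanged.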
AI-assistance disclosure
Per the unitaryHACK AI policy: I used an AI
coding assistant (Anthropic Claude) to help draft the fused-swap algorithm
derivation, scaffold the pack/unpack kernels by mirroring QuEST's existing
kernel_statevec_packAmpsIntoBuffer / kernel_statevec_anyCtrlSwap_subB, and
to help write this description and the benchmark harness. I then manually
reviewed, verified, and tested every change myself:
- … live review.
- … sm_120 (Blackwell, CUDA 13.0), and CPU+MPI at np = 1/2/4/8, and validated against QuEST's deployment-independent brute-force reference (the pass counts in the Correctness table above are my own runs).
- … by me on the machine described in the benchmark section, not generated text.
No part of this PR was submitted unverified straight from an LLM.