Fuse distributed prefix-suffix multi-SWAPs into one all-to-all round (GPU + controlled) #790

Open
thedaemon-wizard wants to merge 1 commit into QuEST-Kit:devel from thedaemon-wizard:swap-fusion-595

thedaemon-wizard commented Jun 11, 2026

Fuse distributed prefix↔suffix multi-SWAPs into a single all-to-all round (GPU + controlled)

Closes #595

Summary

QuEST relocates prefix (global / inter-node) qubits into the suffix (local)
region before applying a many-target dense matrix (applyCompMatr,
applyMultiQubitMatr) or a partial trace on a distributed register. This
"cache blocking" is performed by the internal
anyCtrlMultiSwapBetweenPrefixAndSuffix, whose own @todo asks for the
k disjoint prefix↔suffix SWAPs to be fused: historically they ran one at a
time, so an amplitude relocated to another node by one SWAP is relocated again
by the next — k communication rounds and up to k/2·N amplitudes moved.

This PR replaces that loop with a single personalized all-to-all exchange
over the 2ᵏ-rank subcube, sending each amplitude directly to its final node
in one round (cf. cuStateVec distributed index-bit swap, mpiQulacs
[arXiv:2203.16044], [arXiv:2311.01512] §IV C–D, [arXiv:quant-ph/0608239]).

It is deliberately scoped to complement, not duplicate, the two excellent
CPU-only PRs already open against this issue (#785 and #786).

This PR contributes exactly the pieces those two leave open, fully compiled
and tested on real hardware:

  1. GPU (CUDA) pack/unpack kernels for the fused swap, compiled and tested
    on an NVIDIA Blackwell sm_120 GPU (CUDA 13.0): the precise piece that
    #785 could not build.
  2. The controlled path — fusion is applied with arbitrary control qubits /
    states intact (both competitors keep the controlled case on the slow
    per-swap path).
  3. A true single all-to-all round (batched Isend/Irecv + one MPI_Waitall),
    with a persistent, lazily-grown staging workspace mirroring QuEST's existing
    gpuCache and the workspaces of cuStateVec / mpiQulacs.
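
The "persistent, lazily-grown staging workspace" in item 3 can be pictured with the sketch below. It is only an illustration: `StagingWorkspace`, `ensure`, and `release` are hypothetical names, not QuEST's actual types or functions, and the real workspace may live in device memory rather than a `std::vector`.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a persistent staging workspace (not QuEST's type).
class StagingWorkspace {
public:
    // Return a buffer of at least numAmps amplitudes, reallocating only
    // when a call asks for more than any previous call did.
    std::complex<double>* ensure(std::size_t numAmps) {
        if (numAmps > buffer.size()) {
            buffer.resize(numAmps);
            ++numGrowths;
        }
        assert(buffer.size() >= numAmps);
        return buffer.data();
    }

    // Drop the memory, e.g. once at environment teardown.
    void release() {
        buffer.clear();
        buffer.shrink_to_fit();
    }

    std::size_t capacity() const { return buffer.size(); }
    int growths() const { return numGrowths; }

private:
    std::vector<std::complex<double>> buffer;
    int numGrowths = 0;
};
```

Because the buffer only ever grows, repeated fused swaps on registers of the same size pay the allocation cost once.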

If the maintainers prefer #786 for the CPU core, the GPU kernels and the
controlled-path handling here can be rebased on top of it.

What changed

anyCtrlMultiSwapBetweenPrefixAndSuffix now collects the disjoint
prefix↔suffix pairs and:

  • k = 0 → returns; k = 1 → delegates to the existing single-swap routine
    (zero behavioural change, no staging buffer);
  • k ≥ 2 → calls the new fusedMultiSwapBetweenPrefixAndSuffix.
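
The dispatch above can be sketched roughly as follows. All names here (`SwapPath`, `SwapPlan`, `planMultiSwap`, `numSuffixQubits`) are hypothetical and chosen for illustration. The sketch also assumes that qubits below `numSuffixQubits` are local, that identical target pairs are skipped as no-ops, and that every other pair crosses the prefix/suffix boundary.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is not QuEST's internal API.
enum class SwapPath { None, SingleSwap, Fused };

struct SwapPlan {
    SwapPath path;
    std::vector<std::pair<int, int>> pairs;  // (suffix qubit, prefix qubit)
};

// Qubits with index < numSuffixQubits are assumed local (suffix);
// all others are assumed to be prefix (inter-node) qubits.
SwapPlan planMultiSwap(const std::vector<int>& targsA,
                       const std::vector<int>& targsB,
                       int numSuffixQubits) {
    assert(targsA.size() == targsB.size());
    SwapPlan plan{SwapPath::None, {}};

    for (std::size_t i = 0; i < targsA.size(); ++i) {
        if (targsA[i] == targsB[i])
            continue;  // swapping a qubit with itself is a no-op
        int lo = std::min(targsA[i], targsB[i]);
        int hi = std::max(targsA[i], targsB[i]);
        // Assumption: every remaining pair crosses the suffix/prefix boundary.
        assert(lo < numSuffixQubits && hi >= numSuffixQubits);
        plan.pairs.emplace_back(lo, hi);
    }

    // k = 0: nothing to do; k = 1: existing single-swap routine; k >= 2: fused.
    if (plan.pairs.size() == 1)
        plan.path = SwapPath::SingleSwap;
    else if (plan.pairs.size() >= 2)
        plan.path = SwapPath::Fused;
    return plan;
}
```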

The fused algorithm (sub-block transpose over a 2ᵏ subcube)

Ranks differing from ours only in the swapped prefix qubits' rank-bits form a
2ᵏ-rank subcube. Label each rank in the subcube by its k-bit address a. A local
amplitude at index j whose swapped suffix bits form the pattern v moves to the
rank with address v, where its swapped suffix bits take the pattern a:

  • the block with v == a (own address) stays put, untouched — no work;
  • for each of the 2ᵏ−1 partner ranks, the local amplitudes whose suffix
    pattern equals the partner's address are packed, sent, and the received block
    is written back into the same local indices;
  • the 2ᵏ−1 index sets are disjoint, so all exchanges run concurrently in
    one round.
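
The sub-block transpose can be checked in isolation with a small in-process model that needs neither MPI nor a GPU. The sketch below is not the PR's implementation: it stores every rank's local array in one `std::vector`, tags each amplitude with an integer, and uses hypothetical helper names (`extractBits`, `depositBits`, `fusedSwap`, `sequentialSwaps`, `makeTaggedState`). Pair i swaps local qubit `suffixQubits[i]` with rank bit `prefixBits[i]`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using State = std::vector<std::vector<int>>;  // state[rank][localIndex]

// Gather the bits of value at the given positions into a k-bit pattern.
int extractBits(int value, const std::vector<int>& positions) {
    int pattern = 0;
    for (std::size_t i = 0; i < positions.size(); ++i)
        pattern |= ((value >> positions[i]) & 1) << i;
    return pattern;
}

// Overwrite the bits of value at the given positions with a k-bit pattern.
int depositBits(int value, const std::vector<int>& positions, int pattern) {
    for (std::size_t i = 0; i < positions.size(); ++i) {
        value &= ~(1 << positions[i]);
        value |= ((pattern >> i) & 1) << positions[i];
    }
    return value;
}

// Give every amplitude a unique integer tag so that movement can be tracked.
State makeTaggedState(int numRanks, int numLocal) {
    State state(numRanks, std::vector<int>(numLocal));
    for (int r = 0; r < numRanks; ++r)
        for (int j = 0; j < numLocal; ++j)
            state[r][j] = r * numLocal + j;
    return state;
}

// Reference: apply the k prefix<->suffix swaps one at a time.
State sequentialSwaps(State state, const std::vector<int>& suffixQubits,
                      const std::vector<int>& prefixBits) {
    for (std::size_t i = 0; i < suffixQubits.size(); ++i) {
        State next = state;
        for (std::size_t r = 0; r < state.size(); ++r)
            for (std::size_t j = 0; j < state[r].size(); ++j) {
                int sBit = (static_cast<int>(j) >> suffixQubits[i]) & 1;
                int pBit = (static_cast<int>(r) >> prefixBits[i]) & 1;
                if (sBit != pBit)  // swapping two differing bits flips both
                    next[r ^ (1 << prefixBits[i])][j ^ (1 << suffixQubits[i])] = state[r][j];
            }
        state = next;
    }
    return state;
}

// Fused: each rank fills the block with suffix pattern v from the partner at
// address v, which supplies its own block with suffix pattern a (our address).
State fusedSwap(const State& in, const std::vector<int>& suffixQubits,
                const std::vector<int>& prefixBits) {
    State out = in;
    int numPatterns = 1 << suffixQubits.size();
    for (std::size_t r = 0; r < in.size(); ++r) {
        int a = extractBits(static_cast<int>(r), prefixBits);
        for (int v = 0; v < numPatterns; ++v) {
            if (v == a) continue;  // own block stays put
            int partner = depositBits(static_cast<int>(r), prefixBits, v);
            for (std::size_t j = 0; j < in[r].size(); ++j)
                if (extractBits(static_cast<int>(j), suffixQubits) == v)
                    out[r][j] = in[partner][depositBits(static_cast<int>(j), suffixQubits, a)];
        }
    }
    return out;
}
```

With, say, 8 ranks of 8 amplitudes and the pairs (local qubit 0 ↔ rank bit 1) and (local qubit 2 ↔ rank bit 0), `fusedSwap` and `sequentialSwaps` should return the same state, while the third rank bit, which is not swapped, never changes.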

Total moved = (1 − 2⁻ᵏ)·N (vs sequential k/2·N); peak extra memory is one
staging buffer ≤ N per node, reused across calls and freed at
finalizeQuESTEnv.
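
Both totals follow from counting. Under the fused exchange an amplitude stays on its node only when its k-bit suffix pattern equals the rank's address, which holds for a fraction 2⁻ᵏ of the amplitudes. Under the sequential scheme each single swap moves the half of the amplitudes whose two swapped bits differ. A minimal sketch with hypothetical function names, returning the moved volume as a fraction of N:

```cpp
#include <cassert>
#include <cmath>

// Fraction of the N amplitudes that change node in the fused exchange:
// everything except the 2^-k share whose suffix pattern equals the address.
double fusedMovedFraction(int k) {
    assert(k >= 0);
    return 1.0 - std::ldexp(1.0, -k);  // 1 - 2^-k
}

// Fraction moved by k sequential swaps: each swap relocates half of the
// amplitudes, and the same amplitude may be counted in several swaps.
double sequentialMovedFraction(int k) {
    assert(k >= 0);
    return 0.5 * k;
}
```

For k = 2, 3, 4 this gives 0.75, 0.875, and 0.9375 for the fused path against 1.0, 1.5, and 2.0 for the sequential one.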

Files

| File | Change |
| --- | --- |
| core/localiser.cpp | rewrite entry point; add fusedMultiSwapBetweenPrefixAndSuffix; retain multiSwapSequentially as reference + QUEST_DISABLE_SWAP_FUSION benchmark toggle |
| comm/comm_routines.{cpp,hpp} | comm_exchangeAmpsToBuffersForFusedSwap (CPU / GPU-direct / GPU-staged) + exchangeArraysWithMultiplePartners (single Isend/Irecv batch + one MPI_Waitall) |
| cpu/cpu_subroutines.{cpp,hpp} | OpenMP pack/unpack (templated on #targs) |
| gpu/gpu_subroutines.{cpp,hpp} | CUDA pack/unpack reusing the existing, tested kernel_statevec_packAmpsIntoBuffer / kernel_statevec_anyCtrlSwap_subB |
| core/accelerator.{cpp,hpp} | CPU/GPU dispatch + persistent staging-workspace cache |
| api/environment.cpp | free the staging workspace at teardown |

No public API change; the four existing call sites are untouched.

Correctness

Disjoint single-pair swaps commute (already relied upon here), so the fused
result is identical to the sequential one. The QUEST_DISABLE_SWAP_FUSION=1
toggle runs the old sequential path for direct comparison. Density-matrix
callers (partial trace) pass ket+bra targets exactly as before; controls pass
straight through.
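
The commutation claim can be checked directly on index bits. The sketch below uses hypothetical helper names and treats a SWAP as a permutation of basis-state indices. It compares the two application orders for every index, and it also shows that the property depends on the pairs being disjoint.

```cpp
#include <cassert>

// Swap bits bitA and bitB of a basis-state index.
int swapBits(int index, int bitA, int bitB) {
    int a = (index >> bitA) & 1;
    int b = (index >> bitB) & 1;
    if (a != b)
        index ^= (1 << bitA) | (1 << bitB);  // differing bits: flip both
    return index;
}

// True if SWAP(a1,b1) and SWAP(a2,b2) give the same result in either order
// for every index of a numBits-bit register.
bool swapsCommute(int numBits, int a1, int b1, int a2, int b2) {
    assert(numBits > 0 && numBits < 31);
    for (int index = 0; index < (1 << numBits); ++index) {
        int firstThenSecond = swapBits(swapBits(index, a1, b1), a2, b2);
        int secondThenFirst = swapBits(swapBits(index, a2, b2), a1, b1);
        if (firstThenSecond != secondThenFirst)
            return false;
    }
    return true;
}
```

Here `swapsCommute(5, 0, 3, 1, 4)` holds because the pairs share no qubit, whereas `swapsCommute(5, 0, 3, 3, 1)` fails because both swaps touch qubit 3.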

Verified against QuEST's deployment-independent brute-force reference
(tests/utils), which QUEST_TEST_TRY_ALL_DEPLOYMENTS runs across
serial / OMP / GPU / MPI:

| Deployment | Filter | Result |
| --- | --- | --- |
| Serial + OpenMP | *PartialTrace*,*CompMatr* | pass |
| CUDA sm_120 | calcPartialTrace,applyCompMatr2,applyCompMatr | All passed (20 780 assertions) |
| CPU + MPI, np = 2 | calcPartialTrace,applyCompMatr2,applyCompMatr,applySwap | All passed (7 647) |
| CPU + MPI, np = 4 (fuses k = 2) | same | All passed (3 333) |
| CPU + MPI, np = 8 (fuses k = 3) | same | All passed (1 179) |
| CPU + MPI + OMP, np = 4, 3 threads | same | All passed (6 637) |
| fused vs sequential toggle, np = 4 | calcPartialTrace,applyCompMatr2 | both All passed (1 148) |
| GPU + MPI | | compiles cleanly |

Performance

Micro-benchmark (benchmarks/): applyCompMatr (identity, to isolate
movement) on a forced-distributed 24-qubit statevector; k = log2(np) global
qubits are swapped in each call. Intel i5-13600K, OpenMPI 4.1.1, shared-memory
transport:

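| np | k | fused µs/call | sequential µs/call | Δ (fused vs sequential) | data moved, fused / seq |
| --- | --- | --- | --- | --- | --- |
| 4 | 2 | 349 204 | 323 770 | +7.9 % | 0.75 N / 1.0 N |
| 8 | 3 | 377 673 | 377 410 | ~0 % | 0.875 N / 1.5 N |
| 16 | 4 | 305 346 | 343 795 | −11.2 % | 0.94 N / 2.0 N |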

The sequential path's data volume grows linearly in k (k/2·N), whereas the
fused path stays at one round and its moved data saturates just below N. Even
on a single node, where the round-count (latency) advantage is entirely absent
because a "round" is a local memcpy, fused is 7.9 % slower at k = 2, breaks
even at k = 3, and is 11.2 % faster at k = 4. On real multi-node interconnects,
the per-round network-latency savings are expected to make fusion advantageous
at every k (consistent with the cited literature). benchmarks/README.md
analyses this in full.

Notes for reviewers

  • k ≤ 1 is byte-for-byte the old path; the staging buffer and all new comm
    only engage for k ≥ 2 distributed registers.
  • The GPU comm-staging path for the fused swap compiles, and the CUDA kernels
    are numerically tested on sm_120. Only one physical GPU was available here,
    so the multi-node GPU path is compile-verified only, while the multi-node
    CPU path is runtime-verified at np = 1/2/4/8.
  • The disabled meta-control optimisation at the old call site is left disabled.

AI-assistance disclosure

Per the unitaryHACK AI policy: I used an AI
coding assistant (Anthropic Claude) to help draft the fused-swap algorithm
derivation, scaffold the pack/unpack kernels by mirroring QuEST's existing
kernel_statevec_packAmpsIntoBuffer / kernel_statevec_anyCtrlSwap_subB, and
to help write this description and the benchmark harness. I then manually
reviewed, verified, and tested every change myself:

  • I can explain each modified routine and the sub-block-transpose derivation in a
    live review.
  • All changes were compiled and run on real hardware: CPU, OpenMP, CUDA
    sm_120 (Blackwell, CUDA 13.0), and CPU+MPI at np = 1/2/4/8. They were
    validated against QuEST's deployment-independent brute-force reference (the
    pass counts in the Correctness table above are my own runs).
  • The fused-vs-sequential equivalence and the performance numbers were produced
    by me on the machine described in the benchmark section, not generated text.

No part of this PR was submitted unverified straight from an LLM.

…(GPU + controlled)

Replace the sequential per-swap loop in anyCtrlMultiSwapBetweenPrefixAndSuffix
with a single personalized all-to-all exchange over the 2^k-rank subcube,
sending each amplitude directly to its final node in one communication round
instead of relocating it once per swap.

- localiser: fusedMultiSwapBetweenPrefixAndSuffix (k>=2); k<=1 unchanged;
  retain multiSwapSequentially + QUEST_DISABLE_SWAP_FUSION benchmark toggle
- comm: comm_exchangeAmpsToBuffersForFusedSwap (CPU / GPU-direct / GPU-staged)
  + exchangeArraysWithMultiplePartners (batched Isend/Irecv + one Waitall)
- cpu/gpu: OpenMP and CUDA pack/unpack reusing existing tested kernels
- accelerator: dispatch + persistent lazily-grown staging workspace
- environment: free the staging workspace at finalizeQuESTEnv

Controlled path and GPU kernels included; complements CPU-only PRs QuEST-Kit#785/QuEST-Kit#786.
Verified on CPU, OpenMP, CUDA sm_120, and MPI (np 1/2/4/8); GPU+MPI compiles.

Closes QuEST-Kit#595