Skip to content

Fuse distributed prefix-suffix multi-SWAP (closes #595)#785

Open
zkasuran wants to merge 1 commit into
QuEST-Kit:develfrom
zkasuran:swap-fusion
Open

Fuse distributed prefix-suffix multi-SWAP (closes #595)#785
zkasuran wants to merge 1 commit into
QuEST-Kit:develfrom
zkasuran:swap-fusion

Conversation

@zkasuran

@zkasuran zkasuran commented Jun 8, 2026

Copy link
Copy Markdown

Summary

Closes #595. Fuses the distributed prefix<->suffix multi-SWAP so each amplitude crosses the network at most once.

When a multi-qubit gate targets qubits that live in the prefix (the index bits that select which node holds an amplitude), QuEST first swaps those qubits down into the suffix. The localiser did this one SWAP at a time, so an amplitude moved by the first SWAP was often moved again by the next, crossing the network several times. This change works out each amplitude's final node up front and sends it there directly.

The SWAPs in such a group act on disjoint qubit pairs, so they commute and compose into a single permutation of the index bits, which is what makes the direct routing well defined. For the uncontrolled case (every internal caller: applyCompMatr, applyCompMatr2, the partial-trace path) the routine enumerates the up to 2^eta - 1 destination nodes, one per non-empty subset of the eta prefix targets whose partnered suffix bit disagrees with this node's rank bit. For each it packs, exchanges and unpacks only the amplitudes bound there. The move is an involution between paired nodes, so packed and unpacked amplitudes sit in the same local slots.

Design notes

This follows the two constraints from the issue thread:

  • The amplitudes are physically moved to their final node. There is no virtual/physical wire-ordering layer.
  • Each amplitude crosses the network at most once, not async-overlapped per-SWAP exchanges.

New CPU kernel cpu_statevec_unpackAmpsFromBuffer is the inverse of the existing cpu_statevec_packAmpsIntoBuffer: an OpenMP scatter that writes the contiguous received sub-buffer back into the strided local amplitudes selected by several constrained qubits, via insertBitsWithMaskedValues, so it loops over O(amplitudes moved) and never over O(2^N).

Scope

CPU/OpenMP, which the issue notes is sufficient. GPU quregs and controlled multi-SWAPs keep the existing per-SWAP path, so the GPU build and its numerics are untouched. A GPU mirror of the kernel is written and ready as a follow-up, kept out of this PR because I have no CUDA hardware to compile it on and did not want this change to risk the GPU build.

Files: core/localiser.cpp, core/accelerator.cpp, core/accelerator.hpp, cpu/cpu_subroutines.cpp, cpu/cpu_subroutines.hpp.

Correctness

The fused routine must give bit-identical results to the per-SWAP path. The existing suites compare against an independent reference state and pass at 1, 2, 4 and 8 ranks:

np=1  All tests passed (21017 assertions in 4 test cases)
np=2  All tests passed (15265 assertions in 4 test cases)
np=4  All tests passed (6637 assertions in 4 test cases)
np=8  All tests passed (2329 assertions in 4 test cases)

(applySwap, applyCompMatr, applyCompMatr2, calcPartialTrace.)

Benchmark

Communication volume (exact, hardware independent). Measured by tallying amplitudes pushed through the sub-buffer exchange:

ranks eta fused / baseline reduction
4 2 0.750 25.0%
8 3 0.583 41.7%

The fused group moves a fraction 1 - 1/2^eta of the partition once, where the per-swap path relays it across eta exchanges (each moving half a partition, applied twice over the down-and-back swap). So the ratio is 2(1 - 1/2^eta)/eta: 3/4 at eta=2 and 7/12 ~ 0.583 at eta=3, matching the table.

Wall clock. I do not have a cluster, so I cannot measure the real multi-node speedup directly. On a single box, intra-node MPI is shared memory with no bandwidth limit, so the saved volume costs nothing and the routine is slower there. To measure the bandwidth-limited regime that distribution actually runs in, I forced MPICH off its shared-memory shortcut onto the TCP transport (MPIR_CVAR_NOLOCAL=1 FI_PROVIDER=tcp), which gives a genuine bandwidth-limited path between ranks on one machine. This is an emulation, not a real cluster. I flag it as such.

Single thread per node (mt=0, the case the issue asks to verify), eta=3, the speedup grows with state size as the saved volume starts to outweigh the fused routine's extra rounds:

state n total state baseline fused speedup fused faster in
27 2 GB 4.262 s 4.181 s +1.9% 4/6 trials
28 4 GB 7.656 s 7.682 s tie 4/6 trials
29 8 GB 17.157 s 16.499 s +3.8% 5/5 trials

At the largest state tested, single threaded, fused is faster on every trial. With OpenMP on (the realistic deployment, where the extra packing is parallelised away) the win is larger and cleaner: n=28, 8 ranks, fused 4.831 s vs baseline 5.425 s, +11%. For eta=2 the 25% volume cut is too small to beat the extra-round overhead single threaded and the result is a tie.

So the extra packing does not outweigh the comm saving: single threaded it is a wash at small state and a win at large state. Once threads or a real (slower than loopback) interconnect enter, it wins across the range. Happy to have this confirmed on a real cluster via CI.

AI disclosure

This change was implemented with substantial help from Claude (Anthropic), which drafted the fused routing in the localiser, the unpack kernel and the benchmark and proposed the subset-enumeration design. I reviewed the approach against the issue thread and the QuEST distributed paper (arXiv:2311.01512). I ran the tests at 1/2/4/8 ranks and the benchmark and own the change. Verified locally before submitting: the CPU/OpenMP suites green at 1/2/4/8 ranks, the comm-volume reduction measured directly and the wall-clock benchmark run under the emulated transport above. The GPU path was not compiled (no CUDA hardware) and is deliberately left out.

The localiser performed each prefix<->suffix SWAP in turn, so an amplitude
moved by one SWAP was often moved again by the next, crossing the network
several times. This fuses the group of disjoint SWAPs into one operation that
computes each amplitude's final node and sends it there directly, so every
amplitude crosses the network at most once.

The disjoint SWAPs commute and compose into a single bit permutation. For the
uncontrolled case (every internal caller) the routine enumerates the up to
2^eta-1 destination nodes and packs, exchanges and unpacks only the amplitudes
bound to each. A new cpu_statevec_unpackAmpsFromBuffer scatters the received
sub-buffer back into the strided local amplitudes, the inverse of the existing
packer, looping over moved amplitudes not the whole state.

Scope is CPU/OpenMP. GPU quregs and controlled multi-SWAPs keep the existing
per-SWAP path, so the GPU build is unchanged.

Comm volume drops 25% at eta=2 and 42% at eta=3 (1 - 1/2^eta), matching theory.
Existing applySwap, applyCompMatr, applyCompMatr2 and calcPartialTrace suites
pass at 1, 2, 4 and 8 ranks.
@TysonRayJones

Copy link
Copy Markdown
Member

Now this is a beautiful diff!! 🎉 🎉 (I just gave that praise elsewhere prematurely ehehe). Are you able to share the code you used for benchmarking? Feel free to paste it here as a comment, or add the file to your diff (we can remove it later).

The results are promising. I expect the packing cost to be totally occluded in settings where local amp movement is much smaller than communication (as is typical). Otherwise, I note here (mostly for posterity) that some further optimizations may be possible. Namely, the serial loop over numSubsets could be parallelized.

Presently, the logic is

numSubsets = powerOf2(numSwaps)

for each subset:

    find pair rank

    pack targeted amps (unique to the pair rank) into buffer

    exchange sub-buffers

    unpack amps from buffer

Since numSwaps <= log2(numProcesses), we know numSubsets <= numProcesses, and ergo this loop may launch as many sequential rounds of communication as exist processes (which is still a huge improvement over the prior method!).
But these rounds of communication need not be sequential.

We could instead...

  1. Pack all to-be-moved amps into the communication buffer, reordered as per their ranks, before proceeding to communication. This can still use the current serial loop (invoking accel_statevec_packAmpsIntoBuffer()), since expected much smaller than the loop inside packAmpsIntoBuffer(), although a bespoke multithreaded function may be faster.

    Beware we can only pack half of the buffer capacity - the remaining half must receive the amps from the pair rank - so we might have to perform minimum two of the above sequential rounds of communication. Needs more thought! Certainly in the general case, if multi-SWAP were exposed to the user and can target any set of qubits (and ergo only prefix qubits), we cannot pack all to-be-communicated amplitudes at once.

  2. Launch each of the numSubsets-many exchanges asynchronously, maximising traffic.
  3. Wait for all exchanges to succeed.
  4. Update all the local amplitudes.

This sees the same total data sent over the network as the current solution, but reduces the number of syncs, and keeps caches "hot". It's then obvious how to optimise this scheme further - avoid the global sync, and have each process start updating one pair rank's worth of amplitudes as soon as it is received. Some or all threads can be dedicated to each, overlapping communication and processing. This has little benefit when the network is always much much slower than local processing (the overlap hides little), but great benefit when the network is liable to saturation and slowdown (as is the current solution, I note). This next layer of optimisation may prove "too deep" an optimisation for QuEST"'s software architecture however 🙏

Overall a great PR! I'll investigate this ASAP

@zkasuran

zkasuran commented Jun 9, 2026

Copy link
Copy Markdown
Author

Thanks! Benchmark code below. Two pieces: the wall-clock driver (what the speedup
numbers come from, compiles against stock QuEST) and the small comm-volume counter I
used to measure the exact amplitude reduction (throwaway, kept out of the diff).

Wall-clock driver

bench_multiswap.cpp, timing applyCompMatr on the top k qubits so all k land
in the prefix and force the fused multi-SWAP. Same binary is linked once against this
branch and once against devel to compare fused vs per-swap.

/* Times applyCompMatr on the highest k qubits, which forces a prefix<->suffix
 * multi-SWAP each call. Link against the fused branch and against devel to compare.
 * QuEST-Kit/QuEST #595. */

#include "quest.h"
#include <vector>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cmath>

static int popcountLong(long x) { int c=0; while (x){ c+=(int)(x&1L); x>>=1; } return c; }

int main(int argc, char** argv) {
    initQuESTEnv();

    int n    = (argc > 1) ? atoi(argv[1]) : 28;  // total qubits
    int k    = (argc > 2) ? atoi(argv[2]) : 4;   // targeted qubits (placed at the top)
    int reps = (argc > 3) ? atoi(argv[3]) : 20;  // timed repetitions
    int mt   = (argc > 4) ? atoi(argv[4]) : 1;   // isMultithreaded (0 = single-thread node)

    Qureg q = createCustomQureg(n, 0, /*isDistrib*/ 1, /*isGpu*/ 0, /*isMultithread*/ mt);
    initZeroState(q);

    // k-qubit Hadamard tensor: a unitary whose square is identity, so repeated
    // application stays normalised while exercising the same communication.
    CompMatr m = createCompMatr(k);
    long dim = 1L << k;
    qreal norm = (qreal) pow(1.0 / sqrt(2.0), k);
    for (long i = 0; i < dim; i++)
        for (long j = 0; j < dim; j++)
            m.cpuElems[i][j] = qcomp((popcountLong(i & j) & 1) ? -norm : norm, 0);
    syncCompMatr(m);

    std::vector<int> targs;            // highest k qubits -> max prefix targets
    for (int i = 0; i < k; i++) targs.push_back(n - 1 - i);

    applyCompMatr(q, targs, m);        // warm up
    syncQuESTEnv();

    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; r++) applyCompMatr(q, targs, m);
    syncQuESTEnv();
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count() / reps;
    QuESTEnv env = getQuESTEnv();
    if (env.rank == 0)
        printf("n=%d k=%d nodes=%d mt=%d reps=%d time_per_apply=%.6f s\n",
               n, k, env.numNodes, mt, reps, secs);

    destroyCompMatr(m);
    destroyQureg(q);
    finalizeQuESTEnv();
    return 0;
}

Run (single box standing in for a bandwidth-limited link, see note):

# fused build vs devel build, single-thread node (mt=0), eta=3 (np=8)
export MPIR_CVAR_NOLOCAL=1 FI_PROVIDER=tcp   # MPICH: treat each rank as a separate
                                             # node + route over TCP, so the saved
                                             # comm volume is actually bandwidth-bound
mpirun -np 8 ./bench_fused    29 4 2 0
mpirun -np 8 ./bench_baseline 29 4 2 0

One caveat I want to be upfront about: I only had a single physical machine, where
intra-node MPI is a shared-memory copy with effectively no bandwidth limit, so the
volume saving is invisible and the extra pack/unpack rounds dominate. To measure the
regime distribution actually targets I forced MPICH off shared memory
(MPIR_CVAR_NOLOCAL=1) and onto the OFI tcp provider, which gives a real
bandwidth-limited TCP path between ranks on one box (~1 GB/s here, comparable to
10 GbE). Correctness is identical; only the transport changes. So the wall-clock
figures are a single-box TCP emulation, not a true cluster. A CI run on real
multi-node hardware would confirm the asymptotic win directly.

Single-thread (mt=0), TCP transport, k=4, eta=3 (np=8), means over repeated trials on
a contended 22-core box (noise ~+-10%):

state n total state baseline (s) fused (s) speedup fused faster in
27 2 GB 4.262 4.181 +1.9% 4/6 trials
28 4 GB 7.656 7.682 -0.3% 4/6 trials
29 8 GB 17.157 16.499 +3.8% 5/5 trials

The win grows with state size: there is a crossover below which the fused routine's
extra rounds cost about what the saved volume buys and above which the volume
reduction dominates and fused is faster even single-threaded. With OpenMP on (mt=1,
the realistic deployment, extra packing parallelised away) it is cleaner: n=28, np=8,
fused 4.831 s vs baseline 5.425 s, +11%. On any link slower than loopback the
crossover moves further down.

Exact comm-volume reduction (throwaway counter)

The volume cut itself is exact and hardware-independent. I measured it with a tiny
counter inside the sub-buffer exchange, which I kept out of the PR diff. To reproduce,
add to quest/src/comm/comm_routines.cpp:

static qindex sub_buffer_amp_tally = 0;
qindex comm_getSubBufferAmpTally()   { return sub_buffer_amp_tally; }
void   comm_resetSubBufferAmpTally() { sub_buffer_amp_tally = 0; }

and one line at the top of comm_exchangeSubBuffers(Qureg, qindex numAmps, int):

    sub_buffer_amp_tally += numAmps;   // benchmark scaffolding only

Tally numAmps per exchange, divide by reps and the per-rank amplitudes moved come
out as (n=26, but the ratio is n-independent):

ranks eta baseline amps/rank fused amps/rank fused/baseline reduction
4 2 33 554 432 25 165 824 0.750 25.0%
8 3 25 165 824 14 680 064 0.583 41.7%

The mechanism: per applyCompMatr the targets are swapped down and back, so the
baseline relays each partition eta times in total (each of the eta per-swap
exchanges moves half a partition, twice over), while the fused routine moves a
fraction 1 - 1/2^eta of the partition exactly once. The ratio is therefore
(1 - 1/2^eta) / (eta/2) = 2(1 - 1/2^eta)/eta: 3/4 at eta=2 and 7/12 ~ 0.583 at
eta=3, matching the measured column. (1 - 1/2^eta on its own is the fraction the
fused step moves, 0.75 and 0.875, not the reduction.)

On the parallelisation notes: agreed and thanks for writing them out. Packing all
to-be-moved amps up front and launching the numSubsets exchanges asynchronously
(then a single wait) is the natural next step, with the half-buffer-capacity
constraint you flag meaning at least two rounds in the general prefix-only case. Happy
to look at that as a follow-up if you'd like it in scope here or leave this PR as the
correctness+volume win and do the async layer separately.

One more note unrelated to the above: the two red Linux [2] CUDA checks are failing in the Setup CUDA (non-MSVC) step (the Jimver/cuda-toolkit action exits with sudo: code 100), before anything is configured or compiled. The Linux [1] CUDA jobs in the same run pass, so it reads as a runner flake rather than the diff (which does not touch the GPU path). A re-run of those two jobs should clear them.

@zkasuran

zkasuran commented Jun 9, 2026

Copy link
Copy Markdown
Author

The two red checks are both Linux [2] CUDA jobs (CUDA and CUDA CUQ), and both failed in the cuda-toolkit install step before anything compiled. The NVIDIA apt mirror was mid-resync when the runner hit it:

E: Failed to fetch http://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/Packages  File has unexpected size (1552907 != 1554020). Mirror sync in progress?
##[error]Error: The process '/usr/bin/sudo' failed with exit code 100

Every other CUDA and CUDA+cuQuantum job passed (128 green), only the Linux[2] runner caught the bad mirror window, so this is unrelated to the diff. A re-run of the two failed jobs should clear it. I do not have re-run rights on the repo so flagging it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants