Fuse distributed prefix-suffix multi-SWAP (closes #595) #785
The localiser performed each prefix<->suffix SWAP in turn, so an amplitude moved by one SWAP was often moved again by the next, crossing the network several times. This fuses the group of disjoint SWAPs into one operation that computes each amplitude's final node and sends it there directly, so every amplitude crosses the network at most once. The disjoint SWAPs commute and compose into a single bit permutation. For the uncontrolled case (every internal caller) the routine enumerates the up to 2^eta - 1 destination nodes and packs, exchanges and unpacks only the amplitudes bound to each. A new cpu_statevec_unpackAmpsFromBuffer scatters the received sub-buffer back into the strided local amplitudes. It is the inverse of the existing packer and loops over the moved amplitudes, not the whole state. Scope is CPU/OpenMP. GPU quregs and controlled multi-SWAPs keep the existing per-SWAP path, so the GPU build is unchanged. Comm volume drops 25% at eta=2 and 42% at eta=3, matching the theoretical fused-to-per-SWAP ratio of 2(1 - 1/2^eta)/eta. Existing applySwap, applyCompMatr, applyCompMatr2 and calcPartialTrace suites pass at 1, 2, 4 and 8 ranks.
Now this is a beautiful diff!! 🎉 🎉 (I just gave that praise elsewhere prematurely ehehe). Are you able to share the code you used for benchmarking? Feel free to paste it here as a comment, or add the file to your diff (we can remove it later).
The results are promising. I expect the packing cost to be totally occluded in settings where local amp movement is much smaller than communication (as is typical).
Otherwise, I note here (mostly for posterity) that some further optimizations may be possible. Namely, the serial loop over
Presently, the logic is
    numSubsets = powerOf2(numSwaps)
    for each subset:
        find pair rank
        pack targeted amps (unique to the pair rank) into buffer
        exchange sub-buffers
        unpack amps from buffer
Since
We could instead...
This sees the same total data sent over the network as the current solution, but reduces the number of syncs, and keeps caches "hot". It's then obvious how to optimise this scheme further: avoid the global sync, and have each process start updating one pair rank's worth of amplitudes as soon as it is received. Some or all threads can be dedicated to each, overlapping communication and processing. This has little benefit when the network is always much, much slower than local processing (the overlap hides little), but great benefit when the network is liable to saturation and slowdown (as is the current solution, I note). This next layer of optimisation may prove "too deep" an optimisation for QuEST's software architecture however 🙏
Overall a great PR! I'll investigate this ASAP
Thanks! Benchmark code below. Two pieces: the wall-clock driver (what the speedup
Wall-clock driver
    /* Times applyCompMatr on the highest k qubits, which forces a prefix<->suffix
     * multi-SWAP each call. Link against the fused branch and against devel to compare.
     * QuEST-Kit/QuEST #595. */
    #include "quest.h"
    #include <vector>
    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <cmath>

    static int popcountLong(long x) { int c=0; while (x){ c+=(int)(x&1L); x>>=1; } return c; }

    int main(int argc, char** argv) {
        initQuESTEnv();
        int n    = (argc > 1) ? atoi(argv[1]) : 28;  // total qubits
        int k    = (argc > 2) ? atoi(argv[2]) : 4;   // targeted qubits (placed at the top)
        int reps = (argc > 3) ? atoi(argv[3]) : 20;  // timed repetitions
        int mt   = (argc > 4) ? atoi(argv[4]) : 1;   // isMultithreaded (0 = single-thread node)

        Qureg q = createCustomQureg(n, 0, /*isDistrib*/ 1, /*isGpu*/ 0, /*isMultithread*/ mt);
        initZeroState(q);

        // k-qubit Hadamard tensor: a unitary whose square is identity, so repeated
        // application stays normalised while exercising the same communication.
        CompMatr m = createCompMatr(k);
        long dim = 1L << k;
        qreal norm = (qreal) pow(1.0 / sqrt(2.0), k);
        for (long i = 0; i < dim; i++)
            for (long j = 0; j < dim; j++)
                m.cpuElems[i][j] = qcomp((popcountLong(i & j) & 1) ? -norm : norm, 0);
        syncCompMatr(m);

        std::vector<int> targs;                       // highest k qubits -> max prefix targets
        for (int i = 0; i < k; i++) targs.push_back(n - 1 - i);

        applyCompMatr(q, targs, m);                   // warm up
        syncQuESTEnv();

        auto t0 = std::chrono::steady_clock::now();
        for (int r = 0; r < reps; r++) applyCompMatr(q, targs, m);
        syncQuESTEnv();
        auto t1 = std::chrono::steady_clock::now();
        double secs = std::chrono::duration<double>(t1 - t0).count() / reps;

        QuESTEnv env = getQuESTEnv();
        if (env.rank == 0)
            printf("n=%d k=%d nodes=%d mt=%d reps=%d time_per_apply=%.6f s\n",
                   n, k, env.numNodes, mt, reps, secs);

        destroyCompMatr(m);
        destroyQureg(q);
        finalizeQuESTEnv();
        return 0;
    }
Run (single box standing in for a bandwidth-limited link, see note):
    # fused build vs devel build, single-thread node (mt=0), eta=3 (np=8)
    export MPIR_CVAR_NOLOCAL=1 FI_PROVIDER=tcp   # MPICH: treat each rank as a separate
                                                 # node + route over TCP, so the saved
                                                 # comm volume is actually bandwidth-bound
    mpirun -np 8 ./bench_fused    29 4 2 0
    mpirun -np 8 ./bench_baseline 29 4 2 0
One caveat I want to be upfront about: I only had a single physical machine, where
Single-thread (mt=0), TCP transport, k=4, eta=3 (np=8), means over repeated trials on
The win grows with state size: there is a crossover below which the fused routine's
Exact comm-volume reduction (throwaway counter)
The volume cut itself is exact and hardware-independent. I measured it with a tiny
    static qindex sub_buffer_amp_tally = 0;
    qindex comm_getSubBufferAmpTally() { return sub_buffer_amp_tally; }
    void comm_resetSubBufferAmpTally() { sub_buffer_amp_tally = 0; }
and one line at the top of
    sub_buffer_amp_tally += numAmps; // benchmark scaffolding only
Tally
The mechanism: per
On the parallelisation notes: agreed and thanks for writing them out. Packing all
One more note unrelated to the above: the two red
The two red checks are both
Every other CUDA and CUDA+cuQuantum job passed (128 green); only the Linux[2] runner caught the bad mirror window, so this is unrelated to the diff. A re-run of the two failed jobs should clear it. I do not have re-run rights on the repo, so I am flagging it here.
Summary
Closes #595. Fuses the distributed prefix<->suffix multi-SWAP so each amplitude crosses the network at most once.
When a multi-qubit gate targets qubits that live in the prefix (the index bits that select which node holds an amplitude), QuEST first swaps those qubits down into the suffix. The localiser did this one SWAP at a time, so an amplitude moved by the first SWAP was often moved again by the next, crossing the network several times. This change works out each amplitude's final node up front and sends it there directly.
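A minimal sketch of that direct routing, not the PR's localiser code: it composes the disjoint prefix<->suffix SWAPs into one bit permutation of a global amplitude index and reads the destination node off the top bits. The names `swapBits`, `permuteIndex` and `destinationRanks`, and the flat layout with the rank in the high bits, are illustrative assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

using Index = std::uint64_t;
using SwapList = std::vector<std::pair<int, int>>;  // (prefix bit, suffix bit) pairs

// Swap bits a and b of x; a no-op when the two bits already agree.
Index swapBits(Index x, int a, int b) {
    Index differ = ((x >> a) ^ (x >> b)) & 1;
    return x ^ ((differ << a) | (differ << b));
}

// Apply every SWAP to a global index. The pairs are assumed disjoint,
// so the order of application does not change the result.
Index permuteIndex(Index globalInd, const SwapList& swaps) {
    for (const auto& sw : swaps)
        globalInd = swapBits(globalInd, sw.first, sw.second);
    return globalInd;
}

// Every other rank that `rank` sends at least one amplitude to, assuming
// each node holds 2^numSuffixBits amplitudes (hypothetical layout).
std::set<Index> destinationRanks(Index rank, int numSuffixBits, const SwapList& swaps) {
    assert(numSuffixBits > 0 && numSuffixBits < 63);
    std::set<Index> dests;
    Index numLocal = Index(1) << numSuffixBits;
    for (Index local = 0; local < numLocal; local++) {
        Index finalInd = permuteIndex((rank << numSuffixBits) | local, swaps);
        Index dest = finalInd >> numSuffixBits;
        if (dest != rank)
            dests.insert(dest);  // amplitudes that stay local are not sent
    }
    return dests;
}
```

With 4 suffix bits and the two SWAPs (5,1) and (4,0), `destinationRanks(0, 4, swaps)` holds three ranks, the 2^eta - 1 bound for eta=2. Applying `permuteIndex` twice returns the original index, since each SWAP is its own inverse.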
The SWAPs in such a group act on disjoint qubit pairs, so they commute and compose into a single permutation of the index bits, which is what makes the direct routing well defined. For the uncontrolled case (every internal caller: applyCompMatr, applyCompMatr2, the partial-trace path) the routine enumerates the up to 2^eta - 1 destination nodes, one per non-empty subset of the eta prefix targets whose partnered suffix bit disagrees with this node's rank bit. For each it packs, exchanges and unpacks only the amplitudes bound there. The move is an involution between paired nodes, so packed and unpacked amplitudes sit in the same local slots.
Design notes
This follows the two constraints from the issue thread:
New CPU kernel
cpu_statevec_unpackAmpsFromBuffer is the inverse of the existing cpu_statevec_packAmpsIntoBuffer: an OpenMP scatter that writes the contiguous received sub-buffer back into the strided local amplitudes selected by several constrained qubits, via insertBitsWithMaskedValues, so it loops over O(amplitudes moved) and never over O(2^N).
Scope
CPU/OpenMP, which the issue notes is sufficient. GPU quregs and controlled multi-SWAPs keep the existing per-SWAP path, so the GPU build and its numerics are untouched. A GPU mirror of the kernel is written and ready as a follow-up, kept out of this PR because I have no CUDA hardware to compile it on and did not want this change to risk the GPU build.
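For readers unfamiliar with the pack/unpack pairing described under "New CPU kernel", the following is a hedged, serial sketch of the idea rather than the kernel in this PR. `expandIndex` is a hypothetical stand-in for a masked bit insertion (the real `insertBitsWithMaskedValues` signature is not assumed here), and `packAmps` / `unpackAmps` are illustrative names.

```cpp
#include <cassert>
#include <complex>
#include <cstdint>
#include <vector>

using Index = std::uint64_t;
using Amp = std::complex<double>;

// Hypothetical helper: spread the bits of `compact` over the positions NOT
// set in `mask`, and copy the bits of `values` into the positions set in `mask`.
Index expandIndex(Index compact, Index mask, Index values, int numBits) {
    Index full = 0;
    for (int bit = 0; bit < numBits; bit++) {
        Index pos = Index(1) << bit;
        if (mask & pos) {
            full |= values & pos;           // constrained qubit: fixed value
        } else {
            full |= (compact & 1) << bit;   // free qubit: next bit of the compact index
            compact >>= 1;
        }
    }
    return full;
}

// Gather the strided amplitudes selected by (mask, values) into a contiguous buffer.
void packAmps(const std::vector<Amp>& amps, std::vector<Amp>& buffer,
              Index mask, Index values, int numBits) {
    for (Index j = 0; j < buffer.size(); j++) {
        Index ind = expandIndex(j, mask, values, numBits);
        assert(ind < amps.size());
        buffer[j] = amps[ind];
    }
}

// Scatter the contiguous buffer back into the same strided slots. The loop
// runs over the buffer only, so the work is proportional to the amplitudes moved.
// Each iteration writes a distinct slot, so it could be parallelised.
void unpackAmps(std::vector<Amp>& amps, const std::vector<Amp>& buffer,
                Index mask, Index values, int numBits) {
    for (Index j = 0; j < buffer.size(); j++) {
        Index ind = expandIndex(j, mask, values, numBits);
        assert(ind < amps.size());
        amps[ind] = buffer[j];
    }
}
```

With 4 local index bits, `mask = 0b0101` and `values = 0b0100`, the four selected slots are 4, 6, 12 and 14. Packing then unpacking with the same mask and values touches exactly those slots.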
Files: core/localiser.cpp, core/accelerator.cpp, core/accelerator.hpp, cpu/cpu_subroutines.cpp, cpu/cpu_subroutines.hpp.
Correctness
The fused routine must give bit-identical results to the per-SWAP path. The existing suites compare against an independent reference state and pass at 1, 2, 4 and 8 ranks:
(applySwap, applyCompMatr, applyCompMatr2, calcPartialTrace.)
Benchmark
Communication volume (exact, hardware independent). Measured by tallying amplitudes pushed through the sub-buffer exchange:
The fused group moves a fraction 1 - 1/2^eta of the partition once, where the per-swap path relays it across eta exchanges, each moving half a partition, for eta/2 partitions in total. Both paths run twice over the down-and-back swap, which cancels in the ratio. So the ratio is (1 - 1/2^eta)/(eta/2) = 2(1 - 1/2^eta)/eta: 3/4 at eta=2 and 7/12 ~ 0.583 at eta=3, matching the table.
Wall clock. I do not have a cluster, so I cannot measure the real multi-node speedup directly. On a single box, intra-node MPI is shared memory with effectively no bandwidth limit, so the saved volume buys nothing and the fused routine is slower there. To measure the bandwidth-limited regime that distribution actually runs in, I forced MPICH off its shared-memory shortcut onto the TCP transport (MPIR_CVAR_NOLOCAL=1 FI_PROVIDER=tcp), which gives a genuine bandwidth-limited path between ranks on one machine. This is an emulation, not a real cluster. I flag it as such.
Single thread per node (mt=0, the case the issue asks to verify), eta=3, the speedup grows with state size as the saved volume starts to outweigh the fused routine's extra rounds:
At the largest state tested, single threaded, fused is faster on every trial. With OpenMP on (the realistic deployment, where the extra packing is parallelised away) the win is larger and cleaner: n=28, 8 ranks, fused 4.831 s vs baseline 5.425 s, about 11% less time. For eta=2 the 25% volume cut is too small to beat the extra-round overhead single threaded and the result is a tie.
So the extra packing does not outweigh the comm saving: single threaded it is a wash at small state and a win at large state. Once threads or a real (slower than loopback) interconnect enter, I expect it to win across the range, though only the threaded n=28 case above was measured. Happy to have this confirmed on a real cluster via CI.
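The volume ratio quoted above can be checked with a small counting model. This is a sketch under stated assumptions, not the tally used for the benchmark: it assumes eta disjoint SWAPs that pair prefix bit (numSuffixBits + i) with suffix bit i, counts one crossing whenever a SWAP moves an amplitude to another node, and uses the illustrative names `CommTally` and `countCrossings`.

```cpp
#include <cassert>
#include <cstdint>

struct CommTally {
    std::uint64_t perSwap;  // crossings when the SWAPs are applied one at a time
    std::uint64_t fused;    // crossings when each amplitude is routed once
};

// Count network crossings over all 2^(numSuffixBits + eta) amplitudes
// for a single multi-SWAP in one direction.
CommTally countCrossings(int numSuffixBits, int eta) {
    assert(eta >= 1 && eta <= numSuffixBits);
    CommTally tally = {0, 0};
    std::uint64_t numAmps = std::uint64_t(1) << (numSuffixBits + eta);
    for (std::uint64_t ind = 0; ind < numAmps; ind++) {
        bool movedAtAll = false;
        for (int i = 0; i < eta; i++) {
            // A SWAP changes an amplitude's node only when its two bits differ.
            // The pairs are disjoint, so earlier SWAPs never alter this pair's bits.
            bool differ = ((ind >> (numSuffixBits + i)) ^ (ind >> i)) & 1;
            tally.perSwap += differ;
            movedAtAll = movedAtAll || differ;
        }
        tally.fused += movedAtAll;  // one direct send if any pair differs
    }
    return tally;
}
```

Under this model, `countCrossings(4, 2)` gives 48 fused crossings against 64 per-swap, and `countCrossings(4, 3)` gives 112 against 192, which reduce to 3/4 and 7/12.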
AI disclosure
This change was implemented with substantial help from Claude (Anthropic), which drafted the fused routing in the localiser, the unpack kernel and the benchmark and proposed the subset-enumeration design. I reviewed the approach against the issue thread and the QuEST distributed paper (arXiv:2311.01512). I ran the tests at 1/2/4/8 ranks and the benchmark and own the change. Verified locally before submitting: the CPU/OpenMP suites green at 1/2/4/8 ranks, the comm-volume reduction measured directly and the wall-clock benchmark run under the emulated transport above. The GPU path was not compiled (no CUDA hardware) and is deliberately left out.