QuEST #749: optimise away small GPU qubit-list copies (Thrust) [unitaryHACK] #792
Open
mk0dz wants to merge 1 commit into
…uEST-Kit#749) multiQubitProjector's CPU... GPU Thrust path copied the qubit list to a fresh device_vector every call (getDevInts); for small Quregs this cudaMalloc+memcpy dominates runtime. Following thrust_statevec_calcExpecAnyTargZ_sub, the statevec and densmatr projector functors now take two qindex bitmasks by value instead, reducing the per-amp test to (index & qubitMask) == valueMask. No device list, no alloc/copy. Profiled (CUDA events, GTX 1650): the copy is a fixed ~6.5us, ~50% of runtime below ~12 qubits; end-to-end applyMultiQubitProjector is 1.5-3x faster at 12-18 qubits. calcProbOfMultiQubitOutcome's insertBits copy is left as a follow-up (needs a mask-based bit-deposit; QuEST-Kit#739 territory).
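The per-amplitude test named in the commit message, (index & qubitMask) == valueMask, can be sketched on the host as below. This is only an illustration of the predicate, not the code in gpu_thrust.cuh: the struct names, makeBitMask, the count helpers, and the flat density-matrix layout (one index half for the row, the other for the column) are assumptions made for the example. Because both halves must match, it does not matter which half is treated as the row.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using qindex = std::int64_t;  // assumption: stand-in for QuEST's index type

// Hypothetical helper: set the bit at each qubits[i] to values[i].
// Passing all-ones values yields the qubitMask; passing outcomes yields the valueMask.
qindex makeBitMask(const std::vector<int>& qubits, const std::vector<int>& values) {
    qindex mask = 0;
    for (std::size_t i = 0; i < qubits.size(); ++i)
        mask |= static_cast<qindex>(values[i] & 1) << qubits[i];
    return mask;
}

// Statevector predicate: both masks travel by value, so no device list is needed.
struct ProjectStateVec {
    qindex qubitMask;
    qindex valueMask;
    bool keeps(qindex index) const { return (index & qubitMask) == valueMask; }
};

// Density-matrix predicate: the flat index is split into two numQubits-wide halves
// (assumed layout) and both halves must match the projected outcome.
struct ProjectDensMatr {
    qindex qubitMask;
    qindex valueMask;
    int numQubits;
    bool keeps(qindex index) const {
        qindex dim = qindex(1) << numQubits;
        qindex lo = index & (dim - 1);
        qindex hi = index >> numQubits;
        return (lo & qubitMask) == valueMask && (hi & qubitMask) == valueMask;
    }
};

// Count how many amplitudes survive the projection (the rest would be zeroed).
int countKeptStateVec(const ProjectStateVec& f, int numQubits) {
    int kept = 0;
    for (qindex i = 0; i < (qindex(1) << numQubits); ++i) kept += f.keeps(i) ? 1 : 0;
    return kept;
}

int countKeptDensMatr(const ProjectDensMatr& f) {
    int kept = 0;
    for (qindex i = 0; i < (qindex(1) << (2 * f.numQubits)); ++i) kept += f.keeps(i) ? 1 : 0;
    return kept;
}
```

For qubits {0, 2} with outcomes {1, 0} on a 3-qubit register, the masks are 0b101 and 0b001, so only indices 0b001 and 0b011 pass the statevector test.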
/claim #749
The ask
QuEST's GPU Thrust backend copies operator qubit-lists from host to device on every
call (getDevInts). For a small Qureg, that tiny cudaMalloc + cudaMemcpy costs more
than applying the operator. Profile it, quantify the % of runtime and the max speedup
from removing it, then optimise a candidate function to eliminate the copy, the way
thrust_statevec_calcExpecAnyTargZ_sub already does, by passing a bitmask (a qindex
primitive that rides in the functor) instead of a device list.
Profiling (CUDA events; Nsight Systems optional)
nsys isn't bundled with CUDA 13.3, so I isolated the two costs with CUDA events
(more reproducible than reading a timeline). Microbenchmark
scratch/thrust_copy_749.cu:
getDevInts (device_vector from a 3-element host list) = fixed ~6.5 µs, independent of qubit count (it's cudaMalloc + cudaMemcpy launch latency).
thrust::transform scales with 2ⁿ: ~6.6 µs (4q) → ~1679 µs (24q).
… above ~20 qubits. (Plot: scratch/plots_749/copy_overhead.png.)
(If nsys is installed: scratch/nsys_target_749.cpp profiles the issue's example, with
phase 1 = copy-free calcExpecPauliStr and phase 2 = applyMultiQubitProjector, so the
cudaMemcpy is visible in the timeline before the fix and gone after.)
The fix (bitmask functors, exactly the issue's suggested technique)
Two candidate functions converted from a device qubit-list to two qindex bitmasks
passed by value in the functor (qubitMask = projected positions,
valueMask = util_getBitMask(qubits, outcomes) = those positions set to the outcome):
functor_projectStateVec / thrust_statevec_multiQubitProjector_sub
functor_projectDensMatr / thrust_densmatr_multiQubitProjector_sub
The per-amp test reduces to (index & qubitMask) == valueMask (density matrix: both
row and column must match). There is no device allocation, and the test is loop-free, so no
NumTargets unrolling is needed either. The getDevInts calls are gone from both.
One file changed: quest/src/gpu/gpu_thrust.cuh (~50/−54). The dispatch template
param NumQubits is retained (unused) so the caller/INSTANTIATE machinery is untouched.
Correctness
GPU Catch2 suite, projector cases (cover both statevec & densmatr Quregs), with
both optimisations applied: All tests passed (16576 assertions in 2 test cases)
(applyQubitProjector, applyMultiQubitProjector).
Results (GTX 1650, fp64)
End-to-end applyMultiQubitProjector before/after (scratch/plots_749/before_after.png):
1.5–3× faster at 12–18 qubits, converging to 1× when the 2ⁿ transform dominates.
The very-small (4–10 q) regime stays ~flat end-to-end because applyMultiQubitProjector
also calls calcProbOfMultiQubitOutcome, which has its own getDevInts (see
scope note). The isolated microbenchmark already shows the ~2× available there.
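As a rough cross-check of the "~50% of runtime" and "~2× available" figures, the quoted microbenchmark numbers can be plugged into a toy cost model. The sketch below assumes (this is not measured) that the transform time is the larger of the ~6.6 µs small-size figure and the 24-qubit figure scaled by 2ⁿ; the function names are made up for the example. It models only the isolated copy-plus-transform microbenchmark, not the end-to-end applyMultiQubitProjector timings.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Figures quoted in the PR text (GTX 1650, CUDA events).
constexpr double kCopyUs          = 6.5;     // getDevInts, fixed cost
constexpr double kTransformMinUs  = 6.6;     // thrust::transform at 4 qubits
constexpr double kTransform24qUs  = 1679.0;  // thrust::transform at 24 qubits

// ASSUMPTION: transform time = max(launch floor, 24-qubit time scaled by 2^(n-24)).
double transformUs(int numQubits) {
    double scaled = kTransform24qUs * std::ldexp(1.0, numQubits - 24);
    return std::max(kTransformMinUs, scaled);
}

// Fraction of (copy + transform) spent in the copy.
double copyFraction(int numQubits) {
    return kCopyUs / (kCopyUs + transformUs(numQubits));
}

// Upper bound on the speedup from removing the copy entirely.
double maxSpeedup(int numQubits) {
    return 1.0 / (1.0 - copyFraction(numQubits));
}
```

Under this model the copy is about half of the time at 12 qubits (a bound of roughly 2×) and falls to a few percent by 20 qubits, which is consistent with the figures reported above but is not a substitute for them.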
Scope note — what I did NOT change, and why
thrust_statevec/densmatr_calcProbOfMultiQubitOutcome_sub also copy a qubit list, but
via functor_insertBits, which inserts bits at the qubit positions to enumerate
matching basis states. That genuinely needs the positions; a mask-only version needs a
device bit-deposit (PDEP-style) or a by-value fixed-size array in the functor, with a
real per-element vs copy trade-off. That is the "much more substantial" optimisation
the issue defers to #739, so I left it as a follow-up rather than risk a regression.
The two projectors are the clean, unambiguous bitmask wins the issue points to.
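To make the trade-off in this scope note concrete, here is a host-side sketch of the two alternatives it mentions: a fixed-size position array carried by value, and a software bit-deposit (PDEP-style) driven only by masks. Neither is taken from QuEST; InsertBitsByValue, MaskEnumerator, insertBit and depositBits are hypothetical names, the index type is an unsigned stand-in, and the actual functor_insertBits may differ. Both map the k-th enumeration index to the k-th basis state whose projected qubits hold the requested outcome.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using qindex = std::uint64_t;  // assumption: unsigned stand-in for QuEST's index type

// Insert `bit` at position `pos`, shifting the higher bits of n up by one.
inline qindex insertBit(qindex n, int pos, int bit) {
    qindex left  = (n >> pos) << (pos + 1);
    qindex mid   = static_cast<qindex>(bit & 1) << pos;
    qindex right = n & ((qindex(1) << pos) - 1);
    return left | mid | right;
}

// Alternative 1: positions held by value in a fixed-size array (must be sorted ascending).
// Costs N insertions per element but needs no device allocation.
template <int N>
struct InsertBitsByValue {
    std::array<int, N> positions;
    std::array<int, N> bits;
    qindex operator()(qindex k) const {
        for (int i = 0; i < N; ++i) k = insertBit(k, positions[i], bits[i]);
        return k;
    }
};

// Software bit-deposit: scatter the low bits of src into the set bits of mask, low to high.
inline qindex depositBits(qindex src, qindex mask) {
    qindex out = 0;
    for (qindex bit = 1; mask != 0; bit <<= 1) {
        if (mask & bit) {
            if (src & 1) out |= bit;
            src >>= 1;
            mask &= ~bit;
        }
    }
    return out;
}

// Alternative 2: mask-only enumeration. freeMask marks the non-projected qubits,
// valueMask holds the projected qubits already set to their outcomes.
struct MaskEnumerator {
    qindex freeMask;
    qindex valueMask;
    qindex operator()(qindex k) const { return depositBits(k, freeMask) | valueMask; }
};
```

The per-element loop in either version is the cost that has to be weighed against the fixed host-to-device copy, which is why the scope note treats it as a separate, larger change.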
Files changed (PR contents)
Reproduce
cmake -B gbuild -G Ninja -DCMAKE_BUILD_TYPE=Release -DQUEST_ENABLE_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=75
cmake --build gbuild --target QuEST
nvcc -O3 -arch=sm_75 scratch/thrust_copy_749.cu -o scratch/thrust_copy_749 && scratch/thrust_copy_749
.venv/bin/python scratch/plot_749.py