QuEST #749 — optimise away small GPU qubit-list copies (Thrust) [unitaryHACK]#792

Open
mk0dz wants to merge 1 commit into QuEST-Kit:devel from mk0dz:feat/749-thrust-projector-bitmask

mk0dz commented Jun 12, 2026


/claim #749

The ask

QuEST's GPU Thrust backend copies operator qubit-lists from host to device on every
call (getDevInts). For small Quregs, that tiny cudaMalloc+cudaMemcpy costs more
than applying the operator. The task: profile it, quantify the % of runtime and the
max speedup from removing it, then optimise a candidate function to eliminate the
copy, the way thrust_statevec_calcExpecAnyTargZ_sub already does: by passing a
bitmask (a qindex primitive that rides in the functor) instead of a device list.

Profiling (CUDA events; Nsight Systems optional)

nsys isn't bundled with CUDA 13.3, so I isolated the two costs with CUDA events
(more reproducible than reading a timeline). Microbenchmark scratch/thrust_copy_749.cu:

  • getDevInts (device_vector from a 3-element host list) = fixed ~6.5 µs,
    independent of qubit count (it's cudaMalloc + cudaMemcpy launch latency).
  • The projector thrust::transform scales with 2ⁿ: ~6.6 µs (4q) → ~1679 µs (24q).
  • So the copy is ~50 % of runtime below ~12 qubits, which caps the speedup from
    removing it at ~2×; it is negligible above ~20 qubits.
    (Plot: scratch/plots_749/copy_overhead.png.)
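
The ~2× ceiling follows from Amdahl's law: if the copy is a fraction f of the total time, removing it entirely gives at most 1 / (1 − f). A minimal sketch of that arithmetic, with hypothetical helper names (copyFraction, maxSpeedup) that are not part of QuEST or the microbenchmark:

```cpp
#include <cassert>

// Fraction of total runtime spent in the fixed-cost host-to-device copy.
// copyUs and kernelUs are measured times in microseconds.
double copyFraction(double copyUs, double kernelUs) {
    assert(copyUs >= 0.0 && kernelUs > 0.0);
    return copyUs / (copyUs + kernelUs);
}

// Upper bound on the speedup if the copy were removed entirely
// (Amdahl's law: 1 / (1 - f), which equals (copy + kernel) / kernel).
double maxSpeedup(double copyUs, double kernelUs) {
    return 1.0 / (1.0 - copyFraction(copyUs, kernelUs));
}
```

Plugging in the 4-qubit figures above (6.5 µs copy, 6.6 µs transform) gives f of roughly 0.50 and a bound just under 2×; the 24-qubit figures (6.5 µs vs 1679 µs) give f below 0.5 % and a bound of about 1.004×.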

(If nsys is installed: scratch/nsys_target_749.cpp profiles the issue's example —
phase 1 = copy-free calcExpecPauliStr, phase 2 = applyMultiQubitProjector — so the
cudaMemcpy is visible in the timeline before the fix and gone after.)

The fix (bitmask functors — exactly the issue's suggested technique)

Two candidate functions converted from a device qubit-list to two qindex bitmasks
passed by value in the functor (qubitMask = projected positions,
valueMask = util_getBitMask(qubits, outcomes) = those positions set to the outcome):

  • functor_projectStateVec / thrust_statevec_multiQubitProjector_sub
  • functor_projectDensMatr / thrust_densmatr_multiQubitProjector_sub

The per-amp test reduces to (index & qubitMask) == valueMask (density matrix: both
row and column must match) — no device allocation, and loop-free so no NumTargets
unrolling is needed either. The getDevInts calls are gone from both.
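
To illustrate why the mask test is equivalent to walking the qubit list, here is a host-only C++ sketch, not the code in gpu_thrust.cuh. The names makeMask, matchesList, ProjectStateVec and ProjectDensMatr are hypothetical, qindex is modelled as a 64-bit integer, and the density-matrix layout (row in the low numQubits bits of the flat index, column in the high bits) is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using qindex = std::int64_t;  // stand-in for QuEST's qindex (assumed 64-bit here)

// Hypothetical helper: set bit qubits[i] wherever bits[i] is 1.
// With bits all 1 this yields qubitMask; with bits = outcomes it yields valueMask.
qindex makeMask(const std::vector<int>& qubits, const std::vector<int>& bits) {
    assert(qubits.size() == bits.size());
    qindex mask = 0;
    for (std::size_t i = 0; i < qubits.size(); ++i)
        mask |= static_cast<qindex>(bits[i]) << qubits[i];
    return mask;
}

// List-based reference: needs the qubit list available wherever it runs.
bool matchesList(qindex index, const std::vector<int>& qubits,
                 const std::vector<int>& outcomes) {
    for (std::size_t i = 0; i < qubits.size(); ++i)
        if (((index >> qubits[i]) & 1) != outcomes[i]) return false;
    return true;
}

// Mask-based predicate: two integers held by value, no list and no loop.
struct ProjectStateVec {
    qindex qubitMask, valueMask;
    bool keeps(qindex index) const { return (index & qubitMask) == valueMask; }
};

// Density-matrix variant: both the row and the column index must match.
// Assumes flat index = row + (col << numQubits).
struct ProjectDensMatr {
    int numQubits;
    qindex qubitMask, valueMask;
    bool keeps(qindex index) const {
        qindex row = index & ((qindex(1) << numQubits) - 1);
        qindex col = index >> numQubits;
        return (row & qubitMask) == valueMask && (col & qubitMask) == valueMask;
    }
};
```

For example, projecting qubits {1, 3} onto outcomes {1, 0} gives qubitMask = 0b01010 and valueMask = 0b00010, and keeps(index) then agrees with matchesList for every basis-state index. Because the struct holds only two integers, it can be copied into a kernel by value, which is what makes the device allocation unnecessary.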

One file changed: quest/src/gpu/gpu_thrust.cuh (~50/−54). The dispatch template
param NumQubits is retained (unused) so the caller/INSTANTIATE machinery is untouched.

Correctness

GPU Catch2 suite, projector cases (covering both statevec and densmatr Quregs), with
both optimisations applied: All tests passed (16576 assertions in 2 test cases).
The two test cases are applyQubitProjector and applyMultiQubitProjector.

Results (GTX 1650, fp64)

End-to-end applyMultiQubitProjector before/after (scratch/plots_749/before_after.png):
1.5–3× faster at 12–18 qubits, converging to 1× when the 2ⁿ transform dominates.

The very-small (4–10 q) regime stays ~flat end-to-end because applyMultiQubitProjector
also calls calcProbOfMultiQubitOutcome, which has its own getDevInts (see
scope note). The isolated microbenchmark already shows the ~2× available there.

[Plots: before_after, copy_overhead]

Scope note — what I did NOT change, and why

thrust_statevec/densmatr_calcProbOfMultiQubitOutcome_sub also copy a qubit list, but
via functor_insertBits, which inserts bits at the qubit positions to enumerate the
matching basis states. That genuinely needs the positions: a mask-only version needs a
device bit-deposit (PDEP-style) or a by-value fixed-size array in the functor, and
either one trades extra per-element work against the saved copy. That is the "much
more substantial" optimisation the issue defers to #739, so I left it as a follow-up
rather than risk a regression. The two projectors are the clean, unambiguous bitmask
wins the issue points to.
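
For reference, a rough host-only sketch of what a mask-based bit-deposit could look like, and why it costs more per element than the projector test. This is not code from QuEST or this PR: depositBits, insertBit and nthMatchingIndex are hypothetical names, and the software loop stands in for a PDEP-style primitive.

```cpp
#include <cassert>
#include <cstdint>

// Software bit-deposit (PDEP-style): scatter the low bits of src into the
// set-bit positions of mask, lowest position first.
std::uint64_t depositBits(std::uint64_t src, std::uint64_t mask) {
    std::uint64_t out = 0;
    for (std::uint64_t bit = 1; mask != 0; bit <<= 1) {
        std::uint64_t lowest = mask & (~mask + 1);  // isolate lowest set bit
        if (src & bit) out |= lowest;
        mask &= mask - 1;                           // clear that bit
    }
    return out;
}

// List-style reference: insert one bit at position pos, shifting higher bits up.
std::uint64_t insertBit(std::uint64_t index, int pos, int bit) {
    std::uint64_t low = index & ((std::uint64_t(1) << pos) - 1);
    std::uint64_t high = (index >> pos) << (pos + 1);
    return high | (static_cast<std::uint64_t>(bit) << pos) | low;
}

// k-th basis state whose projected qubits hold the requested outcomes:
// deposit k into the free (non-projected) positions, then OR in the outcomes.
std::uint64_t nthMatchingIndex(std::uint64_t k, int numQubits,
                               std::uint64_t qubitMask, std::uint64_t valueMask) {
    assert(numQubits > 0 && numQubits < 64);
    std::uint64_t all = (std::uint64_t(1) << numQubits) - 1;
    return depositBits(k, all & ~qubitMask) | valueMask;
}
```

With 5 qubits, projected qubits {1, 3} and outcomes {1, 0}, nthMatchingIndex(k, 5, 0b01010, 0b00010) matches inserting a 1 at position 1 and then a 0 at position 3 into k. The deposit loop runs once per free qubit for every element, which is the per-element cost that has to be weighed against the fixed copy.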

Files changed (PR contents)

 quest/src/gpu/gpu_thrust.cuh   (two projector functors + their _sub callers)

Reproduce

cmake -B gbuild -G Ninja -DCMAKE_BUILD_TYPE=Release -DQUEST_ENABLE_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=75
cmake --build gbuild --target QuEST
nvcc -O3 -arch=sm_75 scratch/thrust_copy_749.cu -o scratch/thrust_copy_749 && scratch/thrust_copy_749
.venv/bin/python scratch/plot_749.py

…uEST-Kit#749)

multiQubitProjector's CPU... GPU Thrust path copied the qubit list to a fresh
device_vector every call (getDevInts); for small Quregs this cudaMalloc+memcpy
dominates runtime. Following thrust_statevec_calcExpecAnyTargZ_sub, the statevec
and densmatr projector functors now take two qindex bitmasks by value instead,
reducing the per-amp test to (index & qubitMask) == valueMask. No device list,
no alloc/copy.

Profiled (CUDA events, GTX 1650): the copy is a fixed ~6.5us, ~50% of runtime
below ~12 qubits; end-to-end applyMultiQubitProjector is 1.5-3x faster at 12-18
qubits. calcProbOfMultiQubitOutcome's insertBits copy is left as a follow-up
(needs a mask-based bit-deposit; QuEST-Kit#739 territory).