QuEST #749 — optimise away small GPU qubit-list copies (Thrust) [unitaryHACK] by mk0dz · Pull Request #792 · QuEST-Kit/QuEST

mk0dz · 2026-06-12T23:21:55Z

/claim #749

The ask

QuEST's GPU Thrust backend copies operator qubit-lists from host to device on every
call (getDevInts). For small Quregs, that tiny cudaMalloc+cudaMemcpy costs more
than applying the operator. Profile it, quantify the % of runtime and the max speedup
from removing it, then optimise a candidate function to eliminate the copy, the way
thrust_statevec_calcExpecAnyTargZ_sub already does: by passing a bitmask (a
qindex primitive that rides in the functor) instead of a device list.

Profiling (CUDA events; Nsight Systems optional)

nsys isn't bundled with CUDA 13.3, so I isolated the two costs with CUDA events
(more reproducible than reading a timeline). Microbenchmark scratch/thrust_copy_749.cu:

getDevInts (device_vector from a 3-element host list) = fixed ~6.5 µs,
independent of qubit count (it is the fixed call overhead of cudaMalloc + cudaMemcpy).
The projector thrust::transform scales with 2ⁿ: ~6.6 µs (4q) → ~1679 µs (24q).
So the copy is ~50% of runtime below ~12 qubits → ~2× max speedup; negligible
above ~20 qubits. (Plot: scratch/plots_749/copy_overhead.png.)
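
The percentages above follow from simple arithmetic on the two measured times. A minimal host-side sketch of that arithmetic, assuming the copy and the transform run back to back and the transform time is unchanged once the copy is removed (the function names are hypothetical and not part of QuEST):

```cpp
#include <cassert>

// Fraction of the total call time spent in the fixed-cost qubit-list copy.
// Assumes the copy and the kernel do not overlap.
double copyFraction(double copyMicros, double kernelMicros) {
    assert(copyMicros >= 0.0 && kernelMicros > 0.0);
    return copyMicros / (copyMicros + kernelMicros);
}

// Upper bound on the speedup from removing the copy entirely,
// assuming the kernel time itself does not change.
double maxSpeedup(double copyMicros, double kernelMicros) {
    assert(copyMicros >= 0.0 && kernelMicros > 0.0);
    return (copyMicros + kernelMicros) / kernelMicros;
}
```

With the measured values, copyFraction(6.5, 6.6) is just under 0.5 and maxSpeedup(6.5, 6.6) is just under 2 (the 4-qubit case), while maxSpeedup(6.5, 1679.0) is within half a percent of 1 (the 24-qubit case).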

(If nsys is installed: scratch/nsys_target_749.cpp profiles the issue's example —
phase 1 = copy-free calcExpecPauliStr, phase 2 = applyMultiQubitProjector — so the
cudaMemcpy is visible in the timeline before the fix and gone after.)

The fix (bitmask functors — exactly the issue's suggested technique)

Two candidate functions converted from a device qubit-list to two qindex bitmasks
passed by value in the functor (qubitMask = projected positions,
valueMask = util_getBitMask(qubits, outcomes) = those positions set to the outcome):

functor_projectStateVec / thrust_statevec_multiQubitProjector_sub
functor_projectDensMatr / thrust_densmatr_multiQubitProjector_sub

The per-amp test reduces to (index & qubitMask) == valueMask (density matrix: both
row and column must match) — no device allocation, and loop-free so no NumTargets
unrolling is needed either. The getDevInts calls are gone from both.
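
The mask test can be illustrated on the host. This is a sketch only, not the code in gpu_thrust.cuh: index_t stands in for QuEST's qindex, ProjectorMasks, buildMasks and the keep* functions are hypothetical names, and the density-matrix layout (row in the low numQubits bits of the flat index, column in the high bits) is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using index_t = std::uint64_t; // stand-in for QuEST's qindex (assumption)

// Two masks passed by value: which bit positions are projected,
// and what those positions must equal.
struct ProjectorMasks {
    index_t qubitMask;
    index_t valueMask;
};

// Build both masks once on the host from the qubit and outcome lists.
ProjectorMasks buildMasks(const std::vector<int>& qubits,
                          const std::vector<int>& outcomes) {
    assert(qubits.size() == outcomes.size());
    ProjectorMasks m{0, 0};
    for (std::size_t k = 0; k < qubits.size(); ++k) {
        index_t bit = index_t(1) << qubits[k];
        m.qubitMask |= bit;
        if (outcomes[k]) m.valueMask |= bit;
    }
    return m;
}

// Statevector: keep the amplitude iff every projected bit matches its outcome.
bool keepStateVecAmp(index_t i, const ProjectorMasks& m) {
    return (i & m.qubitMask) == m.valueMask;
}

// Density matrix: both the row and the column index must match.
// Assumed layout: flat index = row | (col << numQubits).
bool keepDensMatrAmp(index_t i, int numQubits, const ProjectorMasks& m) {
    index_t row = i & ((index_t(1) << numQubits) - 1);
    index_t col = i >> numQubits;
    return keepStateVecAmp(row, m) && keepStateVecAmp(col, m);
}
```

For example, projecting qubit 0 onto outcome 1 and qubit 2 onto outcome 0 gives qubitMask = 5 and valueMask = 1, so basis state 3 (binary 011) is kept and basis state 7 (binary 111) is zeroed. Because the predicate is a single AND and compare, it needs no loop over targets and no device-side list.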

One file changed: quest/src/gpu/gpu_thrust.cuh (~50/−54). The dispatch template
param NumQubits is retained (unused) so the caller/INSTANTIATE machinery is untouched.

Correctness

GPU Catch2 suite, projector cases (cover both statevec & densmatr Quregs), with
both optimisations applied: All tests passed (16576 assertions in 2 test cases)
(applyQubitProjector, applyMultiQubitProjector).

Results (GTX 1650, fp64)

End-to-end applyMultiQubitProjector before/after (scratch/plots_749/before_after.png):
1.5–3× faster at 12–18 qubits, converging to 1× when the 2ⁿ transform dominates.

The very-small (4–10 q) regime stays ~flat end-to-end because applyMultiQubitProjector
also calls calcProbOfMultiQubitOutcome, which has its own getDevInts (see
scope note). The isolated microbenchmark already shows the ~2× available there.

Scope note — what I did NOT change, and why

thrust_statevec/densmatr_calcProbOfMultiQubitOutcome_sub also copy a qubit list, but
via functor_insertBits, which inserts bits at the qubit positions to enumerate
matching basis states. That genuinely needs the positions; a mask-only version needs a
device bit-deposit (PDEP-style) or a by-value fixed-size array in the functor, and
either one trades extra per-element work against the cost of the copy. That is the
"much more substantial" optimisation the issue defers to #739, so I left it as a
follow-up rather than risk a regression.
The two projectors are the clean, unambiguous bitmask wins the issue points to.
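
To make the trade-off concrete, here is a host-side sketch of the two ways to enumerate matching basis states. It is not QuEST's functor_insertBits: all names are hypothetical, the list version assumes the qubits are sorted in ascending order, and depositBits is a plain software loop standing in for a PDEP-style bit-deposit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// List-based version: insert outcome bits at the given positions.
// Assumes sortedQubits is ascending, so earlier insertions never
// shift a bit that a later insertion places.
std::uint64_t insertBitsAtPositions(std::uint64_t n,
                                    const std::vector<int>& sortedQubits,
                                    const std::vector<int>& outcomes) {
    assert(sortedQubits.size() == outcomes.size());
    for (std::size_t k = 0; k < sortedQubits.size(); ++k) {
        int q = sortedQubits[k];
        assert(q >= 0 && q < 63);
        std::uint64_t low = n & ((std::uint64_t(1) << q) - 1);
        std::uint64_t high = (n >> q) << (q + 1);
        n = high | (std::uint64_t(outcomes[k] ? 1 : 0) << q) | low;
    }
    return n;
}

// Software bit-deposit: scatter the low bits of src into the
// set-bit positions of mask, lowest position first.
std::uint64_t depositBits(std::uint64_t src, std::uint64_t mask) {
    std::uint64_t out = 0;
    for (std::uint64_t bit = 1; mask != 0; bit <<= 1) {
        std::uint64_t lowest = mask & (~mask + 1); // isolate lowest set bit
        if (src & bit) out |= lowest;
        mask &= mask - 1;                          // clear lowest set bit
    }
    return out;
}

// Mask-only version: deposit n into the non-projected positions,
// then OR in the fixed outcome bits.
std::uint64_t insertBitsViaMasks(std::uint64_t n,
                                 std::uint64_t qubitMask,
                                 std::uint64_t valueMask) {
    return depositBits(n, ~qubitMask) | valueMask;
}
```

Both versions produce the same index, but the deposit loop runs once per free bit for every element, which is the per-element cost that has to be weighed against a one-off copy.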

Files changed (PR contents)

 quest/src/gpu/gpu_thrust.cuh   (two projector functors + their _sub callers)

Reproduce

cmake -B gbuild -G Ninja -DCMAKE_BUILD_TYPE=Release -DQUEST_ENABLE_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=75
cmake --build gbuild --target QuEST
nvcc -O3 -arch=sm_75 scratch/thrust_copy_749.cu -o scratch/thrust_copy_749 && scratch/thrust_copy_749
.venv/bin/python scratch/plot_749.py

…uEST-Kit#749) multiQubitProjector's CPU... GPU Thrust path copied the qubit list to a fresh device_vector every call (getDevInts); for small Quregs this cudaMalloc+memcpy dominates runtime. Following thrust_statevec_calcExpecAnyTargZ_sub, the statevec and densmatr projector functors now take two qindex bitmasks by value instead, reducing the per-amp test to (index & qubitMask) == valueMask. No device list, no alloc/copy. Profiled (CUDA events, GTX 1650): the copy is a fixed ~6.5us, ~50% of runtime below ~12 qubits; end-to-end applyMultiQubitProjector is 1.5-3x faster at 12-18 qubits. calcProbOfMultiQubitOutcome's insertBits copy is left as a follow-up (needs a mask-based bit-deposit; QuEST-Kit#739 territory).

mk0dz wants to merge 1 commit into QuEST-Kit:devel from mk0dz:feat/749-thrust-projector-bitmask
