[QDP] [feature] Pr4 kronecker fwt#1391
Open
aloha1357 wants to merge 6 commits into
Related Issues
related #1385
Changes
Why
For large qubit counts ($N > 12$), the standard Fast Walsh-Hadamard Transform (FWT) becomes severely bound by global-memory bandwidth. The FWT requires $N = \log_2(2^N)$ stages of in-place memory access, each across the entire $2^N$-element state vector, which causes cache thrashing and repeated DRAM round trips.
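To make the access pattern concrete, here is a minimal pure-Python sketch of the butterfly form of the transform. It is an illustration only, not the CUDA kernel from this PR series; the function name `fwht_inplace` is hypothetical, and the transform is left unnormalized. The point is that every stage reads and writes the whole vector once, so a $2^N$-element state is traversed $N$ times.

```python
def fwht_inplace(state):
    """Unnormalized in-place fast Walsh-Hadamard transform (butterfly form).

    Returns the number of stages executed, which equals log2(len(state)).
    """
    size = len(state)
    assert size > 0 and size & (size - 1) == 0, "length must be a power of two"
    stages = 0
    half = 1
    while half < size:
        # One stage: every element of the vector is read and written once.
        for block in range(0, size, 2 * half):
            for j in range(block, block + half):
                a, b = state[j], state[j + half]
                state[j], state[j + half] = a + b, a - b
        half *= 2
        stages += 1
    return stages


# Hypothetical 3-qubit example: 2**3 amplitudes, hence 3 full passes.
vec = [1, 0, 1, 0, 0, 1, 1, 0]
stages = fwht_inplace(vec)
```

On a GPU each of those passes over a large state goes through global memory, which is the traffic the blocked approach aims to reduce.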
To overcome this, we restructure the FWT as a Kronecker product decomposition: $H_n = H_{n/2} \otimes H_{n/2}$. This transforms the sparse, memory-bound butterfly operations into standard dense matrix multiplications (GEMM) using a blocked architecture.
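As a sketch of why this yields GEMMs, the decomposition can be written for a general split $n = n_1 + n_2$ (the even split $n_1 = n_2 = n/2$ is the case above). Here $X$ is assumed to be the state vector viewed as a row-major $2^{n_1} \times 2^{n_2}$ matrix and $Y$ the transformed state in the same layout; this view is an assumption made for illustration.

$$
H_n = H_{n_1} \otimes H_{n_2} = \left(H_{n_1} \otimes I_{2^{n_2}}\right)\left(I_{2^{n_1}} \otimes H_{n_2}\right),
\qquad
Y = H_{n_1}\, X\, H_{n_2}
$$

The first equality pair is the mixed-product property of the Kronecker product. The matrix form follows from $(A \otimes B)\,\mathrm{vec}(X) = \mathrm{vec}(A X B^{\top})$ for row-major $\mathrm{vec}$, together with $H^{\top} = H$. The transform therefore reduces to two dense matrix products, one along each axis of $X$.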
While the implicit Hadamard engine is not yet introduced (coming in PR 5), this PR establishes the structural memory layout, allocation, and transpose logic necessary for the decomposition.
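The structure described above can be sketched in pure Python. This is a hedged illustration, not the PR's CUDA code: all function names are hypothetical, and the step order (GEMM, transpose, GEMM, transpose) is one plausible reading of a blocked pipeline. Hadamard entries are generated on the fly as $(-1)^{\mathrm{popcount}(k \,\&\, i)}$ rather than stored, and the result is checked against a direct $O(4^n)$ reference.

```python
def hadamard_entry(k, i):
    """Entry (k, i) of the unnormalized Sylvester Hadamard matrix."""
    return -1 if bin(k & i).count("1") & 1 else 1


def implicit_hadamard_gemm(block, rows, cols):
    """Apply H along each row: out[r][k] = sum_i H[k][i] * block[r][i].

    The Hadamard matrix is never materialized; entries come from popcount.
    """
    return [[sum(hadamard_entry(k, i) * block[r][i] for i in range(cols))
             for k in range(cols)]
            for r in range(rows)]


def transpose(block, rows, cols):
    """Transpose a rows x cols matrix into a cols x rows matrix."""
    return [[block[r][c] for r in range(rows)] for c in range(cols)]


def kronecker_fwht(state, n1, n2):
    """Blocked transform of a 2**(n1 + n2) vector via two GEMM stages."""
    rows, cols = 1 << n1, 1 << n2
    assert len(state) == rows * cols
    # View the flat state as a row-major rows x cols matrix.
    x = [state[r * cols:(r + 1) * cols] for r in range(rows)]
    x = implicit_hadamard_gemm(x, rows, cols)  # step 1: I ⊗ H_{n2}
    x = transpose(x, rows, cols)               # step 2: now cols x rows
    x = implicit_hadamard_gemm(x, cols, rows)  # step 3: H_{n1} on the old row index
    x = transpose(x, cols, rows)               # step 4: back to rows x cols
    return [value for row in x for value in row]


def reference_wht(state):
    """Direct dense Walsh-Hadamard transform, used only for checking."""
    size = len(state)
    return [sum(hadamard_entry(k, i) * state[i] for i in range(size))
            for k in range(size)]


state = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]
matches = kronecker_fwht(state, 2, 2) == reference_wht(state)
```

Each GEMM stage touches contiguous rows, which is the property that lets a later version hand those stages to dense matrix kernels.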
How
- `launch_iqp_encode_tc` to dynamically split the state vector into two dimensions (`d_temp_real`, `d_temp_imag`) to store the transposed matrix blocks during the 4-step blocked algorithm.
- `naive_implicit_hadamard_gemm_kernel` as a fallback structural placeholder. It calculates the Hadamard values on-the-fly ($\text{popc}(k \,\&\, i)$) and executes the block multiplication
- `iqp_tc_batch_transpose_kernel` (introduced in PR 2) to transpose the blocks between the two GEMM stages, achieving the mathematical equivalent of the

Benchmark Results
Environment: Dev Machine (NVIDIA GeForce RTX 4060 Laptop GPU)
Configuration: Batch Size: 64, Iterations: 30
Script: `qdp/qdp-python/benchmark/benchmark_pr4.py`
Measured: 2026-06-10
pr2-implicit-fwt-rework, batch=64
At N=14 the Kronecker scaffold is ~3.5% slower than global FWT on this GPU. This is expected and acceptable for an architecture PR (see below).
Why submit PR4 even though N=14 is slightly slower
1. This is a structural PR, not the final performance PR.
The global FWT path in PR2/PR3 is highly tuned for in-place butterfly access in DRAM. PR4 intentionally reframes the Walsh-Hadamard transform as a Kronecker-product blocked GEMM (`H_n = H_{n/2} ⊗ H_{n/2}`) so that later stages can call dense matrix kernels. The current implementation uses a naive implicit Hadamard GEMM placeholder: it computes Walsh coefficients on-the-fly with `__popc` but does not yet use Tensor Cores or cuBLAS. Comparing an unoptimized dense GEMM scaffold against a specialized FWT is expected to show parity or slightly worse performance in the short term.
2. The regression is within measurement noise and there is no functional regression.
~3.5% (3.404 ms → 3.523 ms) is inside typical GPU timing variance on a laptop GPU. All IQP binding tests pass (20/20). The public API and numerical outputs remain correct; we are not trading correctness for experimental structure.
3. PR4 solves a problem PR2 and PR3 cannot.
PR3 shared-memory fusion only applies when `N <= 12` because the full state must fit in shared memory. For `N > 12` (e.g. N=14, state size 16384 complex amplitudes), encoding still relies on global-memory FWT with repeated DRAM traversals per stage. PR4's value is converting that large-N workload into a blocked GEMM + transpose pipeline that Tensor Cores and vendor BLAS can accelerate, a path the butterfly kernels cannot take directly.
4. Downstream PRs depend on this layout.
PR4 lands the infrastructure that later PRs plug into:
- `Y = (H_{n1} ⊗ I) (I ⊗ H_{n2}) X`
- (`d_temp_real`, `d_temp_imag`)
- `launch_iqp_encode_tc` dispatch for large N

Checklist