perf(decode): branchy fused decodeSequence on the AVX2 short arm #439
…t arm Replace the unconditional PEXT triple-extract + branchless do_offset_history RULES table with a branch-for-branch port of upstream ZSTD_decodeSequence on the short-block (straight) arm: offset extra-bit read folded into the ofBits>1/==0/==1 branches, repcode history rotation resolved inline. Tests whether the donor branch shape wins in-complex (the isolated branchy-repcode swap regressed; this fuses the offset read with the predicate).
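The branch shape described in this commit can be sketched roughly as follows. This is a minimal illustration of the upstream-style offset resolution, not the code in `seq_decoder_avx2.rs`: the function name `resolve_offset`, its parameters, and the `read_bits` closure are hypothetical, and the history is assumed to hold the three most recent offsets with the newest at index 0.

```rust
/// Hypothetical helper: resolves one sequence's offset and rotates the
/// three-entry repcode history inside the same branch that reads the
/// offset's extra bits. `read_bits(n)` stands in for the bitstream reader.
fn resolve_offset(
    of_bits: u8,
    of_base: u32,
    ll_is_zero: bool,
    hist: &mut [u32; 3],
    mut read_bits: impl FnMut(u8) -> u32,
) -> u32 {
    if of_bits > 1 {
        // Real offset: read the full extra-bit width, push onto the history.
        let offset = of_base + read_bits(of_bits);
        hist[2] = hist[1];
        hist[1] = hist[0];
        hist[0] = offset;
        return offset;
    }

    let ll0 = usize::from(ll_is_zero);
    if of_bits == 0 {
        // No extra bits are read. With a non-zero literal length this reuses
        // hist[0]; with a zero literal length it picks hist[1] and swaps.
        let offset = hist[ll0];
        hist[1] = hist[1 - ll0];
        hist[0] = offset;
        return offset;
    }

    // of_bits == 1: one extra bit selects among the remaining repcodes.
    // Assumes of_base == 1 here, so idx is in 1..=3.
    let idx = of_base as usize + ll0 + read_bits(1) as usize;
    let mut value = if idx == 3 {
        hist[0].wrapping_sub(1)
    } else {
        hist[idx]
    };
    if value == 0 {
        // Zero is never a valid offset; force a sentinel so a later stage
        // can reject the corrupted input.
        value = u32::MAX;
    }
    if idx != 1 {
        hist[2] = hist[1];
    }
    hist[1] = hist[0];
    hist[0] = value;
    value
}
```

For example, starting from a history of `[1, 4, 8]`, a call such as `resolve_offset(5, 29, false, &mut hist, |_| 3)` takes the first branch and leaves the history as `[32, 1, 4]`. Keeping the read and the rotation in one branch means the repcode cases never touch the bit reader for bits they do not need.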
Weave our refill optimization back into the donor-shape decode: the common total<=56 case now does one ensure_bits up front, then reads of/ml/ll via get_bits_unchecked (no per-field refill branch), keeping the winning branchy offset/repcode structure. Wide-offset total>56 falls back to demand-refilled get_bits. Parameterised via the cshape_resolve! reader macro.
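The single-ensure pattern can be illustrated with a toy bit reader. This is only a sketch under stated assumptions: the method names `ensure_bits`, `get_bits`, and `get_bits_unchecked` come from the commit message, but their signatures here are guesses, the `BitReader` type and `read_extra_bits` function are hypothetical, and the reader consumes bits forward and LSB-first for simplicity, whereas the real zstd sequence bitstream is read backward.

```rust
/// Toy forward, LSB-first bit reader (illustration only).
struct BitReader<'a> {
    src: &'a [u8],
    pos: usize, // next byte to load
    acc: u64,   // bit container, next bit at the LSB
    avail: u32, // number of valid bits in `acc`
}

impl<'a> BitReader<'a> {
    fn new(src: &'a [u8]) -> Self {
        Self { src, pos: 0, acc: 0, avail: 0 }
    }

    /// Refill byte by byte until at least `n` bits are buffered (n <= 56).
    fn ensure_bits(&mut self, n: u32) {
        debug_assert!(n <= 56);
        while self.avail < n && self.pos < self.src.len() {
            self.acc |= u64::from(self.src[self.pos]) << self.avail;
            self.pos += 1;
            self.avail += 8;
        }
    }

    /// Read `n` bits without a refill check; the caller must have ensured them.
    fn get_bits_unchecked(&mut self, n: u32) -> u64 {
        debug_assert!(n <= self.avail);
        let value = self.acc & ((1u64 << n) - 1);
        self.acc >>= n;
        self.avail -= n;
        value
    }

    /// Demand-refilled read: one refill check per field.
    fn get_bits(&mut self, n: u32) -> u64 {
        self.ensure_bits(n);
        self.get_bits_unchecked(n)
    }
}

/// Reads the OF/ML/LL extra bits for one sequence.
fn read_extra_bits(
    r: &mut BitReader<'_>,
    of_bits: u32,
    ml_bits: u32,
    ll_bits: u32,
) -> (u64, u64, u64) {
    let total = of_bits + ml_bits + ll_bits;
    if total <= 56 {
        // Common case: one refill up front, then three unchecked reads.
        r.ensure_bits(total);
        let of = r.get_bits_unchecked(of_bits);
        let ml = r.get_bits_unchecked(ml_bits);
        let ll = r.get_bits_unchecked(ll_bits);
        (of, ml, ll)
    } else {
        // Wide offsets: fall back to a refill check per field.
        let of = r.get_bits(of_bits);
        let ml = r.get_bits(ml_bits);
        let ll = r.get_bits(ll_bits);
        (of, ml, ll)
    }
}
```

The 56-bit threshold in this sketch follows from byte-wise refills into a 64-bit container: once fewer than 56 bits remain buffered, one more byte always fits. Both paths return the same values for the same input; only the number of refill checks differs.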
| Filename | Overview |
|---|---|
| zstd/src/decoding/seq_decoder_avx2.rs | Introduces cshape_resolve! and decode_seq_fused_cshape! macros replacing the PEXT-triple + branchless repcode table on the short-arm with upstream zstd's branchy fused decode; repcode history rotation is inlined correctly and the single-ensure optimisation is preserved. |
Reviews (1): Last reviewed commit: "perf(decode): single-ensure + unchecked ..."
Summary
Ports upstream zstd's branchy fused `ZSTD_decodeSequence` (`zstd_decompress_block.c:1228-1346`) onto the AVX2 short-block sequence-decode arm, then weaves our single-refill optimization back on top:
- The offset extra-bit read is folded into the `ofBits > 1 / == 0 / == 1` branches (rep-0 reads zero offset bits, rep-1 one bit, real offsets the full width), and the repcode history rotation is resolved inline in the same branch. This replaces the unconditional triple-bit PEXT extract + separate branchless repcode table for the short-arm path. Verified equal to the previous branchless resolution across its full offset/litlen test matrix.
- The common `total <= 56` case does one `ensure_bits` up front and reads OF/ML/LL through `get_bits_unchecked`, removing the per-field refill branch. The per-sequence vendor-dispatch branch is gone as well (reads route straight through `bzhi`).
Impact
Decode throughput on the decodecorpus `z000033` fixture (i9, BMI2+AVX2, `perf stat` cycles + wall, flat `libzstd` control in the same session): -8.9% decode cycles / wall. Output is byte-identical (the decoder produces the same bytes; only the internal bit-read shape changed). The win is largest where the sequence-execute path dominates (real decodecorpus data).
Only the AVX2 short-block arm changes; the long-pipeline arm, the VBMI2/BMI2
tiers, and the non-x86 paths are untouched.
Testing
- `cargo nextest run -p structured-zstd --features kernel_bmi2`: 815/815 pass on x86_64 (i9), including the `cross_validation` and `roundtrip_integrity` suites.
- `cargo fmt --check` + `cargo clippy --lib` clean.