perf(decode): branchy fused decodeSequence on the AVX2 short arm by polaz · Pull Request #439 · structured-world/structured-zstd

polaz · 2026-06-21T13:36:30Z

Summary

Ports upstream zstd's branchy fused ZSTD_decodeSequence
(zstd_decompress_block.c:1228-1346) onto the AVX2 short-block
sequence-decode arm, then weaves our single-refill optimization back on top:

- Branchy fused decode: the offset extra-bit read is folded into the
  ofBits > 1 / == 0 / == 1 branches (rep-0 reads zero offset bits, rep-1 one
  bit, real offsets the full width), and the repcode history rotation is
  resolved inline in the same branch. This replaces the unconditional
  triple-bit PEXT extract + separate branchless repcode table for the
  short-arm path. Verified equal to the previous branchless resolution across
  its full offset/litlen test matrix.
- Single-ensure + unchecked reads: the common total <= 56 case does one
  ensure_bits up front and reads OF/ML/LL through get_bits_unchecked,
  removing the per-field refill branch. The per-sequence vendor-dispatch branch
  is gone as well (reads route straight through bzhi).
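
The branchy fused decode in the first item can be sketched roughly as follows. This is a minimal illustration, not the crate's `cshape_resolve!` macro: the function name, the `read_bits` closure standing in for the bit reader, and the convention that `of_base` is the unbiased `1 << of_code` (so real offsets subtract 3) are all assumptions made for the example.

```rust
/// Sketch of a branchy, fused offset resolve (hypothetical helper).
///
/// `read_bits(n)` is a stand-in for the bit reader and is only called inside
/// the arm that needs it. `hist` is the three-entry repcode history.
fn resolve_offset_branchy(
    of_bits: u8,
    of_base: u32,
    ll_is_zero: bool,
    hist: &mut [u32; 3],
    mut read_bits: impl FnMut(u8) -> u32,
) -> u32 {
    if of_bits > 1 {
        // Real offset: read the full width, drop the bias of 3, push history.
        // Assumes of_base >= 4 here, so the subtraction cannot underflow.
        let offset = of_base + read_bits(of_bits) - 3;
        hist[2] = hist[1];
        hist[1] = hist[0];
        hist[0] = offset;
        return offset;
    }
    let ll0 = ll_is_zero as usize;
    if of_bits == 0 {
        // Rep-0 arm: no offset bits are read at all.
        let offset = hist[ll0];
        hist[1] = hist[1 - ll0];
        hist[0] = offset;
        offset
    } else {
        // Rep-1 arm: one bit; a zero literal length shifts the code by one.
        let code = 1 + ll0 + read_bits(1) as usize;
        // A zero result signals corrupt input; this sketch leaves that check
        // to the caller.
        let offset = if code == 3 { hist[0].wrapping_sub(1) } else { hist[code] };
        if code != 1 {
            hist[2] = hist[1];
        }
        hist[1] = hist[0];
        hist[0] = offset;
        offset
    }
}
```

Passing the reader as a closure keeps the read inside each arm, which is the point of the fused shape: the rep-0 arm never touches the bitstream, and the rep-1 arm reads exactly one bit.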

Impact

Decode throughput on the decodecorpus z000033 fixture (i9, BMI2+AVX2,
perf stat cycles + wall, flat libzstd control in the same session):

| | cycles | time/iter | vs libzstd |
|---|---|---|---|
| base (after #437) | 214.2B | 1697µs | 1.43× |
| this PR | 194.7B | 1545µs | 1.30× |

-8.9% decode cycles / wall. Output is byte-identical (the decoder produces
the same bytes; only the internal bit-read shape changed). The win is largest
where the sequence-execute path dominates (real decodecorpus data).

Only the AVX2 short-block arm changes; the long-pipeline arm, the VBMI2/BMI2
tiers, and the non-x86 paths are untouched.

Testing

- `cargo nextest run -p structured-zstd --features kernel_bmi2`: 815/815
  pass on x86_64 (i9), including the cross_validation and
  roundtrip_integrity suites.
- `cargo fmt --check` + `cargo clippy --lib` clean.

Summary by CodeRabbit

Refactor
- Enhanced ZSTD decompression efficiency through improved sequence decoding and offset resolution processing.

…t arm Replace the unconditional PEXT triple-extract + branchless do_offset_history RULES table with a branch-for-branch port of upstream ZSTD_decodeSequence on the short-block (straight) arm: offset extra-bit read folded into the ofBits>1/==0/==1 branches, repcode history rotation resolved inline. Tests whether the donor branch shape wins in-complex (the isolated branchy-repcode swap regressed; this fuses the offset read with the predicate).

Weave our refill optimization back into the donor-shape decode: the common total<=56 case now does one ensure_bits up front, then reads of/ml/ll via get_bits_unchecked (no per-field refill branch), keeping the winning branchy offset/repcode structure. Wide-offset total>56 falls back to demand-refilled get_bits. Parameterised via the cshape_resolve! reader macro.
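
The single-refill idea in this commit message can be illustrated with a toy bit reader. This is a sketch under stated assumptions: `BitReader`, `read_extras`, and the method signatures are hypothetical, and the reader consumes bits forward and LSB-first for simplicity, so it does not model zstd's backward sequence bitstream or the crate's real `ensure_bits` / `get_bits_unchecked`.

```rust
/// Toy forward, LSB-first bit reader (hypothetical; not the crate's type).
struct BitReader<'a> {
    src: &'a [u8],
    pos: usize, // next byte to load
    acc: u64,   // bit container
    avail: u8,  // number of valid bits in `acc`
}

impl<'a> BitReader<'a> {
    fn new(src: &'a [u8]) -> Self {
        Self { src, pos: 0, acc: 0, avail: 0 }
    }

    /// Refill until at least `n` bits are buffered. With n <= 56 a whole
    /// byte always fits in the 64-bit container. Bytes past the end of
    /// `src` read as zero.
    fn ensure_bits(&mut self, n: u8) {
        debug_assert!(n <= 56);
        while self.avail < n {
            let byte = self.src.get(self.pos).copied().unwrap_or(0);
            self.acc |= (byte as u64) << self.avail;
            self.avail += 8;
            self.pos += 1;
        }
    }

    /// Read `n` bits without a refill check; the caller must have ensured them.
    fn get_bits_unchecked(&mut self, n: u8) -> u64 {
        debug_assert!(n <= self.avail);
        let value = self.acc & ((1u64 << n) - 1);
        self.acc >>= n;
        self.avail -= n;
        value
    }

    /// Demand-refilled read: one refill check per field.
    fn get_bits(&mut self, n: u8) -> u64 {
        self.ensure_bits(n);
        self.get_bits_unchecked(n)
    }
}

/// Read the OF/ML/LL extra bits with one up-front refill when they fit.
fn read_extras(r: &mut BitReader, of_bits: u8, ml_bits: u8, ll_bits: u8) -> (u64, u64, u64) {
    let total = of_bits as u32 + ml_bits as u32 + ll_bits as u32;
    if total <= 56 {
        // Common case: a single refill covers all three fields.
        r.ensure_bits(total as u8);
        let of = r.get_bits_unchecked(of_bits);
        let ml = r.get_bits_unchecked(ml_bits);
        let ll = r.get_bits_unchecked(ll_bits);
        (of, ml, ll)
    } else {
        // Wide-offset case: fall back to a refill check per field.
        let of = r.get_bits(of_bits);
        let ml = r.get_bits(ml_bits);
        let ll = r.get_bits(ll_bits);
        (of, ml, ll)
    }
}
```

In this toy reader the 56-bit threshold follows from the container size: with fewer than 56 bits buffered, one more byte still fits in a 64-bit accumulator.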

coderabbitai · 2026-06-21T13:36:40Z

No actionable comments were generated in the recent review. 🎉


📥 Commits

Reviewing files that changed from the base of the PR and between 198ae74 and 9b8eb58.

📒 Files selected for processing (1)

zstd/src/decoding/seq_decoder_avx2.rs

Walkthrough

Two new macros—cshape_resolve! and decode_seq_fused_cshape!—are added to the AVX2 sequence decoder. They fuse ZSTD offset/repcode resolution, in-place history rotation, and ml/ll extra-bit reads into a single decode step. The AVX2 short-block loop is updated to call the fused macro instead of the previous two-step decode_one_body! + do_offset_history(...) sequence.

Changes

AVX2 Fused Sequence Decode

| Layer / File(s) | Summary |
|---|---|
| `cshape_resolve!` and `decode_seq_fused_cshape!` macros (`zstd/src/decoding/seq_decoder_avx2.rs`) | Introduces `cshape_resolve!` for inline offset/repcode branching and in-place `hist` rotation, and `decode_seq_fused_cshape!` that wraps it with ml/ll extra-bit reads. Selects a single `ensure_bits(total)` fast path when `total <= 56`, otherwise uses demand-refilled reads. Asserts nonzero resolved offset on exit. |
| Short-block loop wired to fused macro (`zstd/src/decoding/seq_decoder_avx2.rs`) | Replaces the per-sequence `decode_one_body!` + `do_offset_history(...)` two-step with one `decode_seq_fused_cshape!` call returning `(seq_ll, seq_ml, resolved_offset)`, forwarded to `execute_one_body!`. Error capture and sequence accounting remain structurally unchanged. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

structured-world/structured-zstd#55: Shares the same pattern of batching a single ensure_bits check followed by unchecked bit reads for LL/ML/OF decode steps.
structured-world/structured-zstd#291: Directly overlaps—introduces the monolithic AVX2 decode+execute pipeline in seq_decoder_avx2.rs that this PR extends with the fused cshape macro.
structured-world/structured-zstd#283: Both changes evolve the pipelined executor to carry a resolved actual_offset rather than the raw sequence offset field through to execution.


🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately captures the main performance optimization: branching-based fused decodeSequence implementation on the AVX2 short-block decode path, which is the core change across the modified file. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


codecov · 2026-06-21T13:43:31Z

Codecov Report

❌ Patch coverage is 66.66667% with 3 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| zstd/src/decoding/seq_decoder_avx2.rs | 66.66% | 3 Missing ⚠️ |


greptile-apps · 2026-06-21T14:01:11Z

Greptile Summary

Replaces the PEXT-triple extraction + branchless repcode table on the AVX2 short-block arm with upstream zstd's branchy fused ZSTD_decodeSequence shape, inlining offset/repcode resolution and history rotation directly into the of_bits > 1 / == 0 / == 1 branches, and weaves back the single ensure_bits optimisation so the common total ≤ 56 path still does one refill upfront and all reads go through get_bits_unchecked.

- cshape_resolve! handles the three-way split on of_bits: real offsets subtract 3 and rotate history, rep-0 (of_bits == 0) uses ll_base == 0 (equivalent to ll == 0 for all valid FSE tables) to pick hist[0] vs hist[1], and the rep-1 arm (of_bits == 1) computes off_code = 1 + ll0 + bit, mapping all four (ll0, bit) combinations to the correct history slot. This was verified equivalent to do_offset_history across all cases.
- decode_seq_fused_cshape! guards with total ≤ 56 before calling ensure_bits(total as u8), where total = of_bits + ml_bits + ll_bits is a safe upper-bound on actual consumption across every arm, so get_bits_unchecked calls that follow are always in-bound.

Confidence Score: 5/5

Safe to merge — only the AVX2 short-block arm changes, the repcode arithmetic is provably equivalent to the existing do_offset_history reference, and the single-refill bound is correctly maintained.

The three-way of_bits split covers all ZSTD offset codes correctly: real offsets reduce cleanly via (of_base + raw) - 3, the rep-0 no-bits arm uses ll_base == 0 which is a sound proxy for ll == 0 across both default and custom FSE tables (symbol 0 always has base_value=0, num_bits=0), and the rep-1 one-bit arm's off_code = 1 + ll0 + bit formula maps all four (ll0, bit) combinations to the correct history slot. History rotation was manually cross-checked against do_offset_history for every combination. The ensure_bits(total) call is conservative for the rep-0 arm (where offset consumes 0 bits) and exact for the other two arms, so get_bits_unchecked is never called with insufficient bits in the register.
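
The kind of cross-check described above can be sketched as an exhaustive comparison over the repcode arms. The helper names below are hypothetical, and `rep_reference` is written directly from the Zstandard repeat-offset rules as a stand-in; it is not the crate's `do_offset_history`.

```rust
/// Branchy repcode arms (`of_bits` is 0 or 1), in the shape the review describes.
fn rep_branchy(of_bits: u8, bit: usize, ll0: usize, hist: &mut [u32; 3]) -> u32 {
    if of_bits == 0 {
        let off = hist[ll0];
        hist[1] = hist[1 - ll0];
        hist[0] = off;
        off
    } else {
        let code = 1 + ll0 + bit;
        let off = if code == 3 { hist[0].wrapping_sub(1) } else { hist[code] };
        if code != 1 {
            hist[2] = hist[1];
        }
        hist[1] = hist[0];
        hist[0] = off;
        off
    }
}

/// Reference resolution written from the format rules (assumed stand-in).
fn rep_reference(offset_value: u32, ll0: usize, hist: &mut [u32; 3]) -> u32 {
    let [h0, h1, h2] = *hist;
    // Repeat index 1..=4; a zero literal length shifts it up by one.
    match offset_value as usize + ll0 {
        1 => h0, // history unchanged
        2 => {
            *hist = [h1, h0, h2];
            h1
        }
        3 => {
            *hist = [h2, h0, h1];
            h2
        }
        _ => {
            let off = h0.wrapping_sub(1);
            *hist = [off, h0, h1];
            off
        }
    }
}

/// Compare both resolvers on every (of_bits, bit, ll0) combination.
fn count_mismatches() -> usize {
    let start = [7u32, 11, 13];
    let mut mismatches = 0;
    for of_bits in 0u8..=1 {
        for bit in 0usize..=of_bits as usize {
            for ll0 in 0usize..=1 {
                let (mut a, mut b) = (start, start);
                // Offset_Value is 1 for the zero-bit code, 2 + bit for the one-bit code.
                let offset_value = if of_bits == 0 { 1 } else { 2 + bit as u32 };
                let ra = rep_branchy(of_bits, bit, ll0, &mut a);
                let rb = rep_reference(offset_value, ll0, &mut b);
                if ra != rb || a != b {
                    mismatches += 1;
                }
            }
        }
    }
    mismatches
}
```

Comparing both the returned offset and the resulting history matters here, since two resolvers can agree on the offset while rotating the history differently.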

No files require special attention.

Important Files Changed

Filename	Overview
zstd/src/decoding/seq_decoder_avx2.rs	Introduces `cshape_resolve!` and `decode_seq_fused_cshape!` macros replacing the PEXT-triple + branchless repcode table on the short-arm with upstream zstd's branchy fused decode; repcode history rotation is inlined correctly and the single-ensure optimisation is preserved.


polaz added 2 commits June 21, 2026 16:33

polaz merged commit aed82fc into main Jun 21, 2026
28 checks passed

polaz deleted the perf/decode-branchy-fused-decodeseq branch June 21, 2026 14:03

sw-release-bot Bot mentioned this pull request Jun 21, 2026

chore: release v0.0.43 #438

Open
