
perf(decode): branchy fused decodeSequence on the AVX2 short arm#439

Merged
polaz merged 2 commits into
mainfrom
perf/decode-branchy-fused-decodeseq
Jun 21, 2026
@polaz polaz commented Jun 21, 2026

Member

Summary

Ports upstream zstd's branchy fused ZSTD_decodeSequence
(zstd_decompress_block.c:1228-1346) onto the AVX2 short-block
sequence-decode arm, then weaves our single-refill optimization back on top:

  • Branchy fused decode — the offset extra-bit read is folded INTO the
    ofBits > 1 / == 0 / == 1 branches (rep-0 reads zero offset bits, rep-1 one
    bit, real offsets the full width), and the repcode history rotation is
    resolved inline in the same branch. This replaces the unconditional
    triple-bit PEXT extract + separate branchless repcode table for the
    short-arm path. Verified equal to the previous branchless resolution across
    its full offset/litlen test matrix.
  • Single-ensure + unchecked reads — the common total <= 56 case does one
    ensure_bits up front and reads OF/ML/LL through get_bits_unchecked,
    removing the per-field refill branch. The per-sequence vendor-dispatch branch
    is gone as well (reads route straight through bzhi).
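The branch shape described in the first bullet can be sketched roughly as follows. This is a minimal illustration that assumes the offset base is `1 << of_bits`; the function name `resolve_offset_branchy` and the `read_bits` closure are hypothetical stand-ins and are not the crate's `cshape_resolve!` macro.

```rust
/// Hypothetical sketch of a branchy fused offset resolve.
/// `hist` is the repcode history, `ll0` is true when the literal length is 0,
/// and `read_bits(n)` stands in for the bit reader (an assumption of this sketch).
fn resolve_offset_branchy(
    of_bits: u8,
    ll0: bool,
    hist: &mut [u32; 3],
    mut read_bits: impl FnMut(u8) -> u32,
) -> u32 {
    if of_bits > 1 {
        // Real offset: read the full extra-bit width, then push onto the history.
        let raw = read_bits(of_bits);
        let offset = ((1u32 << of_bits) + raw) - 3;
        hist[2] = hist[1];
        hist[1] = hist[0];
        hist[0] = offset;
        offset
    } else if of_bits == 0 {
        // Rep-0 arm: no offset bits are read at all.
        let idx = ll0 as usize;
        let offset = hist[idx];
        hist[1] = hist[1 - idx]; // swaps slots 0 and 1 only when ll0 is set
        hist[0] = offset;
        offset
    } else {
        // Rep-1 arm: exactly one offset bit is read.
        let code = 1 + ll0 as u32 + read_bits(1);
        let offset = if code == 3 {
            hist[0].wrapping_sub(1)
        } else {
            hist[code as usize]
        };
        // A zero result can only come from corrupt input; this sketch leaves
        // that check to the caller.
        if code != 1 {
            hist[2] = hist[1];
        }
        hist[1] = hist[0];
        hist[0] = offset;
        offset
    }
}
```

With the history at `[1, 4, 8]`, the rep-0 arm with a nonzero literal length returns 1 without touching the bit reader, while a 5-bit real offset with extra bits 7 resolves to 36 and shifts the history to `[36, 1, 4]`.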

Impact

Decode throughput on the decodecorpus z000033 fixture (i9, BMI2+AVX2,
perf stat cycles + wall, flat libzstd control in the same session):

| | cycles | time/iter | vs libzstd |
|---|---|---|---|
| base (after #437) | 214.2B | 1697µs | 1.43× |
| this PR | 194.7B | 1545µs | 1.30× |

-8.9% decode cycles / wall. Output is byte-identical (the decoder produces
the same bytes; only the internal bit-read shape changed). The win is largest
where the sequence-execute path dominates (real decodecorpus data).

Only the AVX2 short-block arm changes; the long-pipeline arm, the VBMI2/BMI2
tiers, and the non-x86 paths are untouched.

Testing

  • cargo nextest run -p structured-zstd --features kernel_bmi2: 815/815
    pass on x86_64 (i9), including the cross_validation and
    roundtrip_integrity suites.
  • cargo fmt --check + cargo clippy --lib clean.

Summary by CodeRabbit

  • Refactor
    • Enhanced ZSTD decompression efficiency through improved sequence decoding and offset resolution processing.

polaz added 2 commits June 21, 2026 16:33
…t arm

Replace the unconditional PEXT triple-extract + branchless do_offset_history
RULES table with a branch-for-branch port of upstream ZSTD_decodeSequence on
the short-block (straight) arm: offset extra-bit read folded into the
ofBits>1/==0/==1 branches, repcode history rotation resolved inline. Tests
whether the donor branch shape wins in-complex (the isolated branchy-repcode
swap regressed; this fuses the offset read with the predicate).

Weave our refill optimization back into the donor-shape decode: the common
total<=56 case now does one ensure_bits up front, then reads of/ml/ll via
get_bits_unchecked (no per-field refill branch), keeping the winning branchy
offset/repcode structure. Wide-offset total>56 falls back to demand-refilled
get_bits. Parameterised via the cshape_resolve! reader macro.
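The single-ensure fast path described in this commit message can be sketched as below. This is a simplified illustration only: it reads bits LSB-first from the front of a slice, whereas zstd's sequence bitstream is read backwards, and the `BitReader` type and `read_fields` function are hypothetical rather than the crate's actual reader or macros. The method names mirror those mentioned above.

```rust
/// Simplified LSB-first bit reader (hypothetical, for illustration only).
struct BitReader<'a> {
    src: &'a [u8],
    pos: usize,
    container: u64,
    avail: u8,
}

impl<'a> BitReader<'a> {
    fn new(src: &'a [u8]) -> Self {
        BitReader { src, pos: 0, container: 0, avail: 0 }
    }

    /// Refill byte by byte until at least `n` bits are buffered (n <= 56).
    /// Past the end of input this sketch pads with zero bytes; a real
    /// decoder would report an error instead.
    fn ensure_bits(&mut self, n: u8) {
        debug_assert!(n <= 56);
        while self.avail < n {
            let byte = self.src.get(self.pos).copied().unwrap_or(0);
            self.container |= (byte as u64) << self.avail;
            self.avail += 8;
            self.pos += 1;
        }
    }

    /// Read `n` bits with no refill check; the caller must have ensured them.
    fn get_bits_unchecked(&mut self, n: u8) -> u64 {
        debug_assert!(n <= self.avail);
        let value = self.container & ((1u64 << n) - 1);
        self.container >>= n;
        self.avail -= n;
        value
    }

    /// Demand-refilled read: one refill check per field.
    fn get_bits(&mut self, n: u8) -> u64 {
        self.ensure_bits(n);
        self.get_bits_unchecked(n)
    }
}

/// Read the OF/ML/LL extra bits, taking the single-ensure path when they fit.
fn read_fields(r: &mut BitReader<'_>, of_bits: u8, ml_bits: u8, ll_bits: u8) -> (u64, u64, u64) {
    let total = of_bits as u32 + ml_bits as u32 + ll_bits as u32;
    if total <= 56 {
        // Common case: one refill up front, then three unchecked reads.
        r.ensure_bits(total as u8);
        let of = r.get_bits_unchecked(of_bits);
        let ml = r.get_bits_unchecked(ml_bits);
        let ll = r.get_bits_unchecked(ll_bits);
        (of, ml, ll)
    } else {
        // Wide case: fall back to per-field demand refill.
        let of = r.get_bits(of_bits);
        let ml = r.get_bits(ml_bits);
        let ll = r.get_bits(ll_bits);
        (of, ml, ll)
    }
}
```

In this sketch the 56-bit threshold falls out of the byte-granular refill: with fewer than 56 bits buffered, one more byte always fits in the 64-bit container. Both paths consume the same bits in the same order, so they return identical field values; only the number of refill checks differs.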

coderabbitai Bot commented Jun 21, 2026

No actionable comments were generated in the recent review. 🎉

📥 Commits

Reviewing files that changed from the base of the PR and between 198ae74 and 9b8eb58.

📒 Files selected for processing (1)
  • zstd/src/decoding/seq_decoder_avx2.rs

Walkthrough

Two new macros—cshape_resolve! and decode_seq_fused_cshape!—are added to the AVX2 sequence decoder. They fuse ZSTD offset/repcode resolution, in-place history rotation, and ml/ll extra-bit reads into a single decode step. The AVX2 short-block loop is updated to call the fused macro instead of the previous two-step decode_one_body! + do_offset_history(...) sequence.

Changes

AVX2 Fused Sequence Decode

| Layer / File(s) | Summary |
|---|---|
| cshape_resolve! and decode_seq_fused_cshape! macros (zstd/src/decoding/seq_decoder_avx2.rs) | Introduces cshape_resolve! for inline offset/repcode branching and in-place hist rotation, and decode_seq_fused_cshape! that wraps it with ml/ll extra-bit reads. Selects a single ensure_bits(total) fast path when total <= 56, otherwise uses demand-refilled reads. Asserts nonzero resolved offset on exit. |
| Short-block loop wired to fused macro (zstd/src/decoding/seq_decoder_avx2.rs) | Replaces the per-sequence decode_one_body! + do_offset_history(...) two-step with one decode_seq_fused_cshape! call returning (seq_ll, seq_ml, resolved_offset), forwarded to execute_one_body!. Error capture and sequence accounting remain structurally unchanged. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 Hop, hop, the bits align,
No two-step shuffle, one fused line!
cshape_resolve! spins the hist in place,
Fast path chosen with unchecked grace.
The rabbit merges, leaps ahead—
One macro rules the decode thread! 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately captures the main performance optimization: branching-based fused decodeSequence implementation on the AVX2 short-block decode path, which is the core change across the modified file. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


codecov Bot commented Jun 21, 2026

Codecov Report

❌ Patch coverage is 66.66667% with 3 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| zstd/src/decoding/seq_decoder_avx2.rs | 66.66% | 3 Missing ⚠️ |


greptile-apps Bot commented Jun 21, 2026

Greptile Summary

Replaces the PEXT-triple extraction + branchless repcode table on the AVX2 short-block arm with upstream zstd's branchy fused ZSTD_decodeSequence shape, inlining offset/repcode resolution and history rotation directly into the of_bits > 1 / == 0 / == 1 branches, and weaves back the single ensure_bits optimisation so the common total ≤ 56 path still does one refill upfront and all reads go through get_bits_unchecked.

  • cshape_resolve! handles the three-way split on of_bits: real offsets subtract 3 and rotate history, rep-0 (of_bits == 0) uses ll_base == 0 (equivalent to ll == 0 for all valid FSE tables) to pick hist[0] vs hist[1], and the rep-1 arm (of_bits == 1) computes off_code = 1 + ll0 + bit mapping all four (ll0, bit) combinations to the correct history slot — verified equivalent to do_offset_history across all cases.
  • decode_seq_fused_cshape! guards with total ≤ 56 before calling ensure_bits(total as u8), where total = of_bits + ml_bits + ll_bits is a safe upper-bound on actual consumption across every arm, so get_bits_unchecked calls that follow are always in-bound.
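The four-way mapping claimed for the rep-1 arm can be checked mechanically. The sketch below compares the `1 + ll0 + bit` formula against the repeat-offset rules of the Zstandard format (where the one-bit offset code yields an offset value of 2 or 3, shifted by one when the literal length is zero). The `Slot` enum and both function names are hypothetical and exist only for this illustration; the crate's own reference is do_offset_history, which is not reproduced here.

```rust
/// Which history source the rep-1 arm resolves to (hypothetical type).
#[derive(Debug, PartialEq, Clone, Copy)]
enum Slot {
    Hist1,       // second most recent offset
    Hist2,       // third most recent offset
    Hist0Minus1, // most recent offset minus one
}

/// Fused formula from the review summary: off_code = 1 + ll0 + bit.
fn slot_from_fused(ll0: bool, bit: u32) -> Slot {
    match 1 + ll0 as u32 + bit {
        1 => Slot::Hist1,
        2 => Slot::Hist2,
        _ => Slot::Hist0Minus1,
    }
}

/// Format-level rule: with one offset bit, offset_value = (1 << 1) + bit.
/// Values 1..=3 are repeat codes, shifted by one when the literal length is 0.
fn slot_from_format(ll0: bool, bit: u32) -> Slot {
    let offset_value = 2 + bit;
    match (ll0, offset_value) {
        (false, 2) => Slot::Hist1,
        (false, _) => Slot::Hist2,
        (true, 2) => Slot::Hist2,
        (true, _) => Slot::Hist0Minus1,
    }
}

/// Returns true when both mappings agree on all four (ll0, bit) combinations.
fn rep1_mappings_agree() -> bool {
    [false, true].iter().all(|&ll0| {
        (0..2u32).all(|bit| slot_from_fused(ll0, bit) == slot_from_format(ll0, bit))
    })
}
```

The input space is only four cases, so an exhaustive comparison like this is cheap enough to keep as a unit test next to the branchy arm.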

Confidence Score: 5/5

Safe to merge — only the AVX2 short-block arm changes, the repcode arithmetic is provably equivalent to the existing do_offset_history reference, and the single-refill bound is correctly maintained.

The three-way of_bits split covers all ZSTD offset codes correctly: real offsets reduce cleanly via (of_base + raw) - 3, the rep-0 no-bits arm uses ll_base == 0 which is a sound proxy for ll == 0 across both default and custom FSE tables (symbol 0 always has base_value=0, num_bits=0), and the rep-1 one-bit arm's off_code = 1 + ll0 + bit formula maps all four (ll0, bit) combinations to the correct history slot. History rotation was manually cross-checked against do_offset_history for every combination. The ensure_bits(total) call is conservative for the rep-0 arm (where offset consumes 0 bits) and exact for the other two arms, so get_bits_unchecked is never called with insufficient bits in the register.

No files require special attention.

Important Files Changed

| Filename | Overview |
|---|---|
| zstd/src/decoding/seq_decoder_avx2.rs | Introduces cshape_resolve! and decode_seq_fused_cshape! macros replacing the PEXT-triple + branchless repcode table on the short-arm with upstream zstd's branchy fused decode; repcode history rotation is inlined correctly and the single-ensure optimisation is preserved. |

Reviews (1): Last reviewed commit: "perf(decode): single-ensure + unchecked ..."

@polaz polaz merged commit aed82fc into main Jun 21, 2026
28 checks passed
@polaz polaz deleted the perf/decode-branchy-fused-decodeseq branch June 21, 2026 14:03