perf(decode): HUF table-fill by code-length group + scalarize 4-stream burst by polaz · Pull Request #437 · structured-world/structured-zstd

polaz · 2026-06-21T09:53:34Z

Summary

Two structural instruction-count reductions on the decodecorpus decode hot path, validated cycle-for-cycle on i9-9900K (z000033, perf stat, flat c_ffi control).

- HUF decode-table fill by code-length group. Port the upstream zstd HUF_readDTableX1_wksp fill shape: counting-sort the symbols by code length, then fill the table one length GROUP at a time. Within a group the run length and entry nbBits are constant, so a match on the run length hoists out of the symbol loop and the optimiser emits a straight-line specialised store per group (1/2 scalar, 4/8/16+ via 64-bit broadcast). The previous per-symbol form recomputed a variable run length in a runtime-trip-count loop that blocked unrolling. build_decoder self-time drops from 9.6% to 2.4%.
- Scalarize the 4-stream HUF burst registers. The literal burst held its per-stream state in [_; 4] arrays; cursors arrived through a caller-owned &mut [usize; 4] reference, so it could never be promoted to registers, and every decoded symbol did a memory RMW. Scalarize all four streams into named locals so they stay register-resident across the whole burst+reload, matching the upstream register-resident loop.

Measured impact (i9-9900K, z000033 decodecorpus, 8000 iters)

Configuration	wall ns/iter	rust/ffi
baseline	1,816,860	1.53x
+ table fill	1,721,220	1.45x
+ scalarize	1,685,923	1.42x
c_ffi (libzstd)	1,185,936	1.00x

Net -7.2% wall on decodecorpus decode (baseline to + scalarize). Both changes are kernel-independent (scalar code), so the win carries to every CPU tier. The win is instruction-count driven (our IPC already exceeds libzstd on this path), so the cycle/wall delta tracks the instruction reduction.
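For anyone re-deriving the table, the rust/ffi column and the net delta follow directly from the wall times. A minimal sketch of that arithmetic, with hypothetical helper names (`ratio`, `pct_change`) that are not part of the PR:

```rust
// Wall times in ns/iter, copied from the table above.
const BASELINE_NS: f64 = 1_816_860.0;
const TABLE_FILL_NS: f64 = 1_721_220.0;
const SCALARIZE_NS: f64 = 1_685_923.0;
const C_FFI_NS: f64 = 1_185_936.0;

/// Hypothetical helper: how many times slower the Rust arm is than the c_ffi control.
fn ratio(rust_ns: f64, ffi_ns: f64) -> f64 {
    rust_ns / ffi_ns
}

/// Hypothetical helper: signed percentage change from `before` to `after`.
fn pct_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// Formats one table row: label, rust/ffi ratio, and change versus baseline.
fn row(label: &str, ns: f64) -> String {
    format!(
        "{label}: {:.2}x vs c_ffi, {:+.1}% vs baseline",
        ratio(ns, C_FFI_NS),
        pct_change(BASELINE_NS, ns)
    )
}
```

Calling `row("+ scalarize", SCALARIZE_NS)` reproduces the last Rust row of the table and the net figure quoted above.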

Testing

cargo nextest run -p structured-zstd --features std,hash,dict_builder: 823/823
cargo nextest run -p ffi-bench --test cross_validation: 23/23 (byte-identical output)
cargo clippy --features hash,std,dict_builder -- -D warnings: clean
Bench on i9-9900K, 5 stable runs, flat c_ffi control arm

Output is byte-identical pre/post (decode tables + burst produce the same bytes).

Summary by CodeRabbit

Refactor
- Streamlined the HUF 4-stream “burst” decoding flow to use more efficient internal state handling for faster decompression.
- Reworked Huffman decode lookup-table construction to fill entries by length groups using a sorting/counting approach.
Performance
- Improved decompression speed through optimized burst decoding and more efficient lookup-table generation patterns.

Port the upstream zstd HUF_readDTableX1_wksp fill shape: counting-sort the symbols by code length, then fill the decode table one length GROUP at a time instead of one symbol at a time. Within a group the run length (1 << (max_bits - len)) and the entry nbBits are constant, so a match on the run length hoists out of the symbol loop and the optimiser emits a straight-line specialised store per group (1/2 scalar, 4/8/16+ via 64-bit broadcast). The old per-symbol form recomputed a variable run length and used a runtime-trip-count while loop that blocked unrolling.
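The fill shape described above can be sketched in safe Rust. This is an illustration only, not the code in huff0_decoder.rs: `Entry`, `build_table`, the 11-bit cap, and the longest-codes-first group order are assumptions made for the sketch, and the real fill uses specialised unsafe stores rather than `slice::fill`.

```rust
/// One decode-table slot: the decoded symbol and the bits its code consumes.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
struct Entry {
    symbol: u8,
    nb_bits: u8,
}

/// Hypothetical helper: builds a decode table from per-symbol code lengths
/// (0 = unused symbol). Returns None if the lengths do not tile the table exactly.
fn build_table(lens: &[u8], max_bits: u8) -> Option<Vec<Entry>> {
    // The cap of 11 bits is an assumption chosen for this sketch.
    if max_bits == 0 || max_bits > 11 || lens.len() > 256 {
        return None;
    }
    let max = max_bits as usize;
    let table_size = 1usize << max;

    // 1. Histogram of code lengths.
    let mut count = vec![0usize; max + 1];
    for &len in lens {
        if len as usize > max {
            return None;
        }
        count[len as usize] += 1;
    }

    // 2. Prefix sum: start of each length group in `sorted` (longest codes first).
    let mut offset = vec![0usize; max + 1];
    let mut used = 0;
    for len in (1..=max).rev() {
        offset[len] = used;
        used += count[len];
    }

    // 3. Counting sort: place each used symbol into its length group.
    let mut sorted = vec![0u8; used];
    let mut cursor = offset.clone();
    for (sym, &len) in lens.iter().enumerate() {
        if len != 0 {
            sorted[cursor[len as usize]] = sym as u8;
            cursor[len as usize] += 1;
        }
    }

    // 4. Group fill: run_len and nb_bits are constant within a group.
    let mut table = vec![Entry::default(); table_size];
    let mut pos = 0;
    for len in (1..=max).rev() {
        let run_len = 1usize << (max - len);
        let group = &sorted[offset[len]..offset[len] + count[len]];
        if pos + group.len() * run_len > table_size {
            return None; // over-subscribed code
        }
        for &symbol in group {
            table[pos..pos + run_len].fill(Entry { symbol, nb_bits: len as u8 });
            pos += run_len;
        }
    }
    (pos == table_size).then_some(table)
}
```

Because `run_len` is computed once per group, the inner loop body has a fixed store width, which is the property the PR relies on when it dispatches on the run length outside the symbol loop.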

The 4-stream literal burst held its per-stream state (bits, ip, nb_bits_last, cursors) in [_; 4] arrays. cursors arrived through a &mut [usize; 4] reference (caller-owned), so it could never be promoted to registers; every decoded symbol did a memory RMW on cursors[s], and the array bits/ip reloaded from the stack too. Profiling z000033 decode showed the burst at ~23.6 instructions per literal symbol versus libzstd's hand-written 4X1 fast loop at ~1.1 — a ~20x gap, the single largest decodecorpus decode divergence. Scalarize all four per-stream registers into named locals (b0..b3, ip0..ip3, nbl0..nbl3, c0..c3) so the optimiser keeps them in registers across the whole burst+reload, matching the upstream register-resident loop. Cursors are copied in at entry and written back before the drain. Byte-identical output (823 lib + 23 cross-validation tests).
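The copy-in / write-back pattern can be shown on a toy loop. The sketch below is an assumption-laden stand-in: it copies one byte per stream per step instead of decoding Huffman symbols, and `burst_array` / `burst_scalar` are hypothetical names, not functions from literals_section_decoder.rs. Whether the locals actually stay in registers depends on the optimiser and target.

```rust
/// Array form: each step reads and updates `cursors[s]` through the caller's reference.
fn burst_array(src: [&[u8]; 4], cursors: &mut [usize; 4], out: &mut Vec<u8>, steps: usize) {
    for _ in 0..steps {
        for s in 0..4 {
            out.push(src[s][cursors[s]]);
            cursors[s] += 1;
        }
    }
}

/// Scalarized form: cursors are copied into named locals at entry, advanced
/// locally for the whole burst, and written back once at the end.
fn burst_scalar(src: [&[u8]; 4], cursors: &mut [usize; 4], out: &mut Vec<u8>, steps: usize) {
    let [s0, s1, s2, s3] = src;
    let [mut c0, mut c1, mut c2, mut c3] = *cursors;
    for _ in 0..steps {
        out.push(s0[c0]);
        c0 += 1;
        out.push(s1[c1]);
        c1 += 1;
        out.push(s2[c2]);
        c2 += 1;
        out.push(s3[c3]);
        c3 += 1;
    }
    // Single write-back: the caller observes the same final cursors as the array form.
    *cursors = [c0, c1, c2, c3];
}
```

Both functions produce the same output bytes and final cursors, which mirrors the byte-identical requirement the PR checks with its cross-validation tests.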

coderabbitai · 2026-06-21T09:53:46Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 23f441cb-cde8-40bd-b557-461386a631cf

📥 Commits

Reviewing files that changed from the base of the PR and between 9dc916c and c34bc4a.

📒 Files selected for processing (1)

zstd/src/huff0/huff0_decoder.rs

📝 Walkthrough

Walkthrough

The PR optimizes two components of the HUF decode pipeline. In huff0_decoder.rs, build_table_from_weights replaces its per-symbol table fill with a counting-sort pre-pass and length-ranked unsafe bulk store patterns. In literals_section_decoder.rs, run_4stream_burst_loop removes the burst_decode_symbols helper, scalarizes all per-stream state into locals, and introduces local macro_rules! for decode, reload, and writeback.

Changes

HUF Decode Pipeline Optimization

Layer / File(s)	Summary
Counting-sort table fill in `build_table_from_weights` `zstd/src/huff0/huff0_decoder.rs`	Introduces a `sorted` symbol array filled via `sym_off` prefix-sum cursors, then fills `packed_decode` by iterating code-length ranks and dispatching on `run_len` to specialized unsafe contiguous store strategies via `packed64`. Coverage tracking changes to `count * run_len` per group, and the post-fill safety comment is updated to reference the rank-walk partition invariant.
4-stream burst decode scalarized to locals and local macros `zstd/src/decoding/literals_section_decoder.rs`	Removes `burst_decode_symbols` helper. Replaces array-based per-stream state with scalar locals (`b0..b3`, `ip0..ip3`, `nbl0..nbl3`, `c0..c3`, `src0..src3`) and introduces `decode1!`, `reload1!`, `burst!`, and `writeback!` local macro rules. The burst loop dispatches on `symbols_per_burst` for compile-time unrolling, and post-burst reload/cursor commit/writeback are all rewritten in terms of scalar locals.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

perf(decode + encode-greedy): close 3-5× donor gap on negative-level decompress; share SIMD primitives + add dedicated greedy strategy #178: Directly implements the HUF 4-stream burst decompress refactoring identified as the primary performance lever, including inlining the burst decoding path with macro helpers and scalar locals, plus related lookup-table optimization in the HUF decoder.

Possibly related PRs

structured-world/structured-zstd#201: Modifies the same 4-stream HUF burst decoding hot path in literals_section_decoder.rs, including per-stream state/bits reload and writeback mechanics directly overlapping with this PR's burst-loop refactor.
structured-world/structured-zstd#243: Modifies both run_4stream_burst_loop reload/writeback logic and HuffmanTable::build_table_from_weights packed table representation in the same areas touched by this PR.
structured-world/structured-zstd#403: Refactors HuffmanTable::build_table_from_weights decode lookup-table fill logic in huff0_decoder.rs, directly overlapping with the counting-sort table fill changes in this PR.

Poem

🐇 Four streams of bits, I hop along,
No arrays to slow my little paws!
Scalar locals carry me strong,
decode1! fires without a pause.
The rank-sort table fills with grace —
Each symbol finds its sorted place. ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically summarizes the two main performance optimizations: HUF table-fill by code-length group and scalarization of 4-stream burst decoding, directly matching the primary changes in the changeset.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


codecov · 2026-06-21T09:57:03Z

Codecov Report

❌ Patch coverage is 99.15966% with 1 line in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
zstd/src/huff0/huff0_decoder.rs	98.71%	1 Missing ⚠️


greptile-apps · 2026-06-21T09:57:09Z

Greptile Summary

This PR delivers two independent instruction-count reductions on the HUF decode hot path: a counting-sort + group-fill rewrite of the Huffman decode-table builder (build_decoder), and full scalarisation of the 4-stream burst loop state from arrays/references into named register-resident locals.

Group fill (huff0_decoder.rs): replaces the per-symbol variable-run-length while loop with a prefix-sum counting sort followed by a match run_len { 1 | 2 | 4 | 8 | _ } dispatch that emits specialised straight-line stores per group, mirroring the upstream zstd HUF_readDTableX1_wksp shape.
Scalarised burst (literals_section_decoder.rs): the previous [u64; 4] / [usize; 4] per-stream arrays (including the cursors reference that blocked register promotion) are replaced by b0..b3, ip0..ip3, nbl0..nbl3, c0..c3, and src0..src3 locals; burst_decode_symbols is removed in favour of three lightweight macros (decode1!, reload1!, burst!). Cursor write-back is moved unconditionally before the any_iter early-return (a semantic no-op since the locals are initialised from the input array).

Confidence Score: 5/5

Safe to merge; byte-identical output is verified by the 23/23 cross-validation suite and 823/823 unit tests, and both changes are mechanically equivalent rewrites of well-understood hot-path patterns.

Both changes are refactors of existing hot-path logic with no semantic changes to the decode protocol. The group-fill correctly replicates the per-symbol table assignments (the rank_start indexing, prefix-sum sym_off, and sorted counting-sort all line up). The scalarisation preserves the reload order, sentinel composition, and writeback of bits_consumed = nbl + max_num_bits. No new unsafe invariants are introduced beyond those already present; all existing guards (filled == table_size, per-group run assert, min_ip >= bytes_per_iter_upper) are retained.

No files require special attention; the fallback _ arm in the match run_len block in huff0_decoder.rs relies on run_len being a power-of-two >= 16 (only a debug_assert!), which holds by construction but is worth keeping in mind if the surrounding bit-length arithmetic ever changes.

Important Files Changed

Filename	Overview
zstd/src/huff0/huff0_decoder.rs	Replaces per-symbol run-fill with a counting-sort + group-fill shape (porting upstream zstd HUF_readDTableX1_wksp). Logic is correct: prefix-sum builds sym_off, counting sort places symbols in canonical order, and the match on run_len emits specialised straight-line stores per group. One minor structural concern in the fallback arm (see comment).
zstd/src/decoding/literals_section_decoder.rs	Scalarises the 4-stream burst state from arrays/references into named locals (b0..b3, ip0..ip3, nbl0..nbl3, c0..c3, src0..src3), replacing the old burst_decode_symbols generic function with decode1!/reload1!/burst!/writeback! macros. Semantics are preserved: cursors are written back before the any_iter early return (a no-op when no iter ran), and the writeback macro reuses the captured src* slice references correctly.

Reviews (2): Last reviewed commit: "test(huff0): assert fallback-arm run_len..."

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/huff0/huff0_decoder.rs`:
- Around line 881-905: The fallback arm (the underscore match arm) in the
decoder assumes that run_len is at least 16 and is divisible by 16 due to the
explicit match cases handling 1, 2, 4, and 8, but this precondition is not
explicitly validated. Add a debug_assert at the beginning of the fallback match
arm that checks both conditions: that run_len is greater than or equal to 16 AND
that run_len is evenly divisible by 16. This will help catch any future
refactoring that changes the match structure and accidentally violates these
assumptions.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 30d5cc75-230c-4a9c-8c12-686cb40fe240

📥 Commits

Reviewing files that changed from the base of the PR and between 0155a21 and 9dc916c.

📒 Files selected for processing (2)

zstd/src/decoding/literals_section_decoder.rs
zstd/src/huff0/huff0_decoder.rs

Guard the 16-entry-per-iteration fallback arm with a debug_assert documenting its invariant (run_len >= 16 and divisible by 16), so a future refactor that changes the 1/2/4/8 match arms trips loudly instead of writing past slot bounds.
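A minimal sketch of what such a guard looks like, using safe slice stores in place of the PR's unsafe bulk stores. `fill_run`, the `u16` entry type, and the chunked fallback are assumptions for illustration and do not reproduce the actual match in huff0_decoder.rs.

```rust
/// Hypothetical helper: writes `entry` into `run_len` consecutive slots starting at `start`.
fn fill_run(table: &mut [u16], start: usize, run_len: usize, entry: u16) {
    let dst = &mut table[start..start + run_len];
    match run_len {
        1 => dst[0] = entry,
        2 => dst.copy_from_slice(&[entry; 2]),
        4 => dst.copy_from_slice(&[entry; 4]),
        8 => dst.copy_from_slice(&[entry; 8]),
        _ => {
            // The fallback stores 16 entries per iteration, so it is only
            // correct for run lengths that are non-zero multiples of 16.
            debug_assert!(
                run_len >= 16 && run_len % 16 == 0,
                "fallback arm expects run_len >= 16 and divisible by 16, got {run_len}"
            );
            for chunk in dst.chunks_exact_mut(16) {
                chunk.copy_from_slice(&[entry; 16]);
            }
        }
    }
}
```

In this safe version a violated invariant would leave slots unfilled rather than write out of bounds; with raw pointer stores the same mistake could overrun the slot, which is why the assert documents the assumption at the top of the arm.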

polaz added 3 commits June 21, 2026 00:49

style(decode): iterate HUF rank prefix sum (clippy needless_range_loop)

9dc916c

coderabbitai Bot reviewed Jun 21, 2026

Comment thread zstd/src/huff0/huff0_decoder.rs

polaz merged commit 198ae74 into main Jun 21, 2026
28 checks passed

polaz deleted the perf/huf-dtable-build branch June 21, 2026 13:19

sw-release-bot Bot mentioned this pull request Jun 21, 2026

chore: release v0.0.43 #438

Open

polaz mentioned this pull request Jun 21, 2026

perf(decode): branchy fused decodeSequence on the AVX2 short arm #439

Merged

polaz merged 4 commits into main from perf/huf-dtable-build