perf(decode): HUF table-fill by code-length group + scalarize 4-stream burst#437
Conversation
Port the upstream zstd HUF_readDTableX1_wksp fill shape: counting-sort the symbols by code length, then fill the decode table one length GROUP at a time instead of one symbol at a time. Within a group the run length (1 << (max_bits - len)) and the entry nbBits are constant, so a match on the run length hoists out of the symbol loop and the optimiser emits a straight-line specialised store per group (1/2 scalar, 4/8/16+ via 64-bit broadcast). The old per-symbol form recomputed a variable run length and used a runtime-trip-count while loop that blocked unrolling.
The 4-stream literal burst held its per-stream state (bits, ip, nb_bits_last, cursors) in [_; 4] arrays. cursors arrived through a &mut [usize; 4] reference (caller-owned), so it could never be promoted to registers; every decoded symbol did a memory RMW on cursors[s], and the array bits/ip reloaded from the stack too. Profiling z000033 decode showed the burst at ~23.6 instructions per literal symbol versus libzstd's hand-written 4X1 fast loop at ~1.1 — a ~20x gap, the single largest decodecorpus decode divergence. Scalarize all four per-stream registers into named locals (b0..b3, ip0..ip3, nbl0..nbl3, c0..c3) so the optimiser keeps them in registers across the whole burst+reload, matching the upstream register-resident loop. Cursors are copied in at entry and written back before the drain. Byte-identical output (823 lib + 23 cross-validation tests).
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThe PR optimizes two components of the HUF decode pipeline. In ChangesHUF Decode Pipeline Optimization
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
| Filename | Overview |
|---|---|
| zstd/src/huff0/huff0_decoder.rs | Replaces per-symbol run-fill with a counting-sort + group-fill shape (porting upstream zstd HUF_readDTableX1_wksp). Logic is correct: prefix-sum builds sym_off, counting sort places symbols in canonical order, and the match on run_len emits specialised straight-line stores per group. One minor structural concern in the fallback arm (see comment). |
| zstd/src/decoding/literals_section_decoder.rs | Scalarises the 4-stream burst state from arrays/references into named locals (b0..b3, ip0..ip3, nbl0..nbl3, c0..c3, src0..src3), replacing the old burst_decode_symbols generic function with decode1!/reload1!/burst!/writeback! macros. Semantics are preserved: cursors are written back before the any_iter early return (a no-op when no iter ran), and the writeback macro reuses the captured src* slice references correctly. |
Reviews (2): Last reviewed commit: "test(huff0): assert fallback-arm run_len..." | Re-trigger Greptile
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/huff0/huff0_decoder.rs`:
- Around line 881-905: The fallback arm (the underscore match arm) in the
decoder assumes that run_len is at least 16 and is divisible by 16 due to the
explicit match cases handling 1, 2, 4, and 8, but this precondition is not
explicitly validated. Add a debug_assert at the beginning of the fallback match
arm that checks both conditions: that run_len is greater than or equal to 16 AND
that run_len is evenly divisible by 16. This will help catch any future
refactoring that changes the match structure and accidentally violates these
assumptions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 30d5cc75-230c-4a9c-8c12-686cb40fe240
📒 Files selected for processing (2)
zstd/src/decoding/literals_section_decoder.rszstd/src/huff0/huff0_decoder.rs
Guard the 16-entry-per-iteration fallback arm with a debug_assert documenting its invariant (run_len >= 16 and divisible by 16), so a future refactor that changes the 1/2/4/8 match arms trips loudly instead of writing past slot bounds.
Summary
Two structural instruction-count reductions on the decodecorpus decode hot path, validated cycle-for-cycle on i9-9900K (z000033, perf stat, flat c_ffi control).
HUF decode-table fill by code-length group. Port the upstream zstd
HUF_readDTableX1_wkspfill shape: counting-sort the symbols by code length, then fill the table one length GROUP at a time. Within a group the run length and entrynbBitsare constant, so amatchon the run length hoists out of the symbol loop and the optimiser emits a straight-line specialised store per group (1/2 scalar, 4/8/16+ via 64-bit broadcast). The previous per-symbol form recomputed a variable run length in a runtime-trip-count loop that blocked unrolling.build_decoderself-time drops 9.6% to 2.4%.Scalarize the 4-stream HUF burst registers. The literal burst held its per-stream state in
[_; 4]arrays;cursorsarrived through a&mut [usize; 4]reference (caller-owned) so it could never be promoted to registers, and every decoded symbol did a memory RMW. Scalarize all four streams into named locals so they stay register-resident across the whole burst+reload, matching the upstream register-resident loop.Measured impact (i9-9900K, z000033 decodecorpus, 8000 iters)
Net -7.2% wall on decodecorpus decode. Both changes are kernel-independent (scalar code), so the win carries to every CPU tier. Instruction-count driven (our IPC already exceeds libzstd on this path); the cycle/wall delta tracks the instruction reduction.
Testing
cargo nextest run -p structured-zstd --features std,hash,dict_builder: 823/823cargo nextest run -p ffi-bench --test cross_validation: 23/23 (byte-identical output)cargo clippy --features hash,std,dict_builder -- -D warnings: cleanOutput is byte-identical pre/post (decode tables + burst produce the same bytes).
Summary by CodeRabbit