ci(benchmark): paired interleaved A/B + bootstrap CI for drift-robust comparison#327
ci(benchmark): paired interleaved A/B + bootstrap CI for drift-robust comparison#327RealiCZ wants to merge 6 commits into
Conversation
… comparison Replace the sequential two-pass benchmark comparison (whole feature suite, then whole baseline suite, compared as two single point estimates) with a paired, order-interleaved A/B harness aggregated by a seeded bootstrap CI. - Build both commits' bench binaries up front (cargo bench --no-run), then run them back-to-back for R short rounds (A B / B A / ...) on the same runner so per-round paired deltas cancel the runner's slow drift. - Aggregate with mean Δ% + 95% bootstrap CI per bench; flag a change only when the CI excludes 0 AND |Δ| clears the bench's A/A noise floor. Stats ported verbatim from ARO's significance gate. - Add an A/A self-check (workflow_dispatch input) that compares HEAD to itself. - Keep the per-target matrix and the baseline-gap / regression tables; add CI and A/A-floor columns and a significance legend.
There was a problem hiding this comment.
Clean review
- Reviewed the full PR: paired-interleaved A/B harness (bench_paired.sh), bootstrap-CI aggregation (bench_compare.py), and the workflow rewiring (benchmark.yml).
- No new actionable findings. Methodology is sound (paired delta percent cancels common-mode drift; CI-excludes-0 AND absolute delta greater than A/A floor is the right two-gate rule); the AA self-check is wired in as a workflow_dispatch input and the PR descriptions empirical AA plus injected-shift table is convincing validation.
- Security boundaries preserved: bench_paired.sh is stashed to RUNNER_TEMP from the default-branch checkout before any git checkout of feature_sha, the bench job stays contents read / pull-requests read, and the compare job re-checks out the base repo so it always runs the trusted bench_compare.py. issues write added to prepare only for the rocket reaction, which is now best-effort wrapped in try/catch so a reaction failure no longer kills the bench + compare jobs.
- Correctness spot-checks: the jq filter on target.kind works (0 is truthy in jq, null is falsy); the odd/even round alternation matches the documented A B / B A pattern; for the AA dispatch feature_sha and baseline_sha both equal context.sha so the Build baseline step takes the cp short-circuit; bench group names in crates/mega-evm/benches/ do not collide across the seven targets, so load_rounds merging is safe.
- Still open from earlier reviews: none.
Address two review findings on the paired-interleaved harness: - BENCH_ROUNDS default 15 -> 16. The within-pair order alternates each round, so an odd count runs one side first one extra time, leaving an s/R order-effect residual in the mean (and in the A/A self-check that is supposed to read ~0). An even count balances feature-first / baseline-first exactly. bench_paired.sh now also rejects an odd or < 2 round count at runtime, so a misconfigured BENCH_ROUNDS fails fast instead of silently biasing. - noise_floor now compares only same-slot rounds (odd-with-odd, even-with-even) instead of consecutive rounds. Because the order alternates every round, the feature/baseline series are [first, second, first, ...]; consecutive deltas straddle the first-vs-second slot and measure the slot-order effect (which the paired mean already cancels), not random jitter. The straddling floor inflated to roughly the slot amplitude and could reclassify a genuine sub-5% change as noise-limited on a thermal/contended runner. Same-slot pairing keeps the floor measuring jitter.
🧬 Mutation testing — ✅ PASSNothing to test — no mutants were generated on the changed lines. |
🧬 Mutation testingNo results at |
Codecov Report✅ All modified and coverable lines are covered by tests. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
🧭 Re-review update — Clean
- Reviewed the incremental diff since my last review (fb05dcb → a230ae5): even-rounds default + runtime validation in
bench_paired.sh, slot-aware same-parity pairing inbench_compare.py:noise_floor, and theBENCH_ROUNDSenv bump 15 → 16 inbenchmark.yml. - No new actionable findings. The arithmetic on the new noise-floor checks out: at R=16 each phase has 8 same-slot rounds → 7 same-parity deltas per phase per parity → 28 magnitudes feeding the q90 floor, well above the
<2fallback threshold; the script-level rejection of odd/<2makes a misconfiguredBENCH_ROUNDSfail loud at the next round instead of silently biasing. - The script convention comment (odd = feature-first, even = baseline-first) is now explicit and cross-references
bench_compare.py:noise_floor, so the two stay coupled.
Still open from earlier reviews (human-authored, left unresolved for the author to close):
- BENCH_ROUNDS odd-bias finding — appears addressed by default 16 + runtime even/≥2 validation in
bench_paired.sh:34-37. noise_floorslot-vs-jitter finding — appears addressed by the same-parity pairing inbench_compare.py:195-200.
There was a problem hiding this comment.
Clean - no new actionable findings.
Reviewed the CI-only change: paired interleaved A/B benchmark harness (bench_paired.sh), bootstrap-CI aggregator (bench_compare.py), and reworked benchmark.yml.
Design holds up on careful re-read: drift cancels via per-round pairing, the CI-excludes-zero and above-floor gate is sound, and the trust boundary is preserved (measurement script stashed before PR checkout; bench job stays read-only; compare re-checks out the default branch).
A few small things I considered and decided not to flag: quantile clamp at hi >= n is unreachable under valid inputs but harmless defense; BOOTSTRAP_ITERS=0 makes bootstrap_ci collapse to (mean, mean) so any nonzero mean trivially excludes zero, but the default is 2000 and this is an explicit override; using the feature working tree for both binaries is a deliberate identical-inputs trade-off and is documented.
Still open from earlier reviews (both appear addressed by commit a230ae5; left to the original reviewer to confirm and resolve):
- Odd default BENCH_ROUNDS=15 biases the A/A oracle (#327 (comment)) — default is now 16, and bench_paired.sh:34 rejects odd/<2 counts at runtime.
- noise_floor conflates slot-order effect with jitter (#327 (comment)) — noise_floor now pairs same-parity rounds (same execution slot) at bench_compare.py lines 194-204, with a doc comment explaining the rationale.
vincent-k2026
left a comment
There was a problem hiding this comment.
Two benchmark comparison parser issues.
…on keys Address two aggregator review findings: - baseline_gap_section split the criterion name on the LAST '/', so a name with a per-row variant suffix (<group>/<spec>/<variant>, e.g. sstore_heavy/revm_pinned/existing_account_50 from register_all_suffixed) had its row parsed as the variant, leaving the group with no 'revm_pinned' row and silently dropping it from the table. Locate the spec segment by name (split_spec) instead of assuming position. - load_rounds keyed by bench name alone. All per-target artifacts are downloaded into one directory, so two targets emitting the same criterion name would overwrite each other. Key by (target, bench_name) end-to-end.
There was a problem hiding this comment.
🧭 Re-review update — ✅ Clean
- Scope this round: the aggregation fixes in
7220c7b—split_specin.github/scripts/bench_compare.pyfor suffixed criterion names, and keyingload_rounds/compareby(target, name)so per-target artifacts merged into one directory no longer overwrite on collisions. - Both fixes look correct.
split_specfinds the spec segment by name lookup, sosstore_heavy/revm_pinned/existing_account_50parses asgroup=sstore_heavy/existing_account_50, spec=revm_pinnedand keeps its baseline row.bootstrap_ciis now seeded per (target, name), andbaseline_gap_sectiongroups by (target, group), so the disambiguation is consistent end-to-end. - I scanned the seven bench crates for cross-target group-name collisions and found none today (
revm_bench/mega_benchetc. all use distinct group names), so the per-target key is future-proofing rather than fixing an active bug — non-blocking. - No new actionable findings.
Still open from earlier reviews (both marked outdated by GitHub, addressed by 7220c7b, safe for the author to resolve):
- baseline-gap suffixed-name parsing — addressed by
split_spec. - per-target aggregation key — addressed by keying
load_roundsby (target, bench_name).
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7220c7b084
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
The (target, bench_name) key stopped cross-target overwrite, but Comparison dropped the target again, so two targets sharing a criterion name would render as two identical rows (and the baseline-gap section as duplicate headers) with no way to tell which target the number belongs to. Carry target into Comparison and prefix it (target/name) in the regression table and the baseline-gap header ONLY when a name/group is shared across targets, so the common no-collision output is unchanged.
The bench job uploads path: rounds/, and for a comparison run only that path has files, so upload-artifact collapses the rounds/ prefix and the files land directly under the download root (all_rounds) — not all_rounds/rounds. The comparator pointed ROUNDS_DIR at all_rounds/rounds (created empty by mkdir), so every comparison posted 'No comparable benchmarks found'. Point ROUNDS_DIR at all_rounds and glob recursively so the files are found whether or not the prefix survives.
There was a problem hiding this comment.
✅ Clean
- Reviewed the paired-A/B benchmark harness (
.github/workflows/benchmark.yml,.github/scripts/bench_paired.sh,.github/scripts/bench_compare.py) at HEAD0c53252. - All 6 prior threads (odd rounds, slot-conflated noise floor,
baseline_gapsplit bug, target dropped in aggregation, output name collisions,ROUNDS_DIRmismatch) are addressed in the current code: even default with runtime reject inbench_paired.sh, same-parity noise floor,split_spec()locating the spec segment by name,(target, name)keys end-to-end, target-prefixed labels only on collision, and a recursive glob rooted atall_rounds. - No new actionable findings this round.
- Still open from earlier reviews: none.
Summary
The
/benchmarkharness compares two commits by running the whole feature suite, then the whole baseline suite, and diffing two single point estimates (--output-format bencher). The two passes are minutes apart, so the runner's slow drift — thermal throttling, frequency scaling, noisy neighbours — lands entirely on the feature-vs-baseline axis and reads as a ±2–5% phantom delta. That noise floor is large enough to invent regressions on code a PR never touched and to bury a genuine sub-5% optimization in the same band. It is not a sample-count problem (the heaviest benches already run with their tuned sample sizes) — it is drift between the two segments, which more samples within a segment cannot cancel.This replaces the sequential single-point comparison with the standard methodology for cross-commit microbenchmarking: paired, order-interleaved A/B with a bootstrap confidence interval.
cargo bench --no-run); no rebuilds happen during measurement.A B / B A / …), each with a short criterion config, so the per-round paired Δ% cancels the common-mode drift the two-segment layout could not.What this buys, concretely: drift — the fixable, dominant term in the old ±2–5% floor — is removed, so untouched rows stop showing phantom regressions, and the residual floor drops to each bench's intrinsic measurement jitter (≈1% on the larger benches, higher on sub-µs ones). Every verdict becomes self-describing: a flag means a change resolved above that bench's noise, a non-flag means the effect is genuinely below it — neither is drift masquerading as signal. A real sub-5% change becomes detectable wherever the bench's intrinsic floor permits, which the old drift-dominated floor effectively prevented.
This is a measurement-methodology change, not one tailored to any single result: no benchmark definition is added or modified (
crates/mega-evm/benches/*is untouched), only how two commits are compared — so every small-optimization PR benefits. The statistics (median / quantile / seeded bootstrap CI) are ported verbatim from ARO's significance gate. The per-target job matrix and the baseline-gap / regression tables are preserved; the verdict moves from a bare-% threshold to the CI-plus-floor gate, with added CI and A/A-floor columns and a significance legend. Rounds and per-round config (sample size / warm-up / measurement time) are tunable in the workflow's top-levelenv:, and thebenchjob timeout is raised to 180 min for headroom. Anaa_checkworkflow_dispatchinput runs the harness against HEAD versus itself as an on-runner noise-floor sanity check.Test plan
The A/A self-check is the correctness oracle: comparing a commit against itself must flag nothing, and the spread it then measures is the pure noise floor. It is wired in as a
workflow_dispatchinput (aa_check) and was validated locally by running the harness with the sametransactbench binary on both sides on an idle machine.Noise floor (A/A, same binary both sides, R = 20):
Every row reports |mean Δ| < 1% with a 95% CI that contains 0 — 0 significant changes, identically at R = 12 / 15 / 20. The old harness's ±2–5% drift floor collapses to ≈0% ± ~1% once the two sides are paired.
Detection power (same rounds; a known shift injected into the feature side only, so the effect is exact and the noise is unchanged):
Robustness of the gate: under deliberate heavy CPU contention, a CI-only rule produced borderline false 🚀 (|Δ| ≈ 1.8%, CI just past 0); the A/A floor reclassifies those as 👀 noise-limited (0 significant), which is why the gate is CI-excludes-0 and above-floor.
Note: GitHub runs
issue_comment/workflow_dispatchworkflows from the default branch, so the new harness takes effect after this merges; theaa_checkdispatch can then confirm the noise floor on the real CI runner.