Skip to content

ci(benchmark): paired interleaved A/B + bootstrap CI for drift-robust comparison#327

Open
RealiCZ wants to merge 6 commits into
mainfrom
cz/ci/paired-bench-bootstrap-ci
Open

ci(benchmark): paired interleaved A/B + bootstrap CI for drift-robust comparison#327
RealiCZ wants to merge 6 commits into
mainfrom
cz/ci/paired-bench-bootstrap-ci

Conversation

@RealiCZ

@RealiCZ RealiCZ commented Jun 29, 2026

Copy link
Copy Markdown
Collaborator

Summary

The /benchmark harness compares two commits by running the whole feature suite, then the whole baseline suite, and diffing two single point estimates (--output-format bencher). The two passes are minutes apart, so the runner's slow drift — thermal throttling, frequency scaling, noisy neighbours — lands entirely on the feature-vs-baseline axis and reads as a ±2–5% phantom delta. That noise floor is large enough to invent regressions on code a PR never touched and to bury a genuine sub-5% optimization in the same band. It is not a sample-count problem (the heaviest benches already run with their tuned sample sizes) — it is drift between the two segments, which more samples within a segment cannot cancel.

This replaces the sequential single-point comparison with the standard methodology for cross-commit microbenchmarking: paired, order-interleaved A/B with a bootstrap confidence interval.

  • Both commits' bench binaries are compiled up front (cargo bench --no-run); no rebuilds happen during measurement.
  • For R rounds, the feature and baseline binaries run back-to-back on the same runner (A B / B A / …), each with a short criterion config, so the per-round paired Δ% cancels the common-mode drift the two-segment layout could not.
  • Each bench's R paired Δ% are aggregated into a mean Δ% and a seeded 95% bootstrap CI; a change is flagged only when the CI excludes 0 AND |Δ| clears that bench's own A/A noise floor (its round-to-round jitter). Either gate alone is a known false-positive source.

What this buys, concretely: drift — the fixable, dominant term in the old ±2–5% floor — is removed, so untouched rows stop showing phantom regressions, and the residual floor drops to each bench's intrinsic measurement jitter (≈1% on the larger benches, higher on sub-µs ones). Every verdict becomes self-describing: a flag means a change resolved above that bench's noise, a non-flag means the effect is genuinely below it — neither is drift masquerading as signal. A real sub-5% change becomes detectable wherever the bench's intrinsic floor permits, which the old drift-dominated floor effectively prevented.

This is a measurement-methodology change, not one tailored to any single result: no benchmark definition is added or modified (crates/mega-evm/benches/* is untouched), only how two commits are compared — so every small-optimization PR benefits. The statistics (median / quantile / seeded bootstrap CI) are ported verbatim from ARO's significance gate. The per-target job matrix and the baseline-gap / regression tables are preserved; the verdict moves from a bare-% threshold to the CI-plus-floor gate, with added CI and A/A-floor columns and a significance legend. Rounds and per-round config (sample size / warm-up / measurement time) are tunable in the workflow's top-level env:, and the bench job timeout is raised to 180 min for headroom. An aa_check workflow_dispatch input runs the harness against HEAD versus itself as an on-runner noise-floor sanity check.

Test plan

The A/A self-check is the correctness oracle: comparing a commit against itself must flag nothing, and the spread it then measures is the pure noise floor. It is wired in as a workflow_dispatch input (aa_check) and was validated locally by running the harness with the same transact bench binary on both sides on an idle machine.

Noise floor (A/A, same binary both sides, R = 20):

Benchmark Δ (mean) 95% CI verdict
empty_transaction/equivalence -0.2% [-1.3%, +0.8%] within noise
empty_transaction/mini_rex +0.0% [-1.0%, +1.0%] within noise
empty_transaction/op_revm_latest -0.7% [-2.0%, +0.3%] within noise
empty_transaction/op_revm_pinned -0.5% [-1.4%, +0.4%] within noise
empty_transaction/revm_latest -0.8% [-1.9%, +0.2%] within noise
empty_transaction/revm_pinned +0.3% [-0.6%, +1.3%] within noise
empty_transaction/rex4 +0.1% [-0.5%, +0.7%] within noise

Every row reports |mean Δ| < 1% with a 95% CI that contains 0 — 0 significant changes, identically at R = 12 / 15 / 20. The old harness's ±2–5% drift floor collapses to ≈0% ± ~1% once the two sides are paired.

Detection power (same rounds; a known shift injected into the feature side only, so the effect is exact and the noise is unchanged):

  • +6% slower → all rows flagged ⚠️ regression (CIs ≈ [+4%, +7%]).
  • −8% faster → all rows flagged 🚀 improvement (CIs ≈ [−9%, −7%]).
  • −3% faster → 🚀 on the benches whose A/A floor is < 3%, 👀 noise-limited on the noisier sub-µs rows — i.e. it resolves a real sub-5% change exactly where the bench's precision allows, instead of drowning it in drift.

Robustness of the gate: under deliberate heavy CPU contention, a CI-only rule produced borderline false 🚀 (|Δ| ≈ 1.8%, CI just past 0); the A/A floor reclassifies those as 👀 noise-limited (0 significant), which is why the gate is CI-excludes-0 and above-floor.

Note: GitHub runs issue_comment / workflow_dispatch workflows from the default branch, so the new harness takes effect after this merges; the aa_check dispatch can then confirm the noise floor on the real CI runner.

… comparison

Replace the sequential two-pass benchmark comparison (whole feature suite,
then whole baseline suite, compared as two single point estimates) with a
paired, order-interleaved A/B harness aggregated by a seeded bootstrap CI.

- Build both commits' bench binaries up front (cargo bench --no-run), then run
  them back-to-back for R short rounds (A B / B A / ...) on the same runner so
  per-round paired deltas cancel the runner's slow drift.
- Aggregate with mean Δ% + 95% bootstrap CI per bench; flag a change only when
  the CI excludes 0 AND |Δ| clears the bench's A/A noise floor. Stats ported
  verbatim from ARO's significance gate.
- Add an A/A self-check (workflow_dispatch input) that compares HEAD to itself.
- Keep the per-target matrix and the baseline-gap / regression tables; add CI
  and A/A-floor columns and a significance legend.
@RealiCZ RealiCZ added spec:unchanged No change to any `mega-evm`'s behavior comp:misc Changes to the miscellaneous part of this repo api:unchanged No change to the public interface or API labels Jun 29, 2026

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean review

  • Reviewed the full PR: paired-interleaved A/B harness (bench_paired.sh), bootstrap-CI aggregation (bench_compare.py), and the workflow rewiring (benchmark.yml).
  • No new actionable findings. Methodology is sound (paired delta percent cancels common-mode drift; CI-excludes-0 AND absolute delta greater than A/A floor is the right two-gate rule); the AA self-check is wired in as a workflow_dispatch input and the PR descriptions empirical AA plus injected-shift table is convincing validation.
  • Security boundaries preserved: bench_paired.sh is stashed to RUNNER_TEMP from the default-branch checkout before any git checkout of feature_sha, the bench job stays contents read / pull-requests read, and the compare job re-checks out the base repo so it always runs the trusted bench_compare.py. issues write added to prepare only for the rocket reaction, which is now best-effort wrapped in try/catch so a reaction failure no longer kills the bench + compare jobs.
  • Correctness spot-checks: the jq filter on target.kind works (0 is truthy in jq, null is falsy); the odd/even round alternation matches the documented A B / B A pattern; for the AA dispatch feature_sha and baseline_sha both equal context.sha so the Build baseline step takes the cp short-circuit; bench group names in crates/mega-evm/benches/ do not collide across the seven targets, so load_rounds merging is safe.
  • Still open from earlier reviews: none.

Comment thread .github/scripts/bench_paired.sh
Comment thread .github/scripts/bench_compare.py
Address two review findings on the paired-interleaved harness:

- BENCH_ROUNDS default 15 -> 16. The within-pair order alternates each round,
  so an odd count runs one side first one extra time, leaving an s/R order-effect
  residual in the mean (and in the A/A self-check that is supposed to read ~0).
  An even count balances feature-first / baseline-first exactly. bench_paired.sh
  now also rejects an odd or < 2 round count at runtime, so a misconfigured
  BENCH_ROUNDS fails fast instead of silently biasing.
- noise_floor now compares only same-slot rounds (odd-with-odd, even-with-even)
  instead of consecutive rounds. Because the order alternates every round, the
  feature/baseline series are [first, second, first, ...]; consecutive deltas
  straddle the first-vs-second slot and measure the slot-order effect (which the
  paired mean already cancels), not random jitter. The straddling floor inflated
  to roughly the slot amplitude and could reclassify a genuine sub-5% change as
  noise-limited on a thermal/contended runner. Same-slot pairing keeps the floor
  measuring jitter.
@github-actions

Copy link
Copy Markdown

🧬 Mutation testing — ✅ PASS

Nothing to test — no mutants were generated on the changed lines.

@github-actions

Copy link
Copy Markdown

🧬 Mutation testing

No results at target/mutants/mutants.out — nothing was mutated (e.g. no mutatable changes).

@codecov

codecov Bot commented Jun 30, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.5%. Comparing base (5a2ea61) to head (0c53252).

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧭 Re-review update — Clean

  • Reviewed the incremental diff since my last review (fb05dcba230ae5): even-rounds default + runtime validation in bench_paired.sh, slot-aware same-parity pairing in bench_compare.py:noise_floor, and the BENCH_ROUNDS env bump 15 → 16 in benchmark.yml.
  • No new actionable findings. The arithmetic on the new noise-floor checks out: at R=16 each phase has 8 same-slot rounds → 7 same-parity deltas per phase per parity → 28 magnitudes feeding the q90 floor, well above the <2 fallback threshold; the script-level rejection of odd/<2 makes a misconfigured BENCH_ROUNDS fail loud at the next round instead of silently biasing.
  • The script convention comment (odd = feature-first, even = baseline-first) is now explicit and cross-references bench_compare.py:noise_floor, so the two stay coupled.

Still open from earlier reviews (human-authored, left unresolved for the author to close):

@RealiCZ RealiCZ requested a review from flyq June 30, 2026 08:02

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean - no new actionable findings.

Reviewed the CI-only change: paired interleaved A/B benchmark harness (bench_paired.sh), bootstrap-CI aggregator (bench_compare.py), and reworked benchmark.yml.

Design holds up on careful re-read: drift cancels via per-round pairing, the CI-excludes-zero and above-floor gate is sound, and the trust boundary is preserved (measurement script stashed before PR checkout; bench job stays read-only; compare re-checks out the default branch).

A few small things I considered and decided not to flag: quantile clamp at hi >= n is unreachable under valid inputs but harmless defense; BOOTSTRAP_ITERS=0 makes bootstrap_ci collapse to (mean, mean) so any nonzero mean trivially excludes zero, but the default is 2000 and this is an explicit override; using the feature working tree for both binaries is a deliberate identical-inputs trade-off and is documented.

Still open from earlier reviews (both appear addressed by commit a230ae5; left to the original reviewer to confirm and resolve):

  • Odd default BENCH_ROUNDS=15 biases the A/A oracle (#327 (comment)) — default is now 16, and bench_paired.sh:34 rejects odd/<2 counts at runtime.
  • noise_floor conflates slot-order effect with jitter (#327 (comment)) — noise_floor now pairs same-parity rounds (same execution slot) at bench_compare.py lines 194-204, with a doc comment explaining the rationale.

@vincent-k2026 vincent-k2026 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two benchmark comparison parser issues.

Comment thread .github/scripts/bench_compare.py Outdated
Comment thread .github/scripts/bench_compare.py Outdated
…on keys

Address two aggregator review findings:

- baseline_gap_section split the criterion name on the LAST '/', so a name with
  a per-row variant suffix (<group>/<spec>/<variant>, e.g.
  sstore_heavy/revm_pinned/existing_account_50 from register_all_suffixed) had
  its row parsed as the variant, leaving the group with no 'revm_pinned' row and
  silently dropping it from the table. Locate the spec segment by name (split_spec)
  instead of assuming position.
- load_rounds keyed by bench name alone. All per-target artifacts are downloaded
  into one directory, so two targets emitting the same criterion name would
  overwrite each other. Key by (target, bench_name) end-to-end.

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧭 Re-review update — ✅ Clean

  • Scope this round: the aggregation fixes in 7220c7bsplit_spec in .github/scripts/bench_compare.py for suffixed criterion names, and keying load_rounds / compare by (target, name) so per-target artifacts merged into one directory no longer overwrite on collisions.
  • Both fixes look correct. split_spec finds the spec segment by name lookup, so sstore_heavy/revm_pinned/existing_account_50 parses as group=sstore_heavy/existing_account_50, spec=revm_pinned and keeps its baseline row. bootstrap_ci is now seeded per (target, name), and baseline_gap_section groups by (target, group), so the disambiguation is consistent end-to-end.
  • I scanned the seven bench crates for cross-target group-name collisions and found none today (revm_bench / mega_bench etc. all use distinct group names), so the per-target key is future-proofing rather than fixing an active bug — non-blocking.
  • No new actionable findings.

Still open from earlier reviews (both marked outdated by GitHub, addressed by 7220c7b, safe for the author to resolve):

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7220c7b084

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread .github/workflows/benchmark.yml Outdated
RealiCZ added 2 commits July 2, 2026 09:56
The (target, bench_name) key stopped cross-target overwrite, but Comparison
dropped the target again, so two targets sharing a criterion name would render
as two identical rows (and the baseline-gap section as duplicate headers) with
no way to tell which target the number belongs to. Carry target into Comparison
and prefix it (target/name) in the regression table and the baseline-gap header
ONLY when a name/group is shared across targets, so the common no-collision
output is unchanged.
The bench job uploads path: rounds/, and for a comparison run only that path
has files, so upload-artifact collapses the rounds/ prefix and the files land
directly under the download root (all_rounds) — not all_rounds/rounds. The
comparator pointed ROUNDS_DIR at all_rounds/rounds (created empty by mkdir),
so every comparison posted 'No comparable benchmarks found'. Point ROUNDS_DIR
at all_rounds and glob recursively so the files are found whether or not the
prefix survives.

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Clean

  • Reviewed the paired-A/B benchmark harness (.github/workflows/benchmark.yml, .github/scripts/bench_paired.sh, .github/scripts/bench_compare.py) at HEAD 0c53252.
  • All 6 prior threads (odd rounds, slot-conflated noise floor, baseline_gap split bug, target dropped in aggregation, output name collisions, ROUNDS_DIR mismatch) are addressed in the current code: even default with runtime reject in bench_paired.sh, same-parity noise floor, split_spec() locating the spec segment by name, (target, name) keys end-to-end, target-prefixed labels only on collision, and a recursive glob rooted at all_rounds.
  • No new actionable findings this round.
  • Still open from earlier reviews: none.

@RealiCZ RealiCZ requested a review from vincent-k2026 July 2, 2026 02:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api:unchanged No change to the public interface or API comp:misc Changes to the miscellaneous part of this repo spec:unchanged No change to any `mega-evm`'s behavior

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants