ci(benchmark): paired interleaved A/B + bootstrap CI for drift-robust comparison by RealiCZ · Pull Request #327 · megaeth-labs/mega-evm

RealiCZ · 2026-06-29T06:36:14Z

Summary

The /benchmark harness compares two commits by running the whole feature suite, then the whole baseline suite, and diffing two single point estimates (--output-format bencher). The two passes are minutes apart, so the runner's slow drift — thermal throttling, frequency scaling, noisy neighbours — lands entirely on the feature-vs-baseline axis and reads as a ±2–5% phantom delta. That noise floor is large enough to invent regressions on code a PR never touched and to bury a genuine sub-5% optimization in the same band. It is not a sample-count problem (the heaviest benches already run with their tuned sample sizes) — it is drift between the two segments, which more samples within a segment cannot cancel.

This replaces the sequential single-point comparison with the standard methodology for cross-commit microbenchmarking: paired, order-interleaved A/B with a bootstrap confidence interval.

Both commits' bench binaries are compiled up front (cargo bench --no-run); no rebuilds happen during measurement.
For R rounds, the feature and baseline binaries run back-to-back on the same runner (A B / B A / …), each with a short criterion config, so the per-round paired Δ% cancels the common-mode drift the two-segment layout could not.
Each bench's R paired Δ% are aggregated into a mean Δ% and a seeded 95% bootstrap CI; a change is flagged only when the CI excludes 0 AND |Δ| clears that bench's own A/A noise floor (its round-to-round jitter). Either gate alone is a known false-positive source.

What this buys, concretely: drift — the fixable, dominant term in the old ±2–5% floor — is removed, so untouched rows stop showing phantom regressions, and the residual floor drops to each bench's intrinsic measurement jitter (≈1% on the larger benches, higher on sub-µs ones). Every verdict becomes self-describing: a flag means a change resolved above that bench's noise, a non-flag means the effect is genuinely below it — neither is drift masquerading as signal. A real sub-5% change becomes detectable wherever the bench's intrinsic floor permits, which the old drift-dominated floor effectively prevented.

This is a measurement-methodology change, not one tailored to any single result: no benchmark definition is added or modified (crates/mega-evm/benches/* is untouched), only how two commits are compared — so every small-optimization PR benefits. The statistics (median / quantile / seeded bootstrap CI) are ported verbatim from ARO's significance gate. The per-target job matrix and the baseline-gap / regression tables are preserved; the verdict moves from a bare-% threshold to the CI-plus-floor gate, with added CI and A/A-floor columns and a significance legend. Rounds and per-round config (sample size / warm-up / measurement time) are tunable in the workflow's top-level env:, and the bench job timeout is raised to 180 min for headroom. An aa_check workflow_dispatch input runs the harness against HEAD versus itself as an on-runner noise-floor sanity check.

Test plan

The A/A self-check is the correctness oracle: comparing a commit against itself must flag nothing, and the spread it then measures is the pure noise floor. It is wired in as a workflow_dispatch input (aa_check) and was validated locally by running the harness with the same transact bench binary on both sides on an idle machine.

Noise floor (A/A, same binary both sides, R = 20):

Benchmark	Δ (mean)	95% CI	verdict
empty_transaction/equivalence	-0.2%	[-1.3%, +0.8%]	within noise
empty_transaction/mini_rex	+0.0%	[-1.0%, +1.0%]	within noise
empty_transaction/op_revm_latest	-0.7%	[-2.0%, +0.3%]	within noise
empty_transaction/op_revm_pinned	-0.5%	[-1.4%, +0.4%]	within noise
empty_transaction/revm_latest	-0.8%	[-1.9%, +0.2%]	within noise
empty_transaction/revm_pinned	+0.3%	[-0.6%, +1.3%]	within noise
empty_transaction/rex4	+0.1%	[-0.5%, +0.7%]	within noise

Every row reports |mean Δ| < 1% with a 95% CI that contains 0 — 0 significant changes, identically at R = 12 / 15 / 20. The old harness's ±2–5% drift floor collapses to ≈0% ± ~1% once the two sides are paired.

Detection power (same rounds; a known shift injected into the feature side only, so the effect is exact and the noise is unchanged):

+6% slower → all rows flagged ⚠️ regression (CIs ≈ [+4%, +7%]).
−8% faster → all rows flagged 🚀 improvement (CIs ≈ [−9%, −7%]).
−3% faster → 🚀 on the benches whose A/A floor is < 3%, 👀 noise-limited on the noisier sub-µs rows — i.e. it resolves a real sub-5% change exactly where the bench's precision allows, instead of drowning it in drift.

Robustness of the gate: under deliberate heavy CPU contention, a CI-only rule produced borderline false 🚀 (|Δ| ≈ 1.8%, CI just past 0); the A/A floor reclassifies those as 👀 noise-limited (0 significant), which is why the gate is CI-excludes-0 and above-floor.

Note: GitHub runs issue_comment / workflow_dispatch workflows from the default branch, so the new harness takes effect after this merges; the aa_check dispatch can then confirm the noise floor on the real CI runner.

… comparison Replace the sequential two-pass benchmark comparison (whole feature suite, then whole baseline suite, compared as two single point estimates) with a paired, order-interleaved A/B harness aggregated by a seeded bootstrap CI. - Build both commits' bench binaries up front (cargo bench --no-run), then run them back-to-back for R short rounds (A B / B A / ...) on the same runner so per-round paired deltas cancel the runner's slow drift. - Aggregate with mean Δ% + 95% bootstrap CI per bench; flag a change only when the CI excludes 0 AND |Δ| clears the bench's A/A noise floor. Stats ported verbatim from ARO's significance gate. - Add an A/A self-check (workflow_dispatch input) that compares HEAD to itself. - Keep the per-target matrix and the baseline-gap / regression tables; add CI and A/A-floor columns and a significance legend.

claude

Clean review

Reviewed the full PR: paired-interleaved A/B harness (bench_paired.sh), bootstrap-CI aggregation (bench_compare.py), and the workflow rewiring (benchmark.yml).
No new actionable findings. Methodology is sound (paired delta percent cancels common-mode drift; CI-excludes-0 AND absolute delta greater than A/A floor is the right two-gate rule); the AA self-check is wired in as a workflow_dispatch input and the PR descriptions empirical AA plus injected-shift table is convincing validation.
Security boundaries preserved: bench_paired.sh is stashed to RUNNER_TEMP from the default-branch checkout before any git checkout of feature_sha, the bench job stays contents read / pull-requests read, and the compare job re-checks out the base repo so it always runs the trusted bench_compare.py. issues write added to prepare only for the rocket reaction, which is now best-effort wrapped in try/catch so a reaction failure no longer kills the bench + compare jobs.
Correctness spot-checks: the jq filter on target.kind works (0 is truthy in jq, null is falsy); the odd/even round alternation matches the documented A B / B A pattern; for the AA dispatch feature_sha and baseline_sha both equal context.sha so the Build baseline step takes the cp short-circuit; bench group names in crates/mega-evm/benches/ do not collide across the seven targets, so load_rounds merging is safe.
Still open from earlier reviews: none.

Address two review findings on the paired-interleaved harness: - BENCH_ROUNDS default 15 -> 16. The within-pair order alternates each round, so an odd count runs one side first one extra time, leaving an s/R order-effect residual in the mean (and in the A/A self-check that is supposed to read ~0). An even count balances feature-first / baseline-first exactly. bench_paired.sh now also rejects an odd or < 2 round count at runtime, so a misconfigured BENCH_ROUNDS fails fast instead of silently biasing. - noise_floor now compares only same-slot rounds (odd-with-odd, even-with-even) instead of consecutive rounds. Because the order alternates every round, the feature/baseline series are [first, second, first, ...]; consecutive deltas straddle the first-vs-second slot and measure the slot-order effect (which the paired mean already cancels), not random jitter. The straddling floor inflated to roughly the slot amplitude and could reclassify a genuine sub-5% change as noise-limited on a thermal/contended runner. Same-slot pairing keeps the floor measuring jitter.

github-actions · 2026-06-30T07:52:29Z

🧬 Mutation testing — ✅ PASS

Nothing to test — no mutants were generated on the changed lines.

github-actions · 2026-06-30T07:53:20Z

🧬 Mutation testing

No results at target/mutants/mutants.out — nothing was mutated (e.g. no mutatable changes).

codecov · 2026-06-30T07:53:52Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.5%. Comparing base (5a2ea61) to head (0c53252).

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

claude

🧭 Re-review update — Clean

Reviewed the incremental diff since my last review (fb05dcb → a230ae5): even-rounds default + runtime validation in bench_paired.sh, slot-aware same-parity pairing in bench_compare.py:noise_floor, and the BENCH_ROUNDS env bump 15 → 16 in benchmark.yml.
No new actionable findings. The arithmetic on the new noise-floor checks out: at R=16 each phase has 8 same-slot rounds → 7 same-parity deltas per phase per parity → 28 magnitudes feeding the q90 floor, well above the <2 fallback threshold; the script-level rejection of odd/<2 makes a misconfigured BENCH_ROUNDS fail loud at the next round instead of silently biasing.
The script convention comment (odd = feature-first, even = baseline-first) is now explicit and cross-references bench_compare.py:noise_floor, so the two stay coupled.

Still open from earlier reviews (human-authored, left unresolved for the author to close):

BENCH_ROUNDS odd-bias finding — appears addressed by default 16 + runtime even/≥2 validation in bench_paired.sh:34-37.
noise_floor slot-vs-jitter finding — appears addressed by the same-parity pairing in bench_compare.py:195-200.

claude

Clean - no new actionable findings.

Reviewed the CI-only change: paired interleaved A/B benchmark harness (bench_paired.sh), bootstrap-CI aggregator (bench_compare.py), and reworked benchmark.yml.

Design holds up on careful re-read: drift cancels via per-round pairing, the CI-excludes-zero and above-floor gate is sound, and the trust boundary is preserved (measurement script stashed before PR checkout; bench job stays read-only; compare re-checks out the default branch).

A few small things I considered and decided not to flag: quantile clamp at hi >= n is unreachable under valid inputs but harmless defense; BOOTSTRAP_ITERS=0 makes bootstrap_ci collapse to (mean, mean) so any nonzero mean trivially excludes zero, but the default is 2000 and this is an explicit override; using the feature working tree for both binaries is a deliberate identical-inputs trade-off and is documented.

Still open from earlier reviews (both appear addressed by commit a230ae5; left to the original reviewer to confirm and resolve):

Odd default BENCH_ROUNDS=15 biases the A/A oracle (#327 (comment)) — default is now 16, and bench_paired.sh:34 rejects odd/<2 counts at runtime.
noise_floor conflates slot-order effect with jitter (#327 (comment)) — noise_floor now pairs same-parity rounds (same execution slot) at bench_compare.py lines 194-204, with a doc comment explaining the rationale.

vincent-k2026

Two benchmark comparison parser issues.

…on keys Address two aggregator review findings: - baseline_gap_section split the criterion name on the LAST '/', so a name with a per-row variant suffix (<group>/<spec>/<variant>, e.g. sstore_heavy/revm_pinned/existing_account_50 from register_all_suffixed) had its row parsed as the variant, leaving the group with no 'revm_pinned' row and silently dropping it from the table. Locate the spec segment by name (split_spec) instead of assuming position. - load_rounds keyed by bench name alone. All per-target artifacts are downloaded into one directory, so two targets emitting the same criterion name would overwrite each other. Key by (target, bench_name) end-to-end.

claude

🧭 Re-review update — ✅ Clean

Scope this round: the aggregation fixes in 7220c7b — split_spec in .github/scripts/bench_compare.py for suffixed criterion names, and keying load_rounds / compare by (target, name) so per-target artifacts merged into one directory no longer overwrite on collisions.
Both fixes look correct. split_spec finds the spec segment by name lookup, so sstore_heavy/revm_pinned/existing_account_50 parses as group=sstore_heavy/existing_account_50, spec=revm_pinned and keeps its baseline row. bootstrap_ci is now seeded per (target, name), and baseline_gap_section groups by (target, group), so the disambiguation is consistent end-to-end.
I scanned the seven bench crates for cross-target group-name collisions and found none today (revm_bench / mega_bench etc. all use distinct group names), so the per-target key is future-proofing rather than fixing an active bug — non-blocking.
No new actionable findings.

Still open from earlier reviews (both marked outdated by GitHub, addressed by 7220c7b, safe for the author to resolve):

baseline-gap suffixed-name parsing — addressed by split_spec.
per-target aggregation key — addressed by keying load_rounds by (target, bench_name).

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7220c7b084

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

The (target, bench_name) key stopped cross-target overwrite, but Comparison dropped the target again, so two targets sharing a criterion name would render as two identical rows (and the baseline-gap section as duplicate headers) with no way to tell which target the number belongs to. Carry target into Comparison and prefix it (target/name) in the regression table and the baseline-gap header ONLY when a name/group is shared across targets, so the common no-collision output is unchanged.

The bench job uploads path: rounds/, and for a comparison run only that path has files, so upload-artifact collapses the rounds/ prefix and the files land directly under the download root (all_rounds) — not all_rounds/rounds. The comparator pointed ROUNDS_DIR at all_rounds/rounds (created empty by mkdir), so every comparison posted 'No comparable benchmarks found'. Point ROUNDS_DIR at all_rounds and glob recursively so the files are found whether or not the prefix survives.

claude

✅ Clean

Reviewed the paired-A/B benchmark harness (.github/workflows/benchmark.yml, .github/scripts/bench_paired.sh, .github/scripts/bench_compare.py) at HEAD 0c53252.
All 6 prior threads (odd rounds, slot-conflated noise floor, baseline_gap split bug, target dropped in aggregation, output name collisions, ROUNDS_DIR mismatch) are addressed in the current code: even default with runtime reject in bench_paired.sh, same-parity noise floor, split_spec() locating the spec segment by name, (target, name) keys end-to-end, target-prefixed labels only on collision, and a recursive glob rooted at all_rounds.
No new actionable findings this round.
Still open from earlier reviews: none.

RealiCZ requested review from Troublor, flyq and vincent-k2026 as code owners June 29, 2026 06:36

RealiCZ added spec:unchanged No change to any `mega-evm`'s behavior comp:misc Changes to the miscellaneous part of this repo api:unchanged No change to the public interface or API labels Jun 29, 2026

claude Bot reviewed Jun 29, 2026

View reviewed changes

RealiCZ mentioned this pull request Jun 30, 2026

perf(mega-evm): sload — ARO byte-identical optimization (+4.48%) #326

Open

flyq reviewed Jun 30, 2026

View reviewed changes

Comment thread .github/scripts/bench_paired.sh

Comment thread .github/scripts/bench_compare.py

Merge branch 'main' into cz/ci/paired-bench-bootstrap-ci

151102e

claude Bot reviewed Jun 30, 2026

View reviewed changes

RealiCZ requested a review from flyq June 30, 2026 08:02

claude Bot reviewed Jun 30, 2026

View reviewed changes

flyq approved these changes Jun 30, 2026

View reviewed changes

vincent-k2026 reviewed Jul 2, 2026

View reviewed changes

Comment thread .github/scripts/bench_compare.py Outdated

Comment thread .github/scripts/bench_compare.py Outdated

claude Bot reviewed Jul 2, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jul 2, 2026

View reviewed changes

Comment thread .github/workflows/benchmark.yml Outdated

RealiCZ added 2 commits July 2, 2026 09:56

claude Bot reviewed Jul 2, 2026

View reviewed changes

RealiCZ requested a review from vincent-k2026 July 2, 2026 02:36

Uh oh!

Conversation

RealiCZ commented Jun 29, 2026

Summary

Test plan

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 30, 2026

🧬 Mutation testing — ✅ PASS

Uh oh!

github-actions Bot commented Jun 30, 2026

🧬 Mutation testing

Uh oh!

codecov Bot commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Uh oh!

vincent-k2026 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov Bot commented Jun 30, 2026 •

edited

Loading