Add cluster-bootstrap 95% CIs for the perturbation recall tables #103
dangng2004 wants to merge 2 commits
ci_recall.py reports the §5 recall tables with 95% bootstrap CIs over papers (resampling unit = paper, pooled recall = sum(detected)/sum(injected)). Point estimates match the perturbation scorer. Also gitignores generated figures under perturbation/plots/. Split out of the AUC CI PR. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Move paper sets, result dirs, and table specs into a local JSON config (--config), matching the conference-study ci_auc.py pattern. Split the pooled-recall + cluster-bootstrap math into pure paper_rows/boot_ci helpers and cover them in tests/test_ci_recall.py (in-memory counts, no fixtures). Empty cells render as NO DATA, and an unknown table kind errors out instead of raising deep in a renderer. Output is unchanged on real data. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
joe32140 left a comment
Three substantive items from review — see inline comments. Math and tests look right; these are about silent-failure surfaces and design choices worth either guarding against or documenting.
    if method == "reviewer3":
        p = root / f"full_{base_of(domain)}_reviewer3" / "reviewer3" / etype / "reviewer3" / paper / "score" / score / f"{paper}_score.json"
    else:
        p = root / domain / model / etype / method / paper / "score" / score / f"{paper}_score.json"
    if not p.exists():
        return None
    return json.loads(p.read_text())
Silent-failure surface for reviewer3 results. The full_{base}_reviewer3 path layout is hardcoded here, and load_score returns None on a miss. If that layout ever drifts (rename, different nesting, etc.), every reviewer3 cell silently degrades to NO DATA without any error — and unit tests can't catch it because they use in-memory counts.
Suggest a small guard in load_cell or cell_ci: if a configured cell produces zero loaded papers, warn (or hard-fail). That way path drift surfaces immediately instead of as a blank cell in the table.
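A minimal sketch of such a guard, under assumptions: `check_cell_loaded`, its arguments, and the `strict` switch are hypothetical names for illustration, not part of `ci_recall.py`. The caller would pass the number of papers whose score files were actually found for a configured cell.

```python
import warnings


def check_cell_loaded(label: str, n_papers_loaded: int, strict: bool = False) -> None:
    """Warn (or raise, if strict) when a configured cell loaded zero papers.

    Hypothetical helper: `label` identifies the cell in the message and
    `n_papers_loaded` is the count of papers whose score JSON was found.
    """
    if n_papers_loaded > 0:
        return
    msg = f"cell {label!r}: 0 papers loaded; check the result-dir layout"
    if strict:
        # Hard-fail mode: path drift stops the run instead of printing NO DATA.
        raise FileNotFoundError(msg)
    warnings.warn(msg, RuntimeWarning, stacklevel=2)
```

Calling this once per configured cell, right after loading, would turn a drifted path into a visible warning while leaving genuinely empty cells renderable as NO DATA.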
    line = f"{cell['label']:24s}"
    for c in cats:
        (p, lo, hi, d, i), n = cell_ci(config, root, cell["slug"], cell["method"], c)
        line += " | " + ("NO DATA" if n == 0 else f"{p*100:4.1f}[{lo*100:.1f},{hi*100:.1f}]n{n}")
Two different NO DATA signals for the same condition. run_by_model (via fmt on line 125) decides from i == 0 (no injected errors), while run_by_category decides here from n == 0 (no papers contributed). They happen to agree today because paper_rows enforces injected > 0, but two signals for the same condition is a footgun if either invariant ever changes.
Suggest routing both through fmt using n == 0 — it's the more direct "no papers contributed" signal and removes the inline formatter on this line.
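One possible shape for a shared formatter, as a sketch only: `fmt_cell` and its `show_n` flag are hypothetical, and the number format is copied from the inline expression in `run_by_category`. The existing `fmt` may take different arguments.

```python
def fmt_cell(p: float, lo: float, hi: float, n: int, show_n: bool = True) -> str:
    """Render one table cell; n == 0 (no papers contributed) is the only NO DATA signal."""
    if n == 0:
        return "NO DATA"
    text = f"{p*100:4.1f}[{lo*100:.1f},{hi*100:.1f}]"
    # by_category appends the contributing paper count; by_model could pass show_n=False.
    return text + (f"n{n}" if show_n else "")
```

With both renderers calling the same function, the NO DATA decision lives in one place and no longer depends on `paper_rows` keeping `injected > 0`.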
    inj = np.array([r[1] for r in rows], float)
    point = det.sum() / inj.sum()
    n = len(rows)
    idx = RNG.integers(0, n, size=(B, n))
Consumption-order seeding is a reproducibility footgun. The docstring (lines 17-19) honestly flags it, but it's worth surfacing here: because every cell pulls from the one module-level RNG, adding or reordering tables in the config silently shifts every later cell's CI. Someone editing the config to add a new table can unintentionally change the historical numbers for unrelated rows.
If the recall tables are meant to be stable across config-touching PRs, consider per-cell seeding, e.g. rng = np.random.default_rng(hash((slug, method, category))) passed into boot_ci. Otherwise this is a deliberate design choice — fine to keep, but worth a sentence in benchmarks/perturbation/README.md (or the script's --help) so future config editors aren't surprised.
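One caveat if per-cell seeding is adopted: Python's built-in `hash()` on strings is salted per process unless `PYTHONHASHSEED` is fixed, and it can return negative values, which `np.random.default_rng` rejects. A sketch of a stable alternative follows; `cell_seed` and `base_seed` are hypothetical names, and the seed is derived from a SHA-256 digest of the cell key.

```python
import hashlib


def cell_seed(slug: str, method: str, category: str, base_seed: int = 42) -> int:
    """Derive a stable, non-negative 64-bit seed from a cell's identity.

    Hypothetical helper: the same (slug, method, category) always maps to the
    same seed, independent of process, platform, or config ordering.
    """
    # Join with a separator that should not occur in the fields themselves.
    key = "\x1f".join((str(base_seed), slug, method, category))
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")
```

The result could be passed as `np.random.default_rng(cell_seed(slug, method, category))`, assuming `boot_ci` is changed to accept a generator argument. Adding or reordering tables would then leave every existing cell's CI unchanged.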
Summary
Split out of #98 (which is now AUC-only). This PR modifies benchmark code: it adds 95% confidence intervals to the perturbation recall tables via a cluster bootstrap over papers. The paper (https://arxiv.org/abs/2606.19749) gives useful context.
Method
The resampling unit is the paper, not the individual error, because errors injected into the same paper are scored together and are correlated. Each of the 5000 draws resamples papers with replacement and recomputes the pooled recall (total detected divided by total injected); the CI is the 2.5th and 97.5th percentiles of those draws (seed 42). Point estimates match the perturbation scorer, so the CIs sit around numbers that are already trusted (e.g. GPT-5.5 progressive = 571/797 = 71.6%).
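The procedure above can be sketched in pure standard-library Python. This is an illustration under assumptions, not the code in `ci_recall.py` (which uses NumPy): `pooled_recall` and `cluster_bootstrap_ci` are hypothetical names, and each row is a per-paper `(detected, injected)` pair with `injected > 0`.

```python
import random
import statistics


def pooled_recall(rows):
    """Pooled recall: total detected divided by total injected across papers."""
    detected = sum(d for d, _ in rows)
    injected = sum(i for _, i in rows)
    return detected / injected


def cluster_bootstrap_ci(rows, draws=5000, seed=42):
    """Return (point, lo, hi): pooled recall and a 95% percentile CI.

    Whole papers are resampled with replacement, so errors from the same
    paper always enter or leave a draw together.
    """
    rng = random.Random(seed)
    n = len(rows)
    stats = []
    for _ in range(draws):
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        stats.append(pooled_recall(sample))
    # n=40 gives cut points every 2.5%; the first and last are 2.5% and 97.5%.
    cuts = statistics.quantiles(stats, n=40, method="inclusive")
    return pooled_recall(rows), cuts[0], cuts[-1]
```

When every paper has the same recall, each resample gives the same pooled value and the interval collapses to a point, which is the degenerate case the unit tests describe.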
Functions → paper tables
Each table kind produces one paper format:

| kind | function |
| --- | --- |
| by_model | run_by_model |
| by_category | run_by_category |

Representative table outputs
These reproduce the paper. A cell with no runs prints NO DATA rather than a confusing nan.
Recall per model and method
Recall by error category, best backend per system (n is the contributing paper count)
Tests
The config and result JSONs are large and gitignored, so the runner needs local data and isn't executable in a fresh clone. Correctness is covered without data by tests/test_ci_recall.py, which exercises paper_rows and the bootstrap on tiny in-memory count tables (per-paper aggregation, category filtering, a degenerate equal-recall case that collapses the CI, and an exact-bounds case under a fixed seed).
Files
- ci_recall.py: new, config-driven recall + CI runner (two kinds: by_model / by_category). Paper sets, result dirs, and table specs live in a local JSON config (--config).
- tests/test_ci_recall.py: unit tests on in-memory data
- benchmarks/perturbation/.gitignore: also gitignores generated figures under plots/