Add cluster-bootstrap 95% CIs for the perturbation recall tables #103

Open
dangng2004 wants to merge 2 commits into main from feat/recall-cis
@dangng2004 dangng2004 commented Jun 24, 2026

Contributor

Summary

Split out of #98 (which is now AUC-only). This PR modifies benchmark code; the paper (https://arxiv.org/abs/2606.19749) provides useful context. It adds 95% confidence intervals to the perturbation recall tables via a cluster bootstrap over papers.

Method

The resampling unit is the paper, not the individual error, because errors injected into the same paper are scored together and are therefore correlated. Each of 5000 draws resamples papers with replacement and recomputes the pooled recall (total detected divided by total injected); the CI is the 2.5th and 97.5th percentiles of those draws (seed 42). Point estimates match the perturbation scorer, so the CIs sit around numbers that are already trusted (e.g. GPT-5.5 progressive = 571/797 = 71.6%).
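A minimal sketch of the resampling described above, assuming one (detected, injected) row per paper with injected > 0. The name `cluster_boot_ci` and its arguments are illustrative and are not the PR's `boot_ci` signature.

```python
import numpy as np


def cluster_boot_ci(rows, n_boot=5000, seed=42, alpha=0.05):
    """Pooled recall with a percentile CI, resampling whole papers.

    rows: list of (detected, injected) pairs, one per paper (assumed layout,
          each with injected > 0). Returns (point, lo, hi); NaN when empty.
    """
    if not rows:
        return float("nan"), float("nan"), float("nan")
    det = np.array([r[0] for r in rows], dtype=float)
    inj = np.array([r[1] for r in rows], dtype=float)
    point = det.sum() / inj.sum()

    rng = np.random.default_rng(seed)
    n = len(rows)
    # Each row of idx is one bootstrap draw: n paper indices, with replacement.
    idx = rng.integers(0, n, size=(n_boot, n))
    # Pool counts within each draw, then divide (a ratio of sums,
    # not a mean of per-paper recalls).
    boot = det[idx].sum(axis=1) / inj[idx].sum(axis=1)
    lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(point), float(lo), float(hi)


rows = [(8, 10), (3, 10), (5, 5), (0, 4)]  # toy per-paper counts
point, lo, hi = cluster_boot_ci(rows)
print(f"{point * 100:.1f}% [{lo * 100:.1f}, {hi * 100:.1f}]")
```

Because whole papers are resampled, a paper's errors always enter or leave a draw together, which is what preserves the within-paper correlation.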

Functions → paper tables

Each table kind produces one paper format:

| kind | function | paper table | what it shows |
| --- | --- | --- | --- |
| by_model | run_by_model | overall recall | recall per model and method (zero-shot / coarse / progressive), plus Reviewer3 |
| by_category | run_by_category | recall by error category | recall split by error category (Surface / Claim / Reasoning / Experimental) for the best backend per system |
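The kind-to-function mapping can be sketched as a small dispatch table that rejects unknown kinds up front. Only the kind names and the function names `run_by_model` / `run_by_category` come from this PR; the function bodies, the `config` argument, and the `run_table` helper are placeholders for illustration.

```python
def run_by_model(config):
    # Placeholder body; the real function renders the overall recall table.
    return "overall recall table"


def run_by_category(config):
    # Placeholder body; the real function renders the per-category table.
    return "recall by error category table"


DISPATCH = {"by_model": run_by_model, "by_category": run_by_category}


def run_table(kind, config):
    """Look up the renderer for a table kind, failing early on a typo."""
    try:
        runner = DISPATCH[kind]
    except KeyError:
        raise ValueError(
            f"unknown table kind {kind!r}; expected one of {sorted(DISPATCH)}"
        ) from None
    return runner(config)


print(run_table("by_model", {}))
```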

Representative table outputs

These reproduce the paper. A cell with no runs prints NO DATA rather than a confusing nan.

Recall per model and method (columns: zero-shot | coarse | progressive)

GPT-5.5                |  59.8% [52.4, 67.3]  (477/797) |            NO DATA             |  71.6% [65.6, 76.9]  (571/797)
Claude-Opus-4.7        |  54.3% [49.0, 59.5]  (433/797) |            NO DATA             |  68.1% [62.8, 73.8]  (530/778)
Grok-4.1-Fast          |  31.4% [26.9, 35.7]  (250/797) |  12.8% [ 9.6, 16.0]  (102/797) |  52.6% [46.5, 59.1]  (419/797)
DeepSeek-V4-Flash      |  31.0% [26.2, 35.6]  (247/797) |  20.7% [16.6, 25.0]  (165/797) |  55.8% [50.0, 61.1]  (431/772)
Qwen3.6-35B-A3B        |  31.4% [26.6, 36.6]  (250/797) |  18.2% [14.7, 21.8]  (145/797) |  45.4% [39.9, 51.2]  (362/797)
Gemini-3.1-Flash-Lite  |  14.7% [11.3, 18.0]  (117/797) |  12.3% [ 9.3, 15.1]  (98/797) |  33.0% [28.1, 37.2]  (263/797)
Reviewer3 (overall)   |  26.5% [20.6, 32.9]  (183/691)

Recall by error category, best backend per system (n is the contributing paper count)

cell                     | Overall | Experimental | Claim | Reasoning | Surface
coarse / DeepSeek-V4     | 20.7[16.6,24.9]n24 | 26.3[21.3,31.1]n12 | 19.6[12.8,26.6]n20 | 40.0[35.7,50.0]n2 | 16.0[8.5,24.2]n23
zero-shot / GPT-5.5      | 59.8[52.7,67.2]n24 | 67.0[50.0,80.8]n12 | 67.0[58.6,74.5]n20 | 60.0[57.1,66.7]n2 | 45.8[31.5,60.3]n23
OpenAIReview / GPT-5.5   | 71.6[65.9,77.0]n24 | 85.2[77.0,92.0]n12 | 83.3[75.2,89.9]n20 | 70.0[64.3,83.3]n2 | 47.3[33.8,60.9]n23
Reviewer3                | 26.5[20.2,32.8]n22 | 35.4[25.7,45.7]n11 | 28.1[18.6,38.3]n17 | 10.0[0.0,33.3]n2 | 18.8[11.2,27.7]n20

Tests

The config and result JSONs are large and gitignored, so the runner needs local data and isn't executable in a fresh clone. Correctness is covered without data by tests/test_ci_recall.py, which exercises paper_rows and the bootstrap on tiny in-memory count tables (per-paper aggregation, category filtering, a degenerate equal-recall case that collapses the CI, and an exact-bounds case under a fixed seed).

$ pytest tests/test_ci_recall.py -v

tests/test_ci_recall.py::test_paper_rows_sums_error_types_per_paper PASSED
tests/test_ci_recall.py::test_paper_rows_filters_by_category PASSED
tests/test_ci_recall.py::test_paper_rows_drops_zero_injected_paper PASSED
tests/test_ci_recall.py::test_boot_ci_point_estimate_and_counts PASSED
tests/test_ci_recall.py::test_boot_ci_pins_exact_bounds_under_fixed_seed PASSED
tests/test_ci_recall.py::test_boot_ci_degenerate_equal_recall_collapses_ci PASSED
tests/test_ci_recall.py::test_boot_ci_empty_is_nan PASSED
tests/test_ci_recall.py::test_dispatch_covers_the_two_kinds PASSED

8 passed
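To illustrate what a data-free check of the per-paper aggregation might look like, here is a sketch on a tiny in-memory count table. The record layout and the `paper_rows_sketch` signature are assumptions made for this example; the PR's actual `paper_rows` may differ.

```python
from collections import defaultdict


def paper_rows_sketch(counts, category=None):
    """Collapse per-(paper, error type) counts into one row per paper.

    counts: iterable of dicts with keys paper, category, detected, injected
            (a hypothetical layout, not the PR's schema).
    Returns a list of (detected, injected) tuples, one per contributing paper.
    """
    totals = defaultdict(lambda: [0, 0])
    for rec in counts:
        if category is not None and rec["category"] != category:
            continue
        totals[rec["paper"]][0] += rec["detected"]
        totals[rec["paper"]][1] += rec["injected"]
    # A paper with nothing injected says nothing about recall, so drop it.
    return [(d, i) for d, i in totals.values() if i > 0]


counts = [
    {"paper": "p1", "category": "Surface", "detected": 1, "injected": 2},
    {"paper": "p1", "category": "Claim", "detected": 2, "injected": 3},
    {"paper": "p2", "category": "Surface", "detected": 0, "injected": 0},
    {"paper": "p3", "category": "Claim", "detected": 4, "injected": 4},
]

print(paper_rows_sketch(counts))             # [(3, 5), (4, 4)]
print(paper_rows_sketch(counts, "Surface"))  # [(1, 2)]
```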

Files

  • ci_recall.py — new, config-driven recall + CI runner (two kinds: by_model / by_category). Paper sets, result dirs, and table specs live in a local JSON config (--config).
  • tests/test_ci_recall.py — unit tests on in-memory data
  • benchmarks/perturbation/.gitignore — also gitignores generated figures under plots/

ci_recall.py reports the §5 recall tables with 95% bootstrap CIs over
papers (resampling unit = paper, pooled recall = sum(detected)/sum(injected)).
Point estimates match the perturbation scorer. Also gitignores generated
figures under perturbation/plots/.

Split out of the AUC CI PR.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Move paper sets, result dirs, and table specs into a local JSON config
(--config), matching the conference-study ci_auc.py pattern. Split the
pooled-recall + cluster-bootstrap math into pure paper_rows/boot_ci
helpers and cover them in tests/test_ci_recall.py (in-memory counts, no
fixtures). Empty cells render as NO DATA, and an unknown table kind
errors out instead of raising deep in a renderer.

Output is unchanged on real data.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dangng2004 dangng2004 marked this pull request as ready for review June 27, 2026 02:05
@dangng2004 dangng2004 requested a review from joe32140 June 27, 2026 02:08

@joe32140 joe32140 left a comment

Contributor

Three substantive items from review — see inline comments. Math and tests look right; these are about silent-failure surfaces and design choices worth either guarding against or documenting.

Comment on lines +85 to +91
    if method == "reviewer3":
        p = root / f"full_{base_of(domain)}_reviewer3" / "reviewer3" / etype / "reviewer3" / paper / "score" / score / f"{paper}_score.json"
    else:
        p = root / domain / model / etype / method / paper / "score" / score / f"{paper}_score.json"
    if not p.exists():
        return None
    return json.loads(p.read_text())

Silent-failure surface for reviewer3 results. The full_{base}_reviewer3 path layout is hardcoded here, and load_score returns None on a miss. If that layout ever drifts (rename, different nesting, etc.), every reviewer3 cell silently degrades to NO DATA without any error — and unit tests can't catch it because they use in-memory counts.

Suggest a small guard in load_cell or cell_ci: if a configured cell produces zero loaded papers, warn (or hard-fail). That way path drift surfaces immediately instead of as a blank cell in the table.
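One possible shape for such a guard, as a sketch only: `check_loaded`, its `strict` flag, and the message text are hypothetical and would need adapting to how `load_cell` or `cell_ci` actually collect rows.

```python
import warnings


def check_loaded(label, rows, strict=False):
    """Flag a configured cell that loaded zero papers.

    label: human-readable cell name; rows: per-paper rows loaded for the cell.
    With strict=True the miss is an error; otherwise it is a warning.
    """
    if rows:
        return rows
    msg = f"cell {label!r} loaded 0 papers; check the result directory layout"
    if strict:
        raise FileNotFoundError(msg)
    warnings.warn(msg, RuntimeWarning, stacklevel=2)
    return rows


# A cell that found its score files passes through untouched.
rows = check_loaded("zero-shot / GPT-5.5", [(20, 33), (18, 31)])

# A cell whose path layout drifted now produces a visible warning.
empty = check_loaded("Reviewer3", [])
```

A warning keeps legitimately empty cells (such as a method that was never run for a model) usable, while `strict=True` suits CI runs where every configured cell is expected to have data.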

line = f"{cell['label']:24s}"
for c in cats:
    (p, lo, hi, d, i), n = cell_ci(config, root, cell["slug"], cell["method"], c)
    line += " | " + ("NO DATA" if n == 0 else f"{p*100:4.1f}[{lo*100:.1f},{hi*100:.1f}]n{n}")

Two different NO DATA signals for the same condition. run_by_model (via fmt on line 125) decides from i == 0 (no injected errors), while run_by_category decides here from n == 0 (no papers contributed). They happen to agree today because paper_rows enforces injected > 0, but two signals for the same condition is a footgun if either invariant ever changes.

Suggest routing both through fmt using n == 0 — it's the more direct "no papers contributed" signal and removes the inline formatter on this line.
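A sketch of a single formatter keyed on the paper count, so both table kinds share one NO DATA rule. The name `fmt_cell`, its argument order, and the exact layout string are assumptions; the PR's `fmt` may take different arguments.

```python
def fmt_cell(p, lo, hi, detected, injected, n_papers):
    """Render one table cell; NO DATA is decided only by the paper count."""
    if n_papers == 0:
        return "NO DATA"
    return (
        f"{p * 100:5.1f}% [{lo * 100:4.1f}, {hi * 100:4.1f}]"
        f"  ({detected}/{injected})"
    )


print(fmt_cell(0.716, 0.656, 0.769, 571, 797, 24))
print(fmt_cell(float("nan"), float("nan"), float("nan"), 0, 0, 0))
```

With one decision point, a future change to the `injected > 0` invariant in `paper_rows` cannot make the two tables disagree about which cells are empty.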

inj = np.array([r[1] for r in rows], float)
point = det.sum() / inj.sum()
n = len(rows)
idx = RNG.integers(0, n, size=(B, n))

Consumption-order seeding is a reproducibility footgun. The docstring (lines 17-19) honestly flags it, but it's worth surfacing here: because every cell pulls from the one module-level RNG, adding or reordering tables in the config silently shifts every later cell's CI. Someone editing the config to add a new table can unintentionally change the historical numbers for unrelated rows.

If the recall tables are meant to be stable across config-touching PRs, consider per-cell seeding, e.g. rng = np.random.default_rng(hash((slug, method, category))) passed into boot_ci. Otherwise this is a deliberate design choice — fine to keep, but worth a sentence in benchmarks/perturbation/README.md (or the script's --help) so future config editors aren't surprised.
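If per-cell seeding is adopted, one way to derive the seed is sketched below. It uses `hashlib` rather than the built-in `hash()`, since `hash()` on strings is salted per process unless `PYTHONHASHSEED` is fixed, and it can return negative values, which `np.random.default_rng` does not accept as a seed. The helper name `cell_rng` and the `base_seed` parameter are hypothetical.

```python
import hashlib

import numpy as np


def cell_rng(slug, method, category, base_seed=42):
    """Build an RNG whose stream depends only on the cell's identity.

    The seed is the first 8 bytes of a SHA-256 digest over the cell key,
    so it is stable across processes, machines, and config ordering.
    """
    key = f"{base_seed}|{slug}|{method}|{category}".encode("utf-8")
    seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return np.random.default_rng(seed)


# The same cell always gets the same stream, regardless of which other
# cells were computed before it.
a = cell_rng("gpt-5.5", "progressive", "Overall").integers(0, 24, size=5)
b = cell_rng("gpt-5.5", "progressive", "Overall").integers(0, 24, size=5)
print((a == b).all())  # True
```

The resulting generator would be passed into `boot_ci` in place of the module-level RNG; note that switching schemes changes the published CI bounds once, after which they stay fixed.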

@joe32140 joe32140 left a comment

added minor comments
