Add cluster-bootstrap 95% CIs for the perturbation recall tables by dangng2004 · Pull Request #103 · ChicagoHAI/OpenAIReview

dangng2004 · 2026-06-24T22:41:14Z

Summary

Split out of #98 (which is now AUC-only). This PR modifies benchmark code; the paper (https://arxiv.org/abs/2606.19749) gives useful context. The PR adds 95% confidence intervals to the perturbation recall tables via a cluster bootstrap over papers.

Method

The resampling unit is the paper, not the individual error, because errors injected into the same paper are scored together and are correlated. Each of the 5000 bootstrap draws resamples papers with replacement and recomputes the pooled recall (total detected divided by total injected); the CI is the 2.5th and 97.5th percentiles of those draws (seed 42). Point estimates match the perturbation scorer, so the CIs sit around numbers that are already trusted (e.g. GPT-5.5 progressive = 571/797 = 71.6%).
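A minimal sketch of the procedure described above, assuming each paper is reduced to a (detected, injected) pair. The function name `pooled_recall_ci`, its signature, and the sample counts are illustrative assumptions, not the PR's actual helper.

```python
import numpy as np


def pooled_recall_ci(rows, B=5000, seed=42):
    """Pooled recall with a percentile cluster-bootstrap CI over papers.

    rows: one (detected, injected) pair per paper, each with injected > 0.
    Returns (point, lo, hi).
    """
    det = np.array([r[0] for r in rows], dtype=float)
    inj = np.array([r[1] for r in rows], dtype=float)
    n = len(rows)
    point = det.sum() / inj.sum()

    rng = np.random.default_rng(seed)
    # Each row of idx is one bootstrap draw: n paper indices sampled with replacement.
    idx = rng.integers(0, n, size=(B, n))
    # Pooled recall per draw; a paper's errors enter or leave the sample together.
    recalls = det[idx].sum(axis=1) / inj[idx].sum(axis=1)
    lo, hi = np.percentile(recalls, [2.5, 97.5])
    return point, lo, hi


# Hypothetical per-paper counts, not taken from the benchmark.
rows = [(8, 10), (3, 10), (6, 8), (9, 12), (2, 10)]
point, lo, hi = pooled_recall_ci(rows)
print(f"{point * 100:.1f}% [{lo * 100:.1f}, {hi * 100:.1f}]")
```

Resampling whole papers keeps the within-paper correlation intact in every draw, which is why the interval is typically wider than one obtained by resampling individual errors.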

Functions → paper tables

Each table kind produces one paper format:

| `kind` | function | paper table | what it shows |
| --- | --- | --- | --- |
| `by_model` | `run_by_model` | overall recall | recall per model and method (zero-shot / coarse / progressive), plus Reviewer 3 |
| `by_category` | `run_by_category` | recall by error category | recall split by error category (Surface / Claim / Reasoning / Experimental) for the best backend per system |
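One way to wire the two kinds to their functions is a small dispatch table that rejects unknown kinds up front. This is a sketch only: the placeholder bodies, the `DISPATCH` name, and the `run_table` helper are assumptions for illustration, not the code in ci_recall.py.

```python
def run_by_model(config):
    # Placeholder body; the real function renders the overall recall table.
    return "overall recall"


def run_by_category(config):
    # Placeholder body; the real function renders recall by error category.
    return "recall by error category"


# One entry per table kind.
DISPATCH = {"by_model": run_by_model, "by_category": run_by_category}


def run_table(kind, config):
    """Look up the function for `kind`, failing early on an unknown kind."""
    try:
        handler = DISPATCH[kind]
    except KeyError:
        known = ", ".join(sorted(DISPATCH))
        raise ValueError(f"unknown table kind {kind!r}; expected one of: {known}") from None
    return handler(config)


print(run_table("by_model", config={}))
```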

Representative table outputs

These reproduce the paper's tables. A cell with no runs prints NO DATA rather than a confusing nan.

Recall per model and method (columns: zero-shot | coarse | progressive)

GPT-5.5                |  59.8% [52.4, 67.3]  (477/797) |            NO DATA             |  71.6% [65.6, 76.9]  (571/797)
Claude-Opus-4.7        |  54.3% [49.0, 59.5]  (433/797) |            NO DATA             |  68.1% [62.8, 73.8]  (530/778)
Grok-4.1-Fast          |  31.4% [26.9, 35.7]  (250/797) |  12.8% [ 9.6, 16.0]  (102/797) |  52.6% [46.5, 59.1]  (419/797)
DeepSeek-V4-Flash      |  31.0% [26.2, 35.6]  (247/797) |  20.7% [16.6, 25.0]  (165/797) |  55.8% [50.0, 61.1]  (431/772)
Qwen3.6-35B-A3B        |  31.4% [26.6, 36.6]  (250/797) |  18.2% [14.7, 21.8]  (145/797) |  45.4% [39.9, 51.2]  (362/797)
Gemini-3.1-Flash-Lite  |  14.7% [11.3, 18.0]  (117/797) |  12.3% [ 9.3, 15.1]  (98/797) |  33.0% [28.1, 37.2]  (263/797)
Reviewer3 (overall)   |  26.5% [20.6, 32.9]  (183/691)

Recall by error category, best backend per system (n is the contributing paper count)

cell                     | Overall | Experimental | Claim | Reasoning | Surface
coarse / DeepSeek-V4     | 20.7[16.6,24.9]n24 | 26.3[21.3,31.1]n12 | 19.6[12.8,26.6]n20 | 40.0[35.7,50.0]n2 | 16.0[8.5,24.2]n23
zero-shot / GPT-5.5      | 59.8[52.7,67.2]n24 | 67.0[50.0,80.8]n12 | 67.0[58.6,74.5]n20 | 60.0[57.1,66.7]n2 | 45.8[31.5,60.3]n23
OpenAIReview / GPT-5.5   | 71.6[65.9,77.0]n24 | 85.2[77.0,92.0]n12 | 83.3[75.2,89.9]n20 | 70.0[64.3,83.3]n2 | 47.3[33.8,60.9]n23
Reviewer3                | 26.5[20.2,32.8]n22 | 35.4[25.7,45.7]n11 | 28.1[18.6,38.3]n17 | 10.0[0.0,33.3]n2 | 18.8[11.2,27.7]n20

Tests

The config and result JSONs are large and gitignored, so the runner needs local data and isn't executable in a fresh clone. Correctness is covered without data by tests/test_ci_recall.py, which exercises paper_rows and the bootstrap on tiny in-memory count tables (per-paper aggregation, category filtering, a degenerate equal-recall case that collapses the CI, and an exact-bounds case under a fixed seed).

$ pytest tests/test_ci_recall.py -v

tests/test_ci_recall.py::test_paper_rows_sums_error_types_per_paper PASSED
tests/test_ci_recall.py::test_paper_rows_filters_by_category PASSED
tests/test_ci_recall.py::test_paper_rows_drops_zero_injected_paper PASSED
tests/test_ci_recall.py::test_boot_ci_point_estimate_and_counts PASSED
tests/test_ci_recall.py::test_boot_ci_pins_exact_bounds_under_fixed_seed PASSED
tests/test_ci_recall.py::test_boot_ci_degenerate_equal_recall_collapses_ci PASSED
tests/test_ci_recall.py::test_boot_ci_empty_is_nan PASSED
tests/test_ci_recall.py::test_dispatch_covers_the_two_kinds PASSED

8 passed
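For readers without the source, the aggregation behaviour those test names describe can be sketched as follows. The input layout (`counts`, `category_of`), the signature, and the error-type names are assumptions made for illustration; only the three behaviours (sum per paper, filter by category, drop zero-injected papers) come from the description above.

```python
def paper_rows(counts, category_of=None, category=None):
    """Collapse per-error-type counts into one (detected, injected) row per paper.

    counts: {paper: {error_type: (detected, injected)}}  (assumed layout)
    category_of: {error_type: category}, consulted only when `category` is given.
    """
    rows = []
    for paper, by_type in counts.items():
        det = inj = 0
        for etype, (d, i) in by_type.items():
            if category is not None and category_of.get(etype) != category:
                continue  # error type belongs to a different category
            det += d
            inj += i
        if inj > 0:  # a paper with nothing injected cannot contribute to recall
            rows.append((det, inj))
    return rows


# Hypothetical in-memory counts; error-type names are made up.
counts = {
    "paper_a": {"typo": (1, 2), "overclaim": (2, 3)},
    "paper_b": {"typo": (0, 4)},
    "paper_c": {"overclaim": (0, 0)},
}
category_of = {"typo": "Surface", "overclaim": "Claim"}

print(paper_rows(counts))                        # [(3, 5), (0, 4)]
print(paper_rows(counts, category_of, "Claim"))  # [(2, 3)]
```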

Files

ci_recall.py — new, config-driven recall + CI runner (two kinds: by_model / by_category). Paper sets, result dirs, and table specs live in a local JSON config (--config).
tests/test_ci_recall.py — unit tests on in-memory data
benchmarks/perturbation/.gitignore — also gitignores generated figures under plots/

ci_recall.py reports the §5 recall tables with 95% bootstrap CIs over papers (resampling unit = paper, pooled recall = sum(detected)/sum(injected)). Point estimates match the perturbation scorer. Also gitignores generated figures under perturbation/plots/. Split out of the AUC CI PR. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Move paper sets, result dirs, and table specs into a local JSON config (--config), matching the conference-study ci_auc.py pattern. Split the pooled-recall + cluster-bootstrap math into pure paper_rows/boot_ci helpers and cover them in tests/test_ci_recall.py (in-memory counts, no fixtures). Empty cells render as NO DATA, and an unknown table kind errors out instead of raising deep in a renderer. Output is unchanged on real data. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

joe32140

Three substantive items from review — see inline comments. Math and tests look right; these are about silent-failure surfaces and design choices worth either guarding against or documenting.

joe32140 · 2026-06-28T03:35:40Z

+    if method == "reviewer3":
+        p = root / f"full_{base_of(domain)}_reviewer3" / "reviewer3" / etype / "reviewer3" / paper / "score" / score / f"{paper}_score.json"
+    else:
+        p = root / domain / model / etype / method / paper / "score" / score / f"{paper}_score.json"
+    if not p.exists():
+        return None
+    return json.loads(p.read_text())


Silent-failure surface for reviewer3 results. The full_{base}_reviewer3 path layout is hardcoded here, and load_score returns None on a miss. If that layout ever drifts (rename, different nesting, etc.), every reviewer3 cell silently degrades to NO DATA without any error — and unit tests can't catch it because they use in-memory counts.

Suggest a small guard in load_cell or cell_ci: if a configured cell produces zero loaded papers, warn (or hard-fail). That way path drift surfaces immediately instead of as a blank cell in the table.
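A sketch of such a guard, assuming the loader yields one entry per configured paper with None for a missing file. The helper name `check_loaded` and the `strict` switch are hypothetical and do not appear in the PR.

```python
import warnings


def check_loaded(label, configured_papers, loaded_scores, strict=False):
    """Warn (or raise when strict) if a configured cell loaded no score files.

    loaded_scores: loader results, with None for each missing file.
    Returns the number of papers that actually loaded.
    """
    n_loaded = sum(score is not None for score in loaded_scores)
    if configured_papers and n_loaded == 0:
        msg = (f"{label}: 0 of {len(configured_papers)} configured papers loaded; "
               "check the result directory layout")
        if strict:
            raise FileNotFoundError(msg)
        warnings.warn(msg, stacklevel=2)
    return n_loaded


# One of two papers loaded: no warning, returns 1.
print(check_loaded("reviewer3 / overall", ["p1", "p2"], [{"detected": 3}, None]))
```

With `strict=True` the same condition raises instead of warning, which may suit CI runs where a blank cell should never pass unnoticed.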

joe32140 · 2026-06-28T03:35:40Z

+        line = f"{cell['label']:24s}"
+        for c in cats:
+            (p, lo, hi, d, i), n = cell_ci(config, root, cell["slug"], cell["method"], c)
+            line += " | " + ("NO DATA" if n == 0 else f"{p*100:4.1f}[{lo*100:.1f},{hi*100:.1f}]n{n}")


Two different NO DATA signals for the same condition. run_by_model (via fmt on line 125) decides from i == 0 (no injected errors), while run_by_category decides here from n == 0 (no papers contributed). They happen to agree today because paper_rows enforces injected > 0, but two signals for the same condition is a footgun if either invariant ever changes.

Suggest routing both through fmt using n == 0 — it's the more direct "no papers contributed" signal and removes the inline formatter on this line.
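A sketch of a single formatter keyed on the contributing paper count. The signature (in particular passing `n` into `fmt`) and the exact cell layout are assumptions; the real `fmt` may differ.

```python
def fmt(p, lo, hi, detected, injected, n):
    """Render one table cell; `n` is the number of contributing papers."""
    if n == 0:  # single NO DATA rule shared by every table kind
        return "NO DATA"
    return f"{p * 100:4.1f}% [{lo * 100:4.1f}, {hi * 100:4.1f}]  ({detected}/{injected})"


nan = float("nan")
print(fmt(0.716, 0.656, 0.769, 571, 797, n=24))  # 71.6% [65.6, 76.9]  (571/797)
print(fmt(nan, nan, nan, 0, 0, n=0))             # NO DATA
```

Both renderers would then call `fmt` and differ only in how they lay out the returned cells, so the NO DATA condition lives in one place.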

joe32140 · 2026-06-28T03:35:40Z

+    inj = np.array([r[1] for r in rows], float)
+    point = det.sum() / inj.sum()
+    n = len(rows)
+    idx = RNG.integers(0, n, size=(B, n))


Consumption-order seeding is a reproducibility footgun. The docstring (lines 17-19) honestly flags it, but it's worth surfacing here: because every cell pulls from the one module-level RNG, adding or reordering tables in the config silently shifts every later cell's CI. Someone editing the config to add a new table can unintentionally change the historical numbers for unrelated rows.

If the recall tables are meant to be stable across config-touching PRs, consider per-cell seeding from a stable key, e.g. rng = np.random.default_rng(zlib.crc32(f"{slug}|{method}|{category}".encode())) passed into boot_ci. (The built-in hash() is not suitable here: string hashes are salted per process, and a negative value is rejected by default_rng.) Otherwise this is a deliberate design choice: fine to keep, but worth a sentence in benchmarks/perturbation/README.md (or the script's --help) so future config editors aren't surprised.
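A sketch of per-cell seeding that stays reproducible across processes. It derives the seed from a CRC of the cell key rather than from Python's built-in `hash()`, because string hashes are salted per process unless PYTHONHASHSEED is fixed and can be negative, which `default_rng` does not accept. The helper name `cell_rng`, the key format, and the example slugs are assumptions.

```python
import zlib

import numpy as np

BASE_SEED = 42  # the seed quoted in the PR description


def cell_rng(slug, method, category):
    """Return a generator whose stream depends only on the cell's identity."""
    key = f"{slug}|{method}|{category}".encode("utf-8")
    # crc32 is a fixed function of the bytes, so the seed is the same in every
    # process; SeedSequence mixes it with the base seed.
    seq = np.random.SeedSequence([BASE_SEED, zlib.crc32(key)])
    return np.random.default_rng(seq)


# Hypothetical cell identifiers.
a = cell_rng("gpt-5.5", "progressive", "Overall").integers(0, 24, size=8)
b = cell_rng("gpt-5.5", "progressive", "Overall").integers(0, 24, size=8)
print(np.array_equal(a, b))  # True: same cell, same draws, regardless of call order
```

Each cell's generator would be passed into the bootstrap in place of the shared module-level RNG, so adding or reordering tables leaves existing cells' intervals unchanged.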

joe32140

added minor comments

dangng2004 marked this pull request as draft June 24, 2026 22:42

dangng2004 mentioned this pull request Jun 24, 2026

Add cluster-bootstrap 95% CIs for the paper's AUC tables #98

Merged

dangng2004 marked this pull request as ready for review June 27, 2026 02:05

dangng2004 requested a review from joe32140 June 27, 2026 02:08

joe32140 reviewed Jun 28, 2026

joe32140 approved these changes Jun 28, 2026

dangng2004 wants to merge 2 commits into main from feat/recall-cis
