feat(bench): CL-bench family gates (Context Learning + Continual/Codebase Adaptation) #180
Integrate Tencent/Fudan CL-bench (arXiv:2602.03587) as a router-only selector gate. CL-bench grades a model's answer to an in-context-knowledge task against expert rubrics. The official metric is binary (pass ALL rubrics, avg ~63/task), but the per-rubric pass count yields a CONTINUOUS score (fraction satisfied): the within-task graded variance that a verifier-grounded selector needs and that the pass/fail-deterministic benches (aec) lacked. The gate (modeled on humaneval-gate) runs two paired arms over the same tasks: random@K identical completions vs diverse@K strategy-lensed completions. It grades each completion with the benchmark's own rubric judge (an LLM that we run ourselves, so deployable but noisy; we therefore rank by the variance-reduced fraction rather than the binary, with judge model + temp pinned), verifier-selects by fraction, and reports paired-bootstrap lifts on BOTH the continuous fraction and the official binary. It writes a corpus RunRecord/task that `corpus-replay --selector=verifier` + `corpus-report` consume unchanged. Router-only (no sandbox); it fetches the public HF jsonl via curl|head so a smoke run pulls only the first N records. It fails loud on a malformed/empty task set or a failed judge parse (a real zero, never masked).
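For readers unfamiliar with the "paired-bootstrap lift" reported by the gate, here is a minimal sketch of the general technique, not the gate's actual code. The function name `pairedBootstrapLift`, the seeded PRNG, the iteration count, and the 95% percentile interval are all assumptions made for illustration. Each task contributes one score per arm, and resampling is done over the per-task differences so the pairing is preserved.

```typescript
// Hypothetical sketch: paired bootstrap over per-task score differences.
// All names and defaults here are illustrative assumptions, not the PR's code.

// Small seeded PRNG (mulberry32) so resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0
  return () => {
    a = (a + 0x6d2b79f5) >>> 0
    let t = a
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

interface Lift {
  mean: number // observed mean of (treatment - control)
  lo: number // 2.5th percentile of resampled means
  hi: number // 97.5th percentile of resampled means
}

function pairedBootstrapLift(
  treatment: number[],
  control: number[],
  iters = 2000,
  seed = 1,
): Lift {
  if (treatment.length !== control.length || treatment.length === 0) {
    throw new Error('arms must be paired (same tasks) and non-empty')
  }
  // Pairing: one difference per task, never mixing tasks across arms.
  const diffs = treatment.map((t, i) => t - (control[i] as number))
  const n = diffs.length
  const mean = diffs.reduce((s, d) => s + d, 0) / n
  const rand = mulberry32(seed)
  const means: number[] = []
  for (let b = 0; b < iters; b++) {
    let sum = 0
    // Resample n task-level differences with replacement.
    for (let i = 0; i < n; i++) sum += diffs[Math.floor(rand() * n)] as number
    means.push(sum / n)
  }
  means.sort((x, y) => x - y)
  const loIdx = Math.floor(0.025 * iters)
  const hiIdx = Math.min(iters - 1, Math.floor(0.975 * iters))
  return { mean, lo: means[loIdx] as number, hi: means[hiIdx] as number }
}
```

The same routine would apply to either metric: pass per-task fractions for the continuous lift, or 0/1 values for the official binary.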
…ed selector gate
Integrate pgasawa CL-Bench (arXiv:2606.05661) codebase_adaptation, the ONE of its six domains with a DEPLOYABLE checker (the rest grade against realized outcomes the agent never has = oracles). Its scorer applies the instance's provided test_patch and runs pytest in the instance's Docker image, keying off the exit code: an INDEPENDENT deployable check (tests ≠ answer), the clean analogue of the HumanEval gate and unlike the CL-bench Context gate where the rubric judge IS the metric.
Two pieces:
- scripts/clbench_codebase_judge.py: a thin bridge exposing CL-Bench's own `evaluate_submission(patch, instance)` as a standalone (instance_id, patch) -> verdict call (run in CL-Bench's venv). Verified to self-check: gold patch passes, empty fails.
- src/clbench-codebase-gate.mts: the gate. Each instance is SWE-bench format; the worker is a fault-isolated sandbox rollout (opencode clones repo@base_commit, fixes source, writes a diff read off the box FS). Two paired arms (random@K identical vs diverse@K strategy-lensed), verifierGroundedSelect by pytest-pass, paired-bootstrap lifts on blind/random@k/diverse@k/oracle@k, and a corpus RunRecord/task that corpus-replay --selector=verifier consumes. Infra-errored rollouts/judges are excluded, never scored 0.
Needs Docker + the CL-Bench images (`clbench setup codebase_adaptation`) for judging and a reachable sandbox for rollouts. Independent of the CL-bench (Context) gate (separate PR).
❌ Needs Work —
| | deepseek | glm | aggregate |
|---|---|---|---|
| Readiness | 34 | 76 | 34 |
| Confidence | 65 | 65 | 65 |
| Correctness | 34 | 76 | 34 |
| Security | 34 | 76 | 34 |
| Testing | 34 | 76 | 34 |
| Architecture | 34 | 76 | 34 |
Full multi-shot audit completed 1/1 planned shots over 1 changed files. Global verifier still owns final merge decision.
Blocking
🟣 CRITICAL diversifyMessages ignores lensSystem when no system role — diverse arm degenerates to uniform — bench/src/clbench-context-gate.mts
Line 114: `return [{ role: 'system', content: composeStrategies(baseSystem, 1)[0] as string }, ...messages]` ignores the `lensSystem` parameter entirely. `composeStrategies(base, 1)` always returns the FIRST strategy lens (`DIVERSE_STRATEGY_LENSES[0]`). For any CL-bench task whose `messages` array has no system role as the first message, ALL k diverse shots receive the IDENTICAL system message. This completely defeats the diversity mechanism on those tasks: the diverse@k arm produces no more variance than the random arm. Since CL-bench conversations may or may not have a system message (the benchmark's format varies), this silently corrupts a subset of r
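A minimal sketch of one possible fix for the finding above, assuming `lensSystem` already holds the per-shot lensed system prompt (as the call-site description suggests). The `ChatMessage` type and the way the lens is merged into an existing system turn are assumptions; the real function may combine them differently.

```typescript
// Hypothetical sketch, not the file's actual implementation.
type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string }

function diversifyMessages(messages: ChatMessage[], lensSystem: string): ChatMessage[] {
  const first = messages[0]
  if (first !== undefined && first.role === 'system') {
    // Assumption: prepend the lens to the task's own system turn.
    return [
      { role: 'system', content: `${lensSystem}\n\n${first.content}` },
      ...messages.slice(1),
    ]
  }
  // No system turn: use the per-shot lens that was passed in, rather than
  // recomputing a fixed first lens, so each diverse shot stays distinct.
  return [{ role: 'system', content: lensSystem }, ...messages]
}
```

With this shape, two shots given different lenses produce different system messages whether or not the task ships its own system turn.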
Other
🟠 MEDIUM Single API failure crashes entire benchmark run, losing all prior work — bench/src/clbench-context-gate.mts
Lines 262-268: the `pool()` helper uses `Promise.all` without per-item error handling. If any single `routerChatWithUsage` call (worker solve or judge grade) rejects after retries are exhausted (e.g. a persistent 500, or a malformed response that fails `JSON.parse` in `parseJudge`), the entire benchmark crashes. For a typical run (N=20, K=4: 160 worker calls + 160 judge calls = 320 API calls), a failure on call #319 discards all prior results. The file's design says 'fail loud', but per-try failures (vs fatal config errors) should be surfaced per-shot rather than aborting the full gate. Consider catching errors per-shot in the pool callback and returning a
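One way the per-shot error capture could look, as a sketch only. `poolSettled` and the `Settled` result type are hypothetical names; the file's real `pool()` signature is not shown in this review, so the concurrency model here is an assumption.

```typescript
// Hypothetical sketch: a bounded-concurrency pool that records failures
// per item instead of rejecting the whole batch.
type Settled<T> = { ok: true; value: T } | { ok: false; error: string }

async function poolSettled<I, O>(
  items: I[],
  concurrency: number,
  fn: (item: I, index: number) => Promise<O>,
): Promise<Settled<O>[]> {
  const results: Settled<O>[] = new Array(items.length)
  let next = 0
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++
      try {
        results[i] = { ok: true, value: await fn(items[i] as I, i) }
      } catch (err) {
        // Surface the failure on this shot only; other shots keep running.
        results[i] = { ok: false, error: err instanceof Error ? err.message : String(err) }
      }
    }
  }
  const width = Math.max(1, Math.min(concurrency, items.length))
  await Promise.all(Array.from({ length: width }, () => worker()))
  return results
}
```

The caller can then still fail loud, for example by reporting every `ok: false` entry and excluding those shots from scoring, without discarding the completed calls.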
🟠 MEDIUM diversifyMessages ignores lensSystem in no-system-turn branch — bench/src/clbench-context-gate.mts
Line 114: when `messages[0]?.role !== 'system'`, the else branch calls `composeStrategies(baseSystem, 1)[0]` instead of using the passed `lensSystem`. Since `composeStrategies(base, 1)` always returns a single-element array using `DIVERSE_STRATEGY_LENSES[0]`, ALL k diverse shots for a task without a system turn receive the IDENTICAL first strategy lens, defeating the diversity treatment. The call site (line 258) passes `lenses[s]`, which already carries the per-shot lens, but it's silently dropped. Impact: the diverse@k
🟠 MEDIUM readFileSync imported but never used — bench/src/clbench-context-gate.mts
Line 36: `import { existsSync, readFileSync } from 'node:fs'`. `readFileSync` is never called anywhere in the file. Dead import; should be removed.
🟡 LOW OFFSET env var not validated as non-negative integer — bench/src/clbench-context-gate.mts
Line 230: `offset = Number(process.env.OFFSET ?? 0)` has no validation. `N` and `K` are validated (lines 237-238) but OFFSET is not. A negative or non-integer offset produces `need = offset + limit`, which can cause `head -n` on GNU coreutils to interpret a negative count as 'all but last N lines', silently fetching a different task set than intended. Fix: add `if (!Number.isInteger(offset) || offset < 0) throw new Error(...)` alongside the N/K validation block.
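The suggested check can be factored into a small helper so OFFSET, N, and K share one validation path. This is a sketch; `parseNonNegativeInt` is a hypothetical name, and treating an unset or empty variable as the fallback is an assumption about the intended default.

```typescript
// Hypothetical helper: parse an env-style string as a non-negative integer.
function parseNonNegativeInt(raw: string | undefined, name: string, fallback = 0): number {
  // Assumption: an unset or blank variable means "use the default".
  if (raw === undefined || raw.trim() === '') return fallback
  const n = Number(raw)
  if (!Number.isInteger(n) || n < 0) {
    throw new Error(`${name} must be a non-negative integer, got ${JSON.stringify(raw)}`)
  }
  return n
}

// Intended usage inside the gate (shown as a comment to keep the block self-contained):
//   const offset = parseNonNegativeInt(process.env.OFFSET, 'OFFSET')
```

Rejecting the value up front keeps a bad OFFSET from ever reaching the `head -n` count.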
🟡 LOW Oracle metrics cover diverse arm only, not both arms — bench/src/clbench-context-gate.mts
Lines 291, 296: `fr.oracle` and `bin.oracle` compute the maximum over diverse shots only (`Math.max(...dFr)` and `dFr.some(...)`). The true oracle ceiling should be the max over ALL attempts (both random and diverse arms), since a random shot could pass while all diverse shots fail (e.g., the lens confuses the model). The label 'oracle@k' implies the ceiling of the k-attempt regime. Consider computing max across both arms: `Math.max(...rFr, ...dFr)`.
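A sketch of the cross-arm oracle described above. The `Attempt` shape and the `oracleAcrossArms` name are assumptions; the gate's real per-shot records are not shown in this review.

```typescript
// Hypothetical sketch: oracle ceiling over every graded attempt in both arms.
interface Attempt {
  fraction: number // share of rubrics satisfied, in [0, 1]
  allPass: boolean // official binary verdict for this attempt
}

function oracleAcrossArms(
  random: Attempt[],
  diverse: Attempt[],
): { fraction: number; allPass: boolean } {
  const all = [...random, ...diverse]
  if (all.length === 0) throw new Error('no graded attempts to take an oracle over')
  return {
    // Best continuous score achieved by any attempt, regardless of arm.
    fraction: Math.max(...all.map((a) => a.fraction)),
    // Binary oracle: did any attempt in either arm pass all rubrics?
    allPass: all.some((a) => a.allPass),
  }
}
```

If a per-arm ceiling is also wanted, the same function can be called with one arm left empty, which keeps the two definitions explicitly labeled rather than conflated.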
🟡 LOW Unused import: readFileSync — bench/src/clbench-context-gate.mts
Line 36: `readFileSync` is imported from `node:fs` but never referenced. Only `existsSync` (line 71) is used. Dead import will trigger lint if the Biome config catches unused imports. Fix: change to `import { existsSync } from 'node:fs'`.
🟡 LOW parseJudge JSON.parse throws on malformed judge output, contradicting doc comment — bench/src/clbench-context-gate.mts
Line 139: `JSON.parse(text.trim())` can throw if the judge LLM returns non-JSON. The doc comment on `judgeRubrics` (lines 149-150) says 'a judge API/parse failure is a real zero — surfaced, never masked', implying a zero-valued verdict is returned. Instead the exception propagates through `pool`, killing the entire benchmark run. This is actually consistent with the repo's fail-loud philosophy, but the comment is misleading. Either wrap in try/catch returning a zero verdict, or update the comment to match the throw b
🟡 LOW parseJudge allPass can be true while fraction < 1.0 — inconsistency — bench/src/clbench-context-gate.mts
Line 145: `const allPass = obj.all_pass === 1 || obj.all_pass === '1' || (status.length === rubricCount && yes === rubricCount && rubricCount > 0)`. When the judge returns an explicit `all_pass: 1` but a status array shorter than rubricCount (or with some 'no' entries), `allPass` is true while `fraction` may be < 1.0. These two fields describe the same verdict; they should be consistent. Consider deriving `allPass` solely from the per-rubric status list when available, only falling back to the explicit field when the status list is absent.
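A sketch of the derivation suggested above, so that `allPass` and `fraction` cannot disagree. `deriveVerdict` is a hypothetical name. It also assumes that per-rubric statuses are the strings 'yes'/'no', and that the fallback fraction is 1 or 0 when only the explicit flag is present; neither is confirmed by the file.

```typescript
// Hypothetical sketch: one source of truth for both verdict fields.
interface JudgeVerdict {
  fraction: number
  allPass: boolean
}

function deriveVerdict(
  status: unknown[] | undefined,
  explicitAllPass: unknown,
  rubricCount: number,
): JudgeVerdict {
  if (!Number.isInteger(rubricCount) || rubricCount <= 0) {
    throw new Error('rubricCount must be a positive integer')
  }
  if (Array.isArray(status) && status.length > 0) {
    // Status list present: derive BOTH fields from it and ignore the explicit flag.
    const yes = status.filter((s) => String(s).trim().toLowerCase() === 'yes').length
    return {
      fraction: Math.min(1, yes / rubricCount),
      allPass: status.length === rubricCount && yes === rubricCount,
    }
  }
  // Status list absent: fall back to the explicit flag (assumed 1 or '1' for a pass).
  const allPass = explicitAllPass === 1 || explicitAllPass === '1'
  return { fraction: allPass ? 1 : 0, allPass }
}
```

Under this rule an explicit `all_pass: 1` paired with a short or partly failing status list yields `allPass: false`, matching the fraction.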
tangletools · 2026-06-06T17:51:07Z · trace
tangletools
left a comment
❌ 1 Blocking Finding — d3e72e96
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-06-06T17:51:07Z · immutable trace
Premise check withheld merge —
…ap router models resolve
The in-box opencode agent validates `openai`/`anthropic` models against its registry, so `openai/deepseek-chat` failed with "Model not found" (producing an empty patch on every rollout). The `openai-compat` provider is the generic passthrough and does NOT validate the model name, so router-served cheap models (deepseek-chat, moonshotai/kimi-k2.6, glm) resolve in-box. Default the worker to openai-compat (override via WORKER_PROVIDER); verified that both deepseek and kimi write a file in a live sandbox rollout.
CL-bench family — two benchmarks, one PR
Integrates both CL-bench benchmarks as gates in `bench/`.
CL-bench (Context Learning): Tencent/Fudan, arXiv:2602.03587
`bench/src/clbench-context-gate.mts`: router-only. Rubric-graded LLM judge → continuous fraction (within-task signal) + the official binary. Two paired arms (random@K / diverse@K), verifier-select by fraction, paired-bootstrap lifts, writes a corpus. Role: leaderboard + compute/diversity instrument. Its rubric judge is the metric, so the "verifier" selection is partly circular (worker==judge confound when both are the same model; fleet runs split worker/judge to break it).
CL-Bench (Continual Learning): pgasawa, arXiv:2606.05661
`bench/src/clbench-codebase-gate.mts` + `bench/scripts/clbench_codebase_judge.py`. Of its six domains, only Codebase Adaptation is a deployable checker (applies the provided `test_patch` + runs pytest in the instance's Docker image, exit-code = pass; an independent check, the clean analogue of the HumanEval gate); the other five grade against realized outcomes (oracles). The judge bridge exposes CL-Bench's own `evaluate_submission(patch, instance)` standalone, self-verified (gold passes, empty fails). The worker is a fault-isolated sandbox rollout via `openSandboxRun` semantics; cheap models resolve in-box via the `openai-compat` provider (the fix for "Model not found: openai/deepseek-chat").
Verification
`pnpm typecheck` · `pnpm test` (green) · bench: `tsc --noEmit` clean. Both gates ran end-to-end on the live router/sandbox; fleet runs in progress for powered signals. Deliberately NOT in scope: the CL-Bench thesis integration (our runtime as a `system`, the stateful-vs-stateless gain metric), which is a fast follow.