feat(bench): research-leaderboard worker (router-RAG) + sandboxAgentRun env passthrough #186
research-gate.mts: off-sandbox research-bench leaderboard (model x web-search-provider x multi-shot) over the router -- provider-pinned /v1/search + web_fetch, then answer. Deep-cleaned onto the kernel primitives (routerChatWithUsage, runPool, appendRunRecord, adapter.judge); deleted the reinvented pool/corpus/sandbox backends. 424 -> 259 lines.
experiment.ts: sandboxAgentRun gains an optional env passthrough (merged onto OPENAI_*), letting a caller pin the in-box agent search provider (TANGLE_SEARCH_DEFAULT_PROVIDER). rsi.ts forwards SEARCH / EXA_API_KEY to the box via it.
Verified: tsc clean; SimpleQA you-arm reproduces 2/2 through the cleaned worker.
✅ No Blockers —
| | deepseek | glm | aggregate |
|---|---|---|---|
| Readiness | 69 | 76 | 69 |
| Confidence | 65 | 65 | 65 |
| Correctness | 69 | 76 | 69 |
| Security | 69 | 76 | 69 |
| Testing | 69 | 76 | 69 |
| Architecture | 69 | 76 | 69 |
Full multi-shot audit completed 1/1 planned shots over 3 changed files. Global verifier still owns final merge decision.
🟠 MEDIUM Router usage/cost data discarded when building corpus records — bench/src/research-gate.mts
Line 148: `routerChatWithUsage` returns `{ content, usage?, costUsd? }` with real token counts and derived cost. But lines 214-223 build `AttemptRecord` manually and never populate `costUsd`, `tokensIn`, or `tokensOut`. The corpus contract (corpus.ts:25-31) says these fields are "present only when the worker actually reported them"; the worker DID report them, and they are just being dropped. Downstream tools (corpus-report, eval gate) rely on cost-aware comparisons. Fix: capture `{ usage, costUsd }` from the `routerChatWithUsage` re…
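A minimal sketch of the suggested fix, assuming hypothetical shapes for the router result and the attempt record. The `usage` field names (`promptTokens`, `completionTokens`) and the `AttemptRecordSketch` type are illustrative assumptions, not the real kernel types:

```typescript
// Hypothetical shapes for illustration; the real types live in the bench kernel.
interface RouterUsage { promptTokens: number; completionTokens: number }
interface RouterChatResult { content: string; usage?: RouterUsage; costUsd?: number }
interface AttemptRecordSketch {
  answer: string
  costUsd?: number
  tokensIn?: number
  tokensOut?: number
}

// Copy cost/token fields only when the router actually reported them,
// so absent values stay absent instead of becoming 0.
function toAttemptRecord(res: RouterChatResult): AttemptRecordSketch {
  const rec: AttemptRecordSketch = { answer: res.content }
  if (res.costUsd !== undefined) rec.costUsd = res.costUsd
  if (res.usage !== undefined) {
    rec.tokensIn = res.usage.promptTokens
    rec.tokensOut = res.usage.completionTokens
  }
  return rec
}
```

Leaving the fields off when they were not reported keeps the "present only when the worker actually reported them" contract intact.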
🟠 MEDIUM Search failure produces contradictory model prompt — bench/src/research-gate.mts
Lines 102-138: When `useSearch` is true but the search HTTP call returns non-200, `context` stays empty (line 114: warn + fall through). At line 133, the system prompt still says "Use the WEB SEARCH RESULTS below (snippets + fetched page content) as your primary evidence; cite the source" because `useSearch` remains true. But at [line 138](https://github.com/tangle-network/agent-runtime/blob/521c8bff30
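One way to avoid the contradiction is to pick the prompt from the evidence that was actually gathered rather than from the requested mode. This is a sketch only: `buildSystemPrompt` is a hypothetical helper, and the fallback wording is an assumption.

```typescript
// Hypothetical helper: choose the system prompt from the evidence actually gathered,
// not from the requested mode.
function buildSystemPrompt(useSearch: boolean, context: string): string {
  const hasEvidence = useSearch && context.trim().length > 0
  if (hasEvidence) {
    return 'Use the WEB SEARCH RESULTS below (snippets + fetched page content) as your primary evidence; cite the source.'
  }
  // Search was skipped or failed: do not mention results that are not there.
  // (Fallback wording is an assumption for this sketch.)
  return 'No web search results are available for this question. Answer from your own knowledge.'
}
```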
🟠 MEDIUM research-gate AttemptRecords lack costUsd/tokensIn/tokensOut — all records are unmappable by corpus projection — bench/src/research-gate.mts
Lines 214-223: The `AttemptRecord` written by research-gate omits `costUsd`, `tokensIn`, `tokensOut`. The corpus projection (corpus.ts:203-210) requires all three and marks any attempt missing them as 'unmappable'. This means `benchRecordToCorpusRecords()` will reject every attempt from research-gate. If research-gate records are only consumed by corpus-report.mts (which reads raw RunRecords, not canonical projections), this is benign. But if anyone runs the canonical bridge on a research-gate corpus, it silently produces zero records. The RunRecord itself still writes fine via `appendRunRecord`; this is only a downstream-projection gap. `routerChatWithUsage` returns
🟡 LOW env spread can override router auth keys — bench/src/experiment.ts
Line 217: `env: { OPENAI_API_KEY: opts.routerKey, OPENAI_BASE_URL: opts.routerBaseUrl, ...opts.env }`. The caller-supplied `env` is spread ON TOP and can clobber `OPENAI_API_KEY` and `OPENAI_BASE_URL`. The JSDoc (line 201) documents this as "merged ON TOP", which is accurate, but there's no runtime guard against accidental override. A caller who unintentionally includes `OPENAI_API_KEY` in their env object would silently break router auth for that run. Fix: defensive check that `opts.env` doesn't contain auth-critical keys, or log a warning.
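A sketch of the defensive check, assuming a hypothetical `mergeBoxEnv` helper (not a function in this PR). It rejects reserved keys and applies the router values last, so they cannot be overridden either way; the provider value `'exa'` in the usage line is an illustrative assumption.

```typescript
// Keys the router auth depends on; a caller-supplied env must not replace them.
const RESERVED_KEYS = ['OPENAI_API_KEY', 'OPENAI_BASE_URL'] as const

// Hypothetical helper: merge the passthrough env, failing loudly on a reserved key.
function mergeBoxEnv(
  routerKey: string,
  routerBaseUrl: string,
  extra: Record<string, string> = {},
): Record<string, string> {
  const clash = RESERVED_KEYS.filter((k) => Object.prototype.hasOwnProperty.call(extra, k))
  if (clash.length > 0) {
    throw new Error(`env passthrough may not override: ${clash.join(', ')}`)
  }
  // Router values are written last, so they win even if the check above is relaxed.
  return { ...extra, OPENAI_API_KEY: routerKey, OPENAI_BASE_URL: routerBaseUrl }
}

// Usage: pinning the search provider is allowed; replacing the auth key is not.
const boxEnv = mergeBoxEnv('router-key', 'https://router.example/v1', {
  TANGLE_SEARCH_DEFAULT_PROVIDER: 'exa',
})
```

Throwing is the stricter option; a warning plus the same spread order would keep existing callers running while still protecting router auth.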
🟡 LOW Inconsistent trailing-slash handling on routerBaseUrl — bench/src/research-gate.mts
`routerChatWithUsage` (router-client.ts:32) strips a trailing slash: `cfg.routerBaseUrl.replace(/\/$/, '')`. But `runResearchShot` (line 106) and `fetchPage` (line 67) use raw `${cfg.routerBaseUrl}/search...` and `${cfg.routerBaseUrl}/search/mcp...`. A user-supplied `ROUTER_BASE` ending in `/` (e.g. `https://router.tangle.tools/v1/`) produces double-slash URLs (`//search`) in the research gate but not in the router-client, causing silent failures in one path but not the other. Fix: apply the same `replace(/\/$/, '')` guard to the base
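A small sketch of a shared normalizer that both paths could call. `normalizeBase` and `searchUrl` are hypothetical names, and the `q` query parameter is an assumption (only `provider` appears in this PR).

```typescript
// Strip every trailing slash so joins never yield "//search".
function normalizeBase(url: string): string {
  return url.replace(/\/+$/, '')
}

// Hypothetical URL builder; the `q` parameter name is an assumption.
function searchUrl(routerBaseUrl: string, provider: string, query: string): string {
  const base = normalizeBase(routerBaseUrl)
  return `${base}/search?provider=${encodeURIComponent(provider)}&q=${encodeURIComponent(query)}`
}

// Both spellings of the base produce the same URL.
const withSlash = searchUrl('https://router.tangle.tools/v1/', 'you', 'who wrote dune')
const withoutSlash = searchUrl('https://router.tangle.tools/v1', 'you', 'who wrote dune')
```

Using `/\/+$/` instead of `/\/$/` also covers a base that ends in more than one slash.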
🟡 LOW No test coverage for new research-gate.mts (249 lines) — bench/src/research-gate.mts
No tests found in bench/tests/ for research-gate, experiment, or rsi. The bench scripts appear to be integration-test-only (run against live router/sandbox), but the pure functions (`fetchPage`, `runResearchShot`'s query-extraction logic, the search sentinel check) are unit-testable. The query-extraction line `task.prompt.split('\n').find(l => l.trim().length > 0)` (line 105) has an edge case: if the first non-empty line is a metadata header or instruction rather than the actual question, the search query will be poor. This is testable in isolation.
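To make the edge case concrete, the quoted expression can be wrapped in a function and exercised directly. `extractQuery` is a hypothetical wrapper name and the prompts are made-up examples.

```typescript
// Same expression as the quoted line, wrapped so it can be tested in isolation.
function extractQuery(prompt: string): string | undefined {
  return prompt.split('\n').find((l) => l.trim().length > 0)
}

// Leading blank lines are skipped, as intended.
const plain = extractQuery('\n\nWho wrote Dune?') // 'Who wrote Dune?'

// Edge case: an instruction line comes first, so it becomes the search query.
const instructed = extractQuery('Answer concisely.\nWho wrote Dune?') // 'Answer concisely.'

// A whitespace-only prompt yields no query at all.
const empty = extractQuery('   \n') // undefined
```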
🟡 LOW fetchPage silently swallows all errors — no logging on auth/parse/network failures — bench/src/research-gate.mts
Lines 65-86: fetchPage catches all errors and returns '' with zero logging. A 401 auth failure, DNS failure, or malformed JSON response is invisible. The search shot surface logs a search FAIL (line 114) but page-fetch failures within Promise.all (line 120) are completely silent. When fetched[i] is '', the context silently omits that page. This is correct for fault isolation but makes debugging zero-page
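A sketch of how the same fault isolation could be kept while making failures visible. `fetchPageLogged`, the `FetchLike` type, and the injected `warn` callback are assumptions for illustration; the real `fetchPage` signature is not shown in this review.

```typescript
// Minimal fetch-like signature so the sketch can be tested without a network.
type FetchLike = (url: string) => Promise<{ ok: boolean; status: number; text(): Promise<string> }>

// Still returns '' on failure (fault isolation), but reports why.
async function fetchPageLogged(
  url: string,
  doFetch: FetchLike,
  warn: (msg: string) => void = (msg) => console.warn(msg),
): Promise<string> {
  try {
    const res = await doFetch(url)
    if (!res.ok) {
      warn(`fetchPage: HTTP ${res.status} for ${url}`)
      return ''
    }
    return await res.text()
  } catch (err) {
    // Network and parse errors land here; log the message, keep the shot alive.
    warn(`fetchPage: ${err instanceof Error ? err.message : String(err)} for ${url}`)
    return ''
  }
}
```

A single warning per failed page is enough to distinguish a 401 from a DNS failure when a shot ends up with zero pages.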
🟡 LOW rsi.ts forwards EXA_API_KEY unconditionally — any process.env.EXA_API_KEY leaks into sandbox — bench/src/rsi.ts
Line 54: `if (process.env.EXA_API_KEY) searchEnv.EXA_API_KEY = process.env.EXA_API_KEY` forwards the key without allowlisting. The `sandboxAgentRun` docstring (experiment.ts:201-204) says 'Allowlisted keys only reach the spawned CLI' but the caller in rsi.ts does its own allowlisting by explicit key name. This is fine functionally, but the allowlist is scattered across callers rather than enforced at the `sandboxAgentRun` layer. If a caller accidentally spreads a broader env object, keys leak. Low risk in current code; noting for the allowlist-in-one-place principle.
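The allowlist-in-one-place idea could look roughly like this. `pickAllowedEnv` and the `FORWARDABLE` set are hypothetical; the two key names are taken from this PR, and where such a filter would actually live is left open.

```typescript
// Hypothetical central allowlist; the key names come from this PR's description.
const FORWARDABLE = new Set(['TANGLE_SEARCH_DEFAULT_PROVIDER', 'EXA_API_KEY'])

// Keep only allowlisted keys that are actually set, whatever the caller passes in.
function pickAllowedEnv(source: Record<string, string | undefined>): Record<string, string> {
  const out: Record<string, string> = {}
  for (const [key, value] of Object.entries(source)) {
    if (FORWARDABLE.has(key) && value !== undefined && value !== '') out[key] = value
  }
  return out
}

// Even if a caller spreads a broad env object, only allowlisted keys survive.
const forwarded = pickAllowedEnv({
  EXA_API_KEY: 'exa-key',
  AWS_SECRET_ACCESS_KEY: 'should-not-leak',
  TANGLE_SEARCH_DEFAULT_PROVIDER: undefined,
})
```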
tangletools · 2026-06-06T22:45:19Z · trace
tangletools
left a comment
✅ Approved — 8 non-blocking findings — 521c8bff
Full multi-shot audit completed 1/1 planned shots over 3 changed files. Global verifier still owns final merge decision.
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-06-06T22:45:19Z · immutable trace
Premise check withheld merge —
# Conflicts: # bench/src/research-gate.mts
What
An off-sandbox research-bench leaderboard worker, plus the one shared-primitive extension it needs.
- `bench/src/research-gate.mts` (new): model × web-search-provider × multi-shot leaderboard on the research benches (finsearchcomp / frames / hotpotqa / simpleqa), run over the router as pure HTTP (off-sandbox, so it never contends with sandbox-bound gates). Per shot: provider-pinned `/v1/search?provider=<id>` + `web_fetch` of the top-K result pages, then answer with that evidence. There are no tools on the answer call, so every arm differs only by the search provider (a clean controlled A/B). `SEARCH=default` skips search (the parametric control).
- `bench/src/experiment.ts`: `sandboxAgentRun` gains an optional `env` passthrough, merged onto the standard `OPENAI_*` box env, so a caller can pin the in-box agent's web-search provider (`TANGLE_SEARCH_DEFAULT_PROVIDER`) / forward provider keys. Additive; existing callers unchanged.
- `bench/src/rsi.ts`: forwards `SEARCH` / `EXA_API_KEY` to the box via that passthrough.
Why
First clean result through this path: on SimpleQA, you.com search lifts every model to ~90% (a model-equalizer; +70pp for gpt-4o-mini, +20pp for the strong models). The worker started as a 423-line standalone that reinvented `runExperiment` / `runPool` / `sandboxAgentRun` / corpus; this is the deep-cleaned 259-line version that consolidates onto the one-flow kernel primitives (`routerChatWithUsage`, `runPool`, `appendRunRecord`, `adapter.judge`) with zero reinvented pool/corpus/sandbox code.
Verification
- `research-gate.mts`: `tsc -p bench/tsconfig.json` clean.
- `bench/src/experiment.ts:309` (`lineage: { streaming: 'poll' }`) shows a pre-existing tsc error on `main`: the bench's installed `@tangle-network/agent-runtime` is behind `src` (no `lineage` field yet). Not introduced by this PR; flagged for a dep bump.
Related (tracked separately, not in this PR)