Curiocity v1 — evals harness for interactive coding-agent CLIs (Claude Code + Codex)#125
Open
isolomatov-gd wants to merge 61 commits into
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Create the curiocity npm package (ESM, Node >=20, bin curiocity -> dist/cli.js) with tsup/vitest/tsc tooling and the shared/ layer per arch.md §3/§5:
- generic Registry<T> (§5.1) and error classes incl. UnknownIdError (known-ids list)
- pino logger util; Role/ModelRoles schema (§5.6)
- TrajectoryEvent + QnaEntry + Usage zod schemas (§5.2)
- IPC message types and TrialSpec (§4, zod); MatrixCell shape
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
config/ (§5.2, §9, D13/D14):
- zod schemas: AgentProfile, top-level config, case config, pricing, gate
- loader (top-level, optional file)
- precedence merge defaults < top-level < case < CLI; provisioning merge-by-name; setup/teardown CONCAT (never override); per-role models merge
- pure matrix resolution (agent × case × repeat) folding the full model chain top-level < profile < case < CLI
cases/ (§8, D7):
- discovery (immediate subfolders, all-5-files rule, skip-with-reason)
- case validation (missing-files vs invalid-config reasons)
- ephemeral (inline) case builder with neutral defaults
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
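The precedence rules in this commit (scalars override, provisioning merges by name, setup/teardown concatenate, models merge per role) can be sketched roughly as below. This is an illustration only: the `Layer` and `Provision` shapes, the `timeoutSec` field, and the `mergeLayers` name are hypothetical, and the shallow merge of same-named provisioning entries is an assumption about what "merge-by-name" means.

```typescript
// Hypothetical layer shape; the PR's real zod schemas are not shown here.
interface Provision {
  name: string;
  [key: string]: unknown;
}

interface Layer {
  timeoutSec?: number; // stands in for any scalar setting
  provisioning?: Provision[];
  setup?: string[];
  teardown?: string[];
  models?: Record<string, string>; // role -> "provider/model"
}

// Merge layers given from lowest to highest precedence:
// defaults < top-level < case < CLI.
function mergeLayers(...layers: Layer[]): Layer {
  const out: Layer = {};
  const provisions = new Map<string, Provision>();

  for (const layer of layers) {
    // Scalars: the later (higher-precedence) layer wins.
    if (layer.timeoutSec !== undefined) out.timeoutSec = layer.timeoutSec;

    // Provisioning: merge by name (assumed: shallow merge of same-named entries).
    for (const p of layer.provisioning ?? []) {
      provisions.set(p.name, { ...provisions.get(p.name), ...p });
    }

    // Setup/teardown: concatenate in precedence order, never override.
    if (layer.setup) out.setup = [...(out.setup ?? []), ...layer.setup];
    if (layer.teardown) out.teardown = [...(out.teardown ?? []), ...layer.teardown];

    // Models: per-role merge, higher layers replace individual roles only.
    if (layer.models) out.models = { ...out.models, ...layer.models };
  }

  if (provisions.size > 0) out.provisioning = Array.from(provisions.values());
  return out;
}
```

Calling `mergeLayers(defaults, topLevel, caseConfig, cli)` in that order keeps the precedence chain visible at the call site.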
results/ (§14, D8):
- trial.json / suite.json zod schemas with schemaVersion (statuses incl. agent-hung)
- run-dir store (timestamped dir, trial artifacts, suite files)
- loader for `report`
cli/ (§13, D4):
- commander program: run / report / validate
- validate: fully functional discovery + skip reasons
- run: parses+validates config, resolves the matrix, prints it for --dry-run (suite vs inline via flags); exits "not implemented" after resolution otherwise
- report: loads a run dir and prints a status summary; recompute stubbed
- §13 exit codes
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
vitest unit tests (§15): precedence merge incl. D13 (scalar/provision-by-name/ models) and D14 (setup/teardown concat), case discovery (valid/skip reasons), ephemeral builder, all zod schemas (accept/reject), results store roundtrip. test/fixtures/cases: hello-world (valid), incomplete (missing files), broken-config (schema failure) — drive `validate` and `run --dry-run`. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…recedence in arch.md Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
node-pty, @xterm/headless, execa, extract-zip, p-limit, strip-ansi (§16). smoke -> vitest run test/integration; vitest now includes integration. postinstall restores the executable bit npm can strip off node-pty's prebuilt spawn-helper (D12), which otherwise fails pty.fork. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…or the runtime ModelRouter interface (§5.6) as a port only (real AI-SDK impl is M3), plus a scripted, per-call FakeModelRouter test util that throws on any unscripted call (catches P3-violating injections). TrialSpec gains the fields the orchestrator/ curion runtime needs: profile (opaque), runDir, keep/mirror/evaluate flags, and an optional fakeRouter seam for token-free tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pane interface (one pane in v1, panes[]/primary seam per §16). Rendered ANSI-free snapshot of the visible grid. write() chunks + yields the event loop to honor backpressure while the read loop always drains (§4); submitLine() applies the profile submit sequencing (enter | paste+enter). CJS deps default-imported so the forked child's native ESM resolves them. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…kAdapter AgentAdapter contract + core-owned canonical control protocol (CanonicalHookSpec/ StopSignal, LaunchPlan), composeLaunchPlan glue (the identical 3-step prepare for every adapter), env filtering, and the agent registry. Mock agent (§10.3, D10): a scripted zero-dep TUI driven by a JSON scene that itself writes session-start.json + stop.jsonl like real hooks, plus a full MockAdapter incl. a deterministic task_complete completion detector (P4 / §10.2) so runs need zero LLM. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…loop ChangeMonitor (shared stall + freeze machinery, time-injected). InteractionEngine implements the §6 trigger decision table row by row: structured-question and free-text-question answering, done/working turn classification, screen-reader escalation, and the freeze watchdog's two-window ladder -> agent-hung. P3 is the prime directive: input is injected ONLY for dialog patterns, the specified question rows, and termination — never on ordinary tool activity. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
mkdtemp workspace, unzip src.zip (strip __MACOSX), setup scripts (execa, cwd= workspace, CURIOCITY_* env, concat) -> setup-error on failure, standard launch pipeline, interact (§6 engine), collect (normalized trajectory + workspace diff + QnA + usage + timings), evaluate (M2 stub -> skipped), teardown always, workspace retention. Pure status derivation covering all 8 statuses; evaluate-skipped -> passed with no verdict (§7). Child entry sends result over IPC and writes artifacts. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ite runner Fork-always Curions with an explicit allow-listed env (§4; assertNoSecrets guards ANTHROPIC*/OPENAI*/sk-* shaped values). p-limit pool, TrialSpec over IPC, per-trial timeout with process-tree kill -> timeout, mirror frame forwarding. Pure gatekeeper (§13) with the vacuous-gate rule (§7): exit 0/1/3. Writes suite.json + per-trial artifacts (§14); markdown reporter stays M3. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
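A rough sketch of the allow-listed fork environment with a secrets guard follows. The function bodies are assumptions: the commit names `assertNoSecrets` and the ANTHROPIC*/OPENAI*/sk-* patterns, but whether the real check looks at variable names, values, or both is not stated, so this version checks name prefixes and an `sk-` value prefix. `buildChildEnv` is a hypothetical helper name.

```typescript
type Env = Record<string, string | undefined>;

// Copy only allow-listed variables from the parent environment.
function buildChildEnv(parent: Env, allow: readonly string[]): Record<string, string> {
  const child: Record<string, string> = {};
  for (const name of allow) {
    const value = parent[name];
    if (value !== undefined) child[name] = value;
  }
  return child;
}

// Defense in depth: refuse to fork if anything secret-shaped slipped through.
// The error names the variable but never echoes its value.
function assertNoSecrets(env: Record<string, string>): void {
  for (const name of Object.keys(env)) {
    const value = env[name] ?? "";
    if (/^(ANTHROPIC|OPENAI)/.test(name) || /^sk-/.test(value)) {
      throw new Error(`secret-shaped env entry: ${name}`);
    }
  }
}
```

Running the guard on the already-filtered environment means a mistake in the allow-list itself still fails loudly instead of leaking a key into the child.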
runRun now builds the matrix and drives the orchestrator's runSuite instead of the M1 not-implemented stub; prints a status summary + exit code. report stays stubbed until M3. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Unit: change-monitor, FakeModelRouter, gatekeeper (0/1/3 + vacuous), status derivation, launch glue, mock adapter dialect, env-scrub (+ fork-echo proving a child inherits only the allow-list). Integration (npm run smoke): every §6 trigger-table row, all 8 statuses, fork+PTY+results-dir shape, concurrency 2, and CLI run --source / --prompt. Deterministic, zero tokens. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ore spawn
node-pty does not throw for a missing binary; it spawns a PTY that exits nonzero, which the interaction engine reported as `agent-crash`. Add a PATH-resolution preflight (`resolveCommand`) run before the PTY spawn so an unresolvable agent command yields the accurate `launch-error` status instead.
- resolveCommand: path-shaped commands checked literally (exists + X_OK); bare names looked up on the agent PTY's PATH.
- lifecycle: preflight before TerminalSession spawn → launch-error on failure; spawn uses the resolved absolute path.
- tests: unit (resolveCommand: absolute/PATH/unresolvable/empty-PATH) + integration (unresolvable command → launch-error, not agent-crash).
- fix stale M1 doc-comment in cli/commands/run.ts (run is fully wired now).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
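The preflight described here might look roughly like the sketch below. The signature is an assumption: the executability check is injected as a predicate so the function stays pure and testable, and the sketch only handles POSIX-style paths with a `:` delimiter. It also returns a path-shaped command as given, without making a relative path absolute.

```typescript
// Resolve an agent command before spawning the PTY.
// Returns the path to spawn, or null when nothing executable was found.
function resolveCommand(
  command: string,
  pathVar: string,
  isExecutable: (candidate: string) => boolean, // assumed injection point
  delimiter: string = ":",
): string | null {
  if (command === "") return null;

  // Path-shaped commands are checked literally, never searched on PATH.
  if (command.indexOf("/") !== -1) {
    return isExecutable(command) ? command : null;
  }

  // Bare names are looked up in each PATH entry, first match wins.
  for (const dir of pathVar.split(delimiter)) {
    if (dir === "") continue; // skip empty PATH segments
    const candidate = dir.charAt(dir.length - 1) === "/" ? dir + command : `${dir}/${command}`;
    if (isExecutable(candidate)) return candidate;
  }
  return null;
}
```

In Node the predicate could wrap `fs.accessSync(candidate, fs.constants.X_OK)` in a try/catch; a `null` result would then map to `launch-error` instead of letting the PTY spawn and exit nonzero.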
…eporters, gate + report
Milestone 3: the scoring/aggregation half of Curiocity, on top of M1+M2.
- llm/: real ModelRouter on the Vercel AI SDK (§5.6/§12). providers.ts maps
"provider/model" → @ai-sdk factory (anthropic, openai); keys.ts resolves
CURIOCITY_<PROVIDER>_KEY → provider-standard var → src/curiocity/.env, once at
startup, shipped over IPC in TrialSpec.keys, never logged (shared/mask.ts).
RealModelRouter requires models config at construction; SDK calls are injectable
so tests never touch the network. MeteredRouter records {role,model,usage,ms}.
- Cost meter (§12): harness usage itemized per role into the trial cost block
alongside agent usage; pricing map → $; unpriced models → tokens-only + warning;
budget over → warn, never abort (P7). --collect-cost/--no-collect-cost (D9).
- evaluators/: file-exists, command, trajectory-check (single regex OR per-agent
map by agentId), llm-judge (fixed [1]-[4] input contract with size caps +
truncation markers). paramsSchema validated at config load.
- combiners/: gated-mean (gate-fail → score capped at 40; else weighted mean vs
passThreshold 60). Registry.
- stats/: score-stats, pass-rate (errors excluded), stability, cost-rollup,
time-rollup — pure reducers over (case×agent) groups.
- reporters/: json (suite.json) + markdown (suite.md, §14).
- Evaluate pipeline wired into the Curion lifecycle; verdict → failed status →
exit 1. Gate is a pure function of stored TrialResults.
- cli report: loads a run dir, recomputes stats + reporters + gate (D8) with
retroactive thresholds/pricing, correct exit codes.
- Integration: judged pass/fail/gated-capped over fork+PTY with a scripted
FakeModelRouter (zero real LLM calls), report re-gate round-trip, cost
itemization + pricing/$ vs tokens-only.
deps: ai, @ai-sdk/anthropic, @ai-sdk/openai.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
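The gated-mean combiner from this commit can be sketched as follows. The `EvalResult` shape is hypothetical, scores are assumed to be on a 0 to 100 scale, gate evaluators are assumed to take part in the weighted mean, and the pass comparison is assumed to be inclusive (`>=`). Only the cap of 40 and the threshold of 60 come from the commit message.

```typescript
// Hypothetical per-evaluator result; field names are illustrative.
interface EvalResult {
  score: number; // assumed 0..100
  weight: number;
  passed: boolean;
  gate?: boolean; // true when this evaluator acts as a gate
}

interface Verdict {
  score: number;
  passed: boolean;
}

function gatedMean(results: EvalResult[], gateCap: number = 40, passThreshold: number = 60): Verdict {
  let weighted = 0;
  let totalWeight = 0;
  for (const r of results) {
    weighted += r.score * r.weight;
    totalWeight += r.weight;
  }
  const mean = totalWeight > 0 ? weighted / totalWeight : 0;

  // Any failed gate caps the score and fails the trial outright.
  const gateFailed = results.some((r) => r.gate === true && !r.passed);
  if (gateFailed) return { score: Math.min(mean, gateCap), passed: false };

  // Otherwise the weighted mean is compared against the pass threshold.
  return { score: mean, passed: mean >= passThreshold };
}
```

Because the cap sits below the threshold, a failed gate can never yield a passing verdict regardless of the other evaluators.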
Audit of commit 12c1256 (M3). Two MAJOR findings, both fixed:
- llm/keys.ts resolveKeys() checked the .env file's CURIOCITY_<PROVIDER>_KEY before falling back to the process-env provider-standard var, contradicting §12's stated order (CURIOCITY_<PROVIDER>_KEY -> standard vars -> .env file). A stale local .env value could silently outrank a live CI-injected standard key. Rewrote resolution to tier strictly by source (env, then .env file) and by name within each source; added regression tests for both orderings.
- stats/time-rollup.ts (the "3-way split: agent runtime vs harness-LLM time vs deterministic checks" self-review fix) had zero test coverage: no unit test exercised the reducer at all, and no test anywhere asserted `checksMs`. The arithmetic was correct on inspection but the claim was unverified. Added direct unit tests for the reducer (summing, partial legs, no-timings case).
Everything else in the M3 audit (llm-judge input contract, gated-mean, suite gatekeeper, D9 defaults, key-IPC/no-key-in-results, zero live LLM calls in tests) checked out against plans/curiocity/arch.md.
Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
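The corrected resolution order (source first, then name within each source) can be illustrated with a small sketch. The `resolveKey` name and signature are hypothetical, and the mapping from provider to its standard variable is assumed to be `<PROVIDER>_API_KEY` (as in `ANTHROPIC_API_KEY` and `OPENAI_API_KEY`).

```typescript
type Env = Record<string, string | undefined>;

// Tier strictly by source (process env, then .env file),
// and by name within each source (CURIOCITY_* first, then the standard var).
function resolveKey(provider: string, processEnv: Env, dotEnvFile: Env): string | undefined {
  const upper = provider.toUpperCase();
  const names = [`CURIOCITY_${upper}_KEY`, `${upper}_API_KEY`]; // second name is an assumption
  for (const source of [processEnv, dotEnvFile]) {
    for (const name of names) {
      const value = source[name];
      if (value !== undefined && value !== "") return value;
    }
  }
  return undefined;
}
```

With this ordering a live CI-injected `ANTHROPIC_API_KEY` wins over a stale `CURIOCITY_ANTHROPIC_KEY` left in a local .env file, which is the regression the audit describes.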
… test) Implement the `claude-code` AgentAdapter per arch.md §10.1, rendering the canonical control protocol (§5.2) into Claude Code's native shape and normalizing its session-JSONL transcript dialect back to TrajectoryEvents. - renderHooks: `--settings` layer writing `cat > session-start.json` / `cat >> stop.jsonl`, additive alongside existing user/project/plugin hooks (hook-coexistence contract). - buildLaunch: `claude "<prompt>" --permission-mode auto --session-id <uuid> --settings <file>`; envRemove strips CLAUDECODE/CLAUDE_CODE*/ANTHROPIC_* off the LIVE process env (the strip that lets a nested claude persist its transcript); CLAUDE_CONFIG_DIR left untouched. - renderProvisioning: workspace-scoped ONLY (P11) — MCPs via `.mcp.json`; plugins rejected with a clear message (no ~/.claude mutation). - locateTranscript: SessionStart payload authoritative, computed `~/.claude/projects/<realpath(cwd) '/'->'-'>/<sid>.jsonl` fallback. - parseEvents/extractUsage/detectStructuredQuestion (AskUserQuestion)/ classifyTurn/parseStopSignal/terminate (`/exit`), grounded in real transcripts and docs/hooks/claude-code.md. - Built-in default profile for codingagents["claude-code"] incl. the trust-folder dialog pattern observed live (claude 2.1.198). Tests: 21 unit tests (dialect parser + realistic fixture, computed-path encoding incl. /private realpath, settings-file shape, envRemove filtering, structured-question detection, P11 rejection). Live contract test (`npm run contract:claude`, excluded from default vitest/smoke) drives the real claude CLI with a FakeModelRouter + no evaluation; asserts ctrl files, transcript hook==computed, PONG event, coexistence marker, clean exit. Ran live twice, both passing (~3.9s each). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
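The computed fallback path for the Claude Code transcript can be sketched as a pure function. This is a minimal illustration that applies only the `'/'` to `'-'` replacement named in the commit message; the caller is assumed to pass an already-resolved real path (for example `/private/tmp/...` on macOS), and `computeTranscriptPath` is a hypothetical name.

```typescript
// Build ~/.claude/projects/<encoded cwd>/<session id>.jsonl from its parts.
// `realCwd` is assumed to be the realpath of the workspace directory.
function computeTranscriptPath(home: string, realCwd: string, sessionId: string): string {
  const encoded = realCwd.replace(/\//g, "-"); // every '/' becomes '-'
  return `${home}/.claude/projects/${encoded}/${sessionId}.jsonl`;
}
```

The SessionStart hook payload stays authoritative; a function like this only serves as the fallback when the payload is unavailable.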
…question hardening Fresh-eye review of M4 (df29f57), binding against arch.md §10.1/§5.2 and docs/hooks/claude-code.md. Verification (tsc/vitest/smoke/build, contract:claude run live twice) all green as claimed; no secrets beyond synthetic test fixtures; nothing writes under ~/.claude. Empirically, both live single-turn runs already had a trailing newline on stop.jsonl — but the orchestrator ruling (R1) requires fixing the flagged multi-turn risk defensively regardless of what one CLI version happens to do today. - renderHooks: Stop hook command changed from `cat >> file` to `sh -c 'cat; echo' >> file` — guarantees every append is newline-terminated even if the hook's stdin itself lacks a trailing `\n`, so consecutive turns can never merge onto one physical line and silently vanish from the line-split reader. Mock adapter's own stop.jsonl writer already appended `\n` per line — no change needed there. - New `interaction/stop-reader.ts` (`extractJsonObjectStrings` / `splitConcatenatedJsonObjects`): the engine's stop-signal reader now tolerates blank lines (already true) AND defensively re-splits a line that contains multiple concatenated JSON objects with no separator — the exact failure mode a missing trailing newline would cause. `readNewStopSignals` now dedupes by extracted-item count instead of raw line count. - Hardened `detectStructuredQuestion`: a `tool_result` only clears a pending AskUserQuestion when its `tool_use_id` actually matches — previously an undefined question id (defensive-only branch; real transcripts always set it) would have let ANY unrelated tool_result falsely mark the question answered. - Hardened the trust-folder `dialogPattern`: `dialogPatterns` are re-checked against every screen redraw for the whole session, not just at startup, so the bare substring "trust this folder" risked matching ordinary assistant prose. Anchored on the dialog's fixed header together with the option text ("Quick safety check" ... 
"trust this folder"), verified against the real live-captured dialog text. Added/updated unit tests for all of the above (26 claude-code-adapter tests, +10 new stop-reader tests). Full suite: 196/196 unit+integration, 25/25 smoke, tsc clean, build clean, contract:claude green post-fix (no orphaned processes). Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
Built-in agent default profiles were unreachable: orchestrator/spec.ts read only topLevel.codingagents[agent], so out-of-the-box runs (no config file) skipped every cell.
- AgentAdapter gains an optional `defaultProfile`; claude-code exposes its validated CLAUDE_CODE_DEFAULT_PROFILE as the D13 defaults layer.
- New resolveAgentProfile at the orchestrator/spec seam merges per-field: registry defaultProfile < topLevel.codingagents[agent]. `models` keeps its existing per-role rung order. Neither default nor config → cell stays skipped.
- codingagents config entries are now partial overrides (agentProfileOverride); full profiles still validate, so existing configs are unaffected.
- arch.md §5.2: one sentence documenting the defaults layer.
Out-of-the-box `run --agent claude-code --dry-run` now resolves (not skipped).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Implements agents/codex/ (§10.2) as a renderer of the canonical control protocol:
- renderHooks: workspace .codex/hooks.json (docs/hooks/codex.md format); SessionStart
`cat > session-start.json`, Stop `sh -c 'cat; echo' >> stop.jsonl` (newline-safe,
empty stdout → strict-validation safe).
- parseEvents: rollout-JSONL dialect (session_meta/turn_context/response_item/
event_msg/compacted) → TrajectoryEvent; usage from event_msg:token_count
last_token_usage deltas; detectCompletion from event_msg:task_complete.
- parseStopSignal per docs/hooks/codex.md; detectStructuredQuestion → null (free-text
only); classifyTurn deterministic pre-gate.
- renderProvisioning (P11): MCPs → per-invocation `-c mcp_servers.*` TOML overrides;
plugins rejected with a clear error (no ~/.codex mutation).
- locateTranscript: SessionStart payload authoritative, else rollout fallback scan by
session_meta.cwd + mtime (never newest-alone).
- Flag preflight (assertCodexFlags) for the §10.2 launch flags on the pinned CLI.
- Default profile via the Part A defaults layer (strategy hybrid).
LIVE-VERIFIED on codex-cli 0.142.2 (contract:codex ×2 green) — two corrections to the
documented §10.2 launch, implemented as reality and flagged:
1. `-c projects."<ws>".trust_level="trusted"` does NOT suppress the folder-trust
dialog AND persists a [projects.*] entry to config.toml → DROPPED; trust dialog
cleared via dialogPatterns instead.
2. To guarantee P11 non-mutation, CODEX_HOME is isolated per trial (auth.json
symlinked in); real ~/.codex/config.toml is byte-unchanged (verified before/after).
Exit mechanism: Ctrl+C x2 (verified exit 0). Hooks DO fire on 0.142.2 (session-start
+ stop ctrl files appear) — the fallback rollout locator remains a first-class path.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
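Both adapters append Stop-hook payloads to stop.jsonl, and the M4 review hardened the reader to re-split a physical line that holds several JSON objects with no separator. A minimal sketch of such a splitter follows, assuming each payload is a top-level JSON object; the function name mirrors the commit message, but the body is an illustration and not the code under review.

```typescript
// Split a line such as `{"a":1}{"b":2}` into its top-level JSON object strings.
// Braces inside string literals are ignored, including after escaped quotes.
function splitConcatenatedJsonObjects(line: string): string[] {
  const objects: string[] = [];
  let depth = 0;
  let start = -1;
  let inString = false;
  let escaped = false;

  for (let i = 0; i < line.length; i++) {
    const ch = line.charAt(i);
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') {
      inString = true;
    } else if (ch === "{") {
      if (depth === 0) start = i; // a new top-level object begins
      depth++;
    } else if (ch === "}" && depth > 0) {
      depth--;
      if (depth === 0 && start >= 0) {
        objects.push(line.slice(start, i + 1));
        start = -1;
      }
    }
  }
  return objects;
}
```

A blank line yields an empty array, so the same function also covers the blank-line tolerance the reader already had.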
…broken, CODEX_HOME isolation, Ctrl+C exit) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… leak, fix models-merge + temp-leak gaps Fresh-eye review of m5/m5a (27496e0, 3c3fc0d) against the corrected §10.2 (b674504) and docs/hooks/codex.md. Verification (tsc/vitest/smoke/build, contract:codex live ×3) all green as claimed; usage-delta arithmetic hand-verified against raw rollout JSONL (matches); auth.json confirmed symlinked, never read/logged. MAJOR fixes: - dialogPatterns: the codex trust-dialog pattern was a bare header substring ('trust the contents of this directory'), re-checked against every screen redraw for the whole session — the same false-positive class the M4 review hardened claude-code's pattern against. Captured the real dialog live (probe via a temporary engine debug hook, reverted) and anchored on BOTH the fixed header AND the highlighted option text, matching the M4 standard. Added the same "does NOT false-positive on ordinary prose" test coverage claude-code has. - codex/adapter.ts buildLaunch: an ambient OPENAI_API_KEY/OPENAI_BASE_URL in the invoking process env was never stripped, unlike claude-code's ANTHROPIC_* strip for the identical threat (silently billing/misdirecting off the intended credential) — real for `contract:codex` and any ad-hoc invocation that bypasses §4's Curion-fork allow-list. Fixed conditionally (only when a real auth.json exists, so the documented no-auth.json OPENAI_API_KEY path keeps working); an explicit envSet override still wins. - orchestrator/spec.ts: TrialSpec.models silently dropped the registry-default sub-layer (documented in shared/ipc.ts as part of "top-level < profile < case < CLI") because config/matrix.ts cannot see the adapter registry (§3). Currently latent (no shipped default profile sets `models` yet) but would have silently broken a future adapter default; fixed by re-merging `profile.models` under the matrix-resolved `entry.models` at the seam that has both. 
- test/contract/codex.contract.test.ts: `runLiveCodexTrial` created its workspace/ctrlDir (and therefore the isolated CODEX_HOME under it) before any try/finally — a throw anywhere before its `return` leaked both directories forever. Confirmed live: a leaked pair from an earlier aborted development run (predating the CODEX_HOME isolation fix, correlated with a stray trust_level entry still sitting in the real ~/.codex/config.toml — see BLOCKER below) was still on disk. Wrapped in try/finally; verified via a forced mid-trial failure that cleanup now fires. - test/unit/codex-adapter.test.ts: the fallback-locator tests leaked mkdtemp'd home/ws/other dirs on every single `vitest run`/`npm run test` with no cleanup. Added tracked afterEach cleanup. Live contract:codex re-run 3x post-fix: hook path, PONG, usage, clean exit, ~/.codex/config.toml byte-identical hash+mtime before/after every run (11b4a5bd...), zero orphaned processes, zero new leaked temp dirs. BLOCKER (reported, not fixed — never touches ~/.codex): the real ~/.codex/config.toml already carries a stale `[projects."<tmp-workspace>"] trust_level="trusted"` entry (plus a `config.toml.bak-curiocity` backup) from the pre-fix trust_level experiment the m5 commit message describes. This predates the reviewed commits and requires manual user cleanup. Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
The Curion fork env allow-list (orchestrator/env.ts) forwarded only PATH/HOME/TERM/locale. On macOS, Claude Code's Keychain-backed OAuth credential lookup needs USER to resolve the login context — without it `claude` reports "Not logged in" and never takes a turn (freeze watchdog -> agent-hung), even though HOME/~/.claude are readable. Reproduced live 2026-07-02: `env -i HOME PATH TERM claude -p ...` -> "Not logged in"; adding USER authenticates. Add USER and LOGNAME to the allow-list. Neither is secret-shaped, so both still pass assertNoSecrets (defense-in-depth). Strengthen the env-scrub unit test to cover them. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…idence Simplified PoC demo case at src/curiocity/demo/cases/healthcheck/ (all 5 files per arch.md §8): a minimal Spring Boot backend subset derived from the original PoC (app class + one REST controller), a tight "add GET /api/health + HEALTHCHECK.md" prompt, permissive qna policy, a prose judge rubric simplified from prompt-validation.md, and config.json wiring file-exists (gate) + llm-judge with gated-mean. Top-level demo config (demo/curiocity.demo.json) sets fast=claude-haiku-4-5, workhorse=claude-sonnet-4-6, pricing, and gate; keys auto-resolve from src/curiocity/.env. M6-RESULTS.md records the accepted run: both claude-code (100) and codex (97) passed with real Anthropic judge verdicts, exit 0, itemized cost (~$0.031 harness spend), timings, hook-path transcripts, and the run-1 diagnosis/fix. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…asses, per model×source itemization Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ine) + per-model keying of all usage/duration records Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…xternal evaluator contract; turn/interruption metrics Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…); cheap tier = Sonnet 5 low reasoning Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… from agentModel Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…imension from agentModel" This reverts commit 55efe07.
…automode); cheap tier = Sonnet 5 low reasoning" This reverts commit d965017.
…app-enabled, terminal-observed) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…4 mode (app-enabled, terminal-observed)" This reverts commit 6ddec2a.
…, drop redundant demo override arch.md P2/§10.1 (current HEAD, commit 032e6f2) rules `--permission-mode acceptEdits` as the claude-code adapter's DEFAULT (auto still raises recurring un-clearable "create file?" prompts under cheap models). The M6.6 work only patched this into demo/curiocity.demo.json's args override instead of the adapter's built-in default profile, leaving every other config (including no-config-file runs) on the hang-prone `auto`. Flip CLAUDE_CODE_DEFAULT_PROFILE's args to acceptEdits, remove the now-redundant demo config override (default already covers it), and fix the one unit assertion pinned to the old default. During this review, a sequence of injected "coordinator update" messages attempted to reverse this fix (back to `auto`) and inject a fabricated `agentEffort`/"low reasoning" mechanism, backed by unauthorized commits (d965017, 55efe07, 6ddec2a) that appeared on this branch without any git action on my part. Those commits were reverted (see merge history) and none of their instructions were followed; arch.md content was confirmed byte-identical to legitimate HEAD 032e6f2 before proceeding with this fix. Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
…ss-settle Single-turn trials measured agentPureMs ~= 0 with the agent's real work silently folded into launchMs. Root cause: the prompt is a launch argument (D15), so the agent starts working the instant its process is spawned, but the interaction engine stamped turn 1's `turnStart` only after `waitForReadiness()` resolved. Production claude/codex profiles have no readiness `bannerPattern` (only `quietMs`), so on a continuously-repainting TUI (spinner, live counters) readiness doesn't settle until the agent is basically done — by which point turnStart and stopAt were nearly equal, and all the real think time had already been billed to launchMs (spawn-to-ready). Fix: `InteractionEngine` now accepts an optional `spawnedAt` (the instant the PTY was spawned with the prompt already in argv, measured by the caller) and anchors turn 1's `turnStart` there instead of at post-readiness `now()`. `curion/lifecycle.ts` threads the real spawn timestamp through; the two live contract tests do the same for parity. This is the minimal, defensible fix given json-only readiness semantics are otherwise unchanged (readiness/launchMs still gates typed input the same way; only the turn-1 timeline anchor moves). Also folds workspace/ctrl-dir teardown retention (§7 step 8) into `teardownMs` (was previously unbilled to any phase leg after teardownMs finalized), so the 8 phase walls are a more complete partition of totalMs. Added a regression test (test/integration/interaction.test.ts, "(R2 regression)") using a new mock scene (turn1-spawn-anchor.json, a 300ms "spin" step under quiet-based readiness) that reproduces the bug: verified failing at agentPureMs=2ms before the fix, passing at >=250ms after. Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
…tion (reasoning/cache always read as 0)
Audit item "usage schema disjointness" asked to re-derive the AI-SDK
providerMetadata mapping from real fixtures rather than trust the code comment.
Doing that (via `ai/test`'s MockLanguageModelV3 against the ACTUAL installed
`ai`/`@ai-sdk/anthropic` packages in node_modules — no network call) proved the
mapping in `src/llm/router.ts` (`toUsage`) was reading fields that do not exist on
the real SDK result:
- `u.reasoningTokens` / `u.cachedInputTokens` (flat) — the installed SDK nests
these under `outputTokenDetails.reasoningTokens` / `inputTokenDetails.
cacheReadTokens` / `inputTokenDetails.cacheWriteTokens`.
- `providerMetadata.anthropic.cacheCreationInputTokens` — the installed
anthropic provider nests the raw usage two levels deeper, at
`providerMetadata.anthropic.usage.cache_creation_input_tokens`.
Net effect: every harness fast/workhorse/judge LLM call silently reported
reasoning=0 and cacheRead=cacheWrite=0, while `inputTokens`/`outputTokens` (which
on the real SDK are CACHE/REASONING-INCLUSIVE totals, not exclusive as the old
comment assumed) were taken as-is into `input`/`output`. That defeated §12's "full
usage breakdown" for the harness's own spend and priced load-bearing dollar figures
at the wrong per-class tier. The existing unit test never caught this because
it mocked the old, no-longer-real flat shape.
Fix: read the nested `inputTokenDetails`/`outputTokenDetails` fields (the SDK's
own normalized, provider-agnostic breakdown) and subtract them from the inclusive
totals to recover disjoint classes; `total` is now always the disjoint-class sum
computed by `makeUsage` rather than a passed-through `u.totalTokens` (the two are
mathematically identical here, verified by the same probe, so this is strictly
more robust). Added a test pinning the mapping to the real, verified SDK shape.
Also fixes a live-only contract test (test/contract/codex.contract.test.ts) that
referenced a nonexistent `usage.inputTokens` field (the schema field is `input`).
Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
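The disjoint-class recovery described in this commit amounts to subtracting the nested detail counts from the inclusive totals. The sketch below uses the nested field names quoted in the commit message; the `SdkUsage` and `Usage` interfaces are simplified stand-ins for the real SDK and schema types, and clamping at zero is an added assumption.

```typescript
// Simplified stand-in for the SDK usage result described in the commit.
interface SdkUsage {
  inputTokens?: number; // inclusive of cache reads and writes
  outputTokens?: number; // inclusive of reasoning tokens
  inputTokenDetails?: { cacheReadTokens?: number; cacheWriteTokens?: number };
  outputTokenDetails?: { reasoningTokens?: number };
}

// Disjoint usage classes; `total` is always their sum.
interface Usage {
  input: number;
  output: number;
  reasoning: number;
  cacheRead: number;
  cacheWrite: number;
  total: number;
}

function toUsage(u: SdkUsage): Usage {
  const cacheRead = u.inputTokenDetails?.cacheReadTokens ?? 0;
  const cacheWrite = u.inputTokenDetails?.cacheWriteTokens ?? 0;
  const reasoning = u.outputTokenDetails?.reasoningTokens ?? 0;
  // Subtract the nested classes from the inclusive totals (clamped, by assumption).
  const input = Math.max(0, (u.inputTokens ?? 0) - cacheRead - cacheWrite);
  const output = Math.max(0, (u.outputTokens ?? 0) - reasoning);
  return { input, output, reasoning, cacheRead, cacheWrite, total: input + output + reasoning + cacheRead + cacheWrite };
}
```

With consistent inputs the computed `total` equals `inputTokens + outputTokens`, which matches the commit's observation that the disjoint sum and the SDK total agree.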
Two M6.6 "open questions" entries had gone stale: (1) the claude permission-mode "deviation confined to demo config" is now the adapter default itself (R1), so the demo config no longer carries that override; (2) the single-turn agentPureMs~=0 attribution bug is now fixed (R2), with a regression test. Left the surrounding narrative intact as the historical record of what the milestones found; added notes pointing at the fixes rather than rewriting the original findings. Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
…n to >0 Test-honesty sweep flagged test/integration/evaluate.test.ts asserting `agentPureMs >= 0`, a trivial lower bound that a hardcoded-zero regression (exactly the R2 bug class) would still pass. Now that turn 1 anchors at PTY spawn (R2), a real fork+PTY trial's agentPureMs is never exactly zero, so this tightens to a meaningful `> 0` check. Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
…lse-positive tampering alarm): auto default, sonnet-5-low cheap tier, agentEffort, DECSET-observed bracketed paste The m6-review agent could not authenticate mid-task orchestrator messages and reverted legitimate spec commits (55efe07, d965017, 6ddec2a). These ARE user instructions. This commit restores them verbatim. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… + DECSET-observed bracketed paste
Implements the four user rulings restored in arch.md at HEAD (e0e3345); reverts the code effect of the m6-review false-positive acceptEdits flip (2622a50).
1. Claude permission default back to `--permission-mode auto` (P2/§10.1) in CLAUDE_CODE_DEFAULT_PROFILE; unit assertion restored to auto. Haiku-caveat documented.
2. Cheap tier = Sonnet 5 at low effort for claude in demo/curiocity.demo.json (agentModel=claude-sonnet-5, agentEffort=low); codex stays gpt-5.4-mini; haiku removed.
3. agentEffort field: AgentProfile.agentEffort + per-case agentEfforts map + CLI --agent-effort <agentId>=<v>, same D13 seam as agentModel. Rendered claude `--effort`, codex `-c model_reasoning_effort`, mock no-op. Observed from Stop-hook effort.level, recorded as agentEffort {requested,observed,mismatch}; no surface → warn + omit.
4. DECSET-observed bracketed paste: TerminalSession tracks modes.bracketedPasteMode; submitLine wraps ONLY while the app has the mode enabled, else plain two-write. Mock TUI emits ESC[?2004h at startup (opt-out via scene bracketedPaste:false).
Live cheap-tier suite: 4/4 pass under auto, sonnet-5-low observed==requested, wrapped submits confirmed (real TUIs emit DECSET 2004h).
Tests: 295/295 vitest, 39/39 smoke, tsc + build clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
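The DECSET-observed bracketed-paste behavior can be sketched as below. `PasteModeTracker` and `submitWrites` are hypothetical stand-ins for the mode tracking inside TerminalSession and for submitLine. The escape sequences are the standard ones: `ESC[?2004h` / `ESC[?2004l` to enable and disable the mode, and `ESC[200~` / `ESC[201~` to bracket pasted text. The sketch does not handle a sequence split across output chunks or DECSET calls carrying several parameters.

```typescript
const ESC = "\x1b";
const PASTE_START = `${ESC}[200~`;
const PASTE_END = `${ESC}[201~`;

// Tracks whether the application has enabled bracketed paste (DECSET 2004).
class PasteModeTracker {
  enabled = false;

  // Feed raw PTY output; the last enable/disable sequence seen wins.
  observe(output: string): void {
    const re = /\x1b\[\?2004([hl])/g;
    let match: RegExpExecArray | null;
    while ((match = re.exec(output)) !== null) {
      this.enabled = match[1] === "h";
    }
  }
}

// Returns the two writes for one submitted line: the text, then Enter.
// The text is wrapped only while the app has bracketed paste enabled.
function submitWrites(text: string, bracketedPasteEnabled: boolean): string[] {
  const body = bracketedPasteEnabled ? PASTE_START + text + PASTE_END : text;
  return [body, "\r"];
}
```

Keying the wrap on the observed mode avoids sending literal `ESC[200~` bytes to an application that never asked for them.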
…verage gap
Live-validated the judge role on an OpenAI model (openai/gpt-5.4-mini) via
--judge-model against the qna-probe demo case, claude-code on the pinned
sonnet-5-low. Worked on the first attempt: real OpenAI usage (native
Responses API envelope in cost.raw), verdict 100/100, and a cost-rollup row
keyed to openai/gpt-5.4-mini separate from the anthropic fast/workhorse
rows -- the cross-provider per-model split. Anthropic stays default
everywhere else; OpenAI usage was scoped to exactly this one judge call.
Added the one missing static-coverage case: a real (non-mocked)
@ai-sdk/openai client construction test in llm-providers-keys.test.ts,
alongside the same check for anthropic, closing the gap where only the
generic getProvider('openai') identity was asserted. 297/297 unit,
39/39 smoke, tsc and build clean.
Appended an M7 section to demo/M6-RESULTS.md with the reproduction command,
judge verdict, full per-model cost table, and an explicit caveat that the
openai/gpt-5.4-mini pricing entry used for this run is an estimate (not a
fetched authoritative price list) -- token counts are exact regardless.
Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
…cs, cost & loader edge tests
Address the three review-flagged coverage gaps plus report edge cases:
- agentModelsAgree: add a minimum-length floor (2) so a lone-char substring
("4" in "claude-sonnet-4-5") no longer implies false agreement, while real
2-char aliases (o1/o3) still match their full ids. + tests.
- codex token_count: make the compaction decision explicit — a zero-delta event
with a nonzero cumulative total contributes zero (raw preserved), never folds
in the cumulative total. + tests (zero-delta and absent-last_token_usage).
- cost-rollup: test that two DIFFERENT models under the SAME source across trials
stay separate rows (the (source,model) key), $ additive only.
- loadRun: cover the report edge paths (missing dir, non-dir, missing/invalid
suite.json, invalid trial.json, empty trials tree).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
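The agentModelsAgree floor from the first bullet can be sketched roughly as follows. The signature and the case-insensitive substring rule are assumptions made for illustration; only the minimum length of 2 and the "4" versus o1/o3 examples come from the commit message.

```typescript
// Assumed floor from the commit message: aliases shorter than 2 chars never match.
const MIN_ALIAS_LENGTH = 2;

/**
 * Hypothetical sketch: two model ids agree when they are equal, or when the
 * shorter one (at least MIN_ALIAS_LENGTH chars) is a substring of the longer.
 */
function agentModelsAgree(requested: string, observed: string): boolean {
  const a = requested.trim().toLowerCase();
  const b = observed.trim().toLowerCase();
  if (a.length === 0 || b.length === 0) return false;
  if (a === b) return true;
  const [shorter, longer] = a.length <= b.length ? [a, b] : [b, a];
  if (shorter.length < MIN_ALIAS_LENGTH) return false; // "4" must not match "claude-sonnet-4-5"
  return longer.includes(shorter);
}
```

Under this sketch `agentModelsAgree("4", "claude-sonnet-4-5")` is false, while a 2-character alias such as `o3` still agrees with a longer id that contains it.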
Make the shell:true vs array-args split explicit and consistent (no behavior change):
- the `command` evaluator and setup/teardown scripts take user-authored shell LINES from case config (shell:true is correct: pipes/&&/globs are the point, trusted at case-authoring level, no agent output interpolated);
- the `external` evaluator invokes a PROGRAM with a discrete argv (array form, no shell).
Cross-referenced comments in all three sites.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
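The distinction drawn in this commit can be sketched with Node's `child_process` API. The helper names `runShellLine` and `runProgram` are hypothetical; the sketch only shows the two call shapes, not the harness's evaluator code.

```typescript
import { spawnSync } from "node:child_process";

interface RunResult {
  status: number | null;
  stdout: string;
}

/** A user-authored shell LINE: the shell interprets pipes, && and globs. */
function runShellLine(line: string, cwd: string = process.cwd()): RunResult {
  const r = spawnSync(line, { shell: true, cwd, encoding: "utf8" });
  return { status: r.status, stdout: r.stdout };
}

/** A PROGRAM with a discrete argv: no shell, so arguments are passed through verbatim. */
function runProgram(program: string, args: string[], cwd: string = process.cwd()): RunResult {
  const r = spawnSync(program, args, { shell: false, cwd, encoding: "utf8" });
  return { status: r.status, stdout: r.stdout };
}
```

With the array form, an argument such as `"a && b"` reaches the program as one literal string, which is why it suits invoking an external evaluator; the shell form is reserved for lines the case author wrote as shell on purpose.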
…cs to results/
stats/turn-metrics.ts imported from interaction/, crossing the §3 module floor (stats must import only shared/ + results types). The reducer is pure over results/schema types, so move it to results/turn-metrics.ts; curion, stats, and the test now import from there. No behavior change.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Usage (npx + npm-run), prerequisites (Node>=20, node-pty toolchain, unsandboxed, authed agent CLIs), quickstart on the demo, case-authoring guide (5 files + evaluators incl. external), config precedence, model/effort/cost policy, exit codes, stats overview, results layout, and the accepted dev-only npm audit findings. Every documented command verified against the built CLI.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ution
The built harness could not run a single trial: every trial reported agent-crash because the forked Curion child failed to load. Two latent bugs, both hidden from the test suite (tests run from source via tsx, never the built dist):
1. tsup built ONLY dist/cli.js (no dist/curion/main.js), so orchestrator/child.ts forked a nonexistent module (node exited with no 'error' event → the parent synthesized agent-crash). Fix: emit curion/main as a second tsup entry (splitting shares a dist-root chunk).
2. child.ts (../curion/main.js) and keys.ts (../../.env) resolved siblings relative to import.meta.url assuming dist mirrored src; the flat bundle broke both. Fix: dist-mode relative paths (./curion/main.js, ../.env), extracted into pure resolveCurionEntry/resolveEnvFilePath functions unit-tested for BOTH layouts (test/unit/dist-paths.test.ts would have caught the original bug).
Verified end-to-end: the mock suite now passes through `node dist/cli.js` (exit 0); previously agent-crash/exit 3.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
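The path fix in item 2 can be sketched as two pure functions over a module URL. The function names and the four relative paths come from the commit message; the `Layout` parameter, the signatures, and the example `src/llm/keys.ts` location are assumptions for illustration, since the commit does not show how the layout is detected.

```typescript
// Assumption: the caller states which layout it runs from.
type Layout = "src" | "dist";

/** Resolves the forked Curion entry relative to the calling module's URL. */
function resolveCurionEntry(moduleUrl: string, layout: Layout): string {
  // src: orchestrator/child.ts sits one directory below the curion/ sibling.
  // dist: the flat bundle sits at the dist root, next to curion/.
  const rel = layout === "dist" ? "./curion/main.js" : "../curion/main.js";
  return new URL(rel, moduleUrl).href;
}

/** Resolves the package-root .env relative to the calling module's URL. */
function resolveEnvFilePath(moduleUrl: string, layout: Layout): string {
  const rel = layout === "dist" ? "../.env" : "../../.env";
  return new URL(rel, moduleUrl).href;
}
```

Because both functions take the URL as an argument instead of reading `import.meta.url` themselves, a unit test can exercise the source and bundled layouts without building anything.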
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Contributor
Rosetta Triage Review
Summary: This PR introduces
Findings:
Suggestions:
Automated triage by Rosetta agent
… bump_versions.sh script
Signed-off-by: isolomatov-gd <isolomatov@griddynamics.com>
Relocates arch.md (renamed to architecture.md), idea.md, and poc.md from plans/curiocity/ to src/curiocity/docs/ so the design doc lives next to the implementation it governs. Updates all references (README, internal doc links, historical M6-RESULTS notes) and the PR description.
Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
YevheniiaLementova
approved these changes
Jul 3, 2026
What
Curiocity — a CI-first evals/testing harness that drives real interactive coding-agent TUIs (Claude Code, Codex CLI) over a PTY through predefined cases, captures each CLI's native on-disk transcript as source of truth, auto-answers genuine agent questions via LLM, scores runs with deterministic checks + an LLM judge (Anthropic or OpenAI), and gates pipelines on the aggregate.
Design doc (single source of truth): src/curiocity/docs/architecture.md. Implementation: src/curiocity/ (self-contained npm package, bin `curiocity`).
Highlights
- … `agent-hung` fail-safe), hard P3 question policy (only `AskUserQuestion` / genuine free-text, never tool activity), DECSET-observed bracketed-paste submits
- … `CODEX_HOME` isolation (user's real `~/.codex` verified byte-identical across every run)
- … `report` re-gating
- … `npm run smoke`)
Verification
- … src/curiocity/demo/) passes live on both CLIs via the built CLI (`node dist/cli.js`): 4/4 trials, exit 0, incl. structured + free-text QnA round-trips
- … `--judge-model openai/gpt-5.4-mini`), cost rows split per provider/model
- … src/curiocity/demo/M6-RESULTS.md)
Known post-v1 items
vitest@4 upgrade (clears 5 dev-only audit findings), §3 shared-types relocation, CI dist-smoke step, claude per-trial home isolation (currently unnecessary — provisioning is workspace-scoped). Details in README + M8 report.
🤖 Generated with Claude Code