Curiocity v1 — evals harness for interactive coding-agent CLIs (Claude Code + Codex) by isolomatov-gd · Pull Request #125 · griddynamics/rosetta

isolomatov-gd · 2026-07-03T12:26:48Z

What

Curiocity — a CI-first evals/testing harness that drives real interactive coding-agent TUIs (Claude Code, Codex CLI) over a PTY through predefined cases, captures each CLI's native on-disk transcript as source of truth, auto-answers genuine agent questions via LLM, scores runs with deterministic checks + an LLM judge (Anthropic or OpenAI), and gates pipelines on the aggregate.

Design doc (single source of truth): src/curiocity/docs/architecture.md. Implementation: src/curiocity/ (self-contained npm package, bin curiocity).

Highlights

Fork-per-trial orchestration — bounded pool, env-scrubbed children (secrets travel via IPC only, never env), per-trial timeout with process-tree kill
Interaction engine — deterministic-first: stall detector + freeze watchdog (agent-hung fail-safe), hard P3 question policy (only AskUserQuestion / genuine free-text — never tool activity), DECSET-observed bracketed-paste submits
Both adapters live-validated — hook-based capture (SessionStart/Stop), computed-path/rollout fallbacks, per-trial CODEX_HOME isolation (user's real ~/.codex verified byte-identical across every run)
Evaluators — file-exists, command, trajectory-check (per-agent patterns), llm-judge (fixed 4-part input contract), and hook-style external evaluators (stdin JSON paths → 0-100 metrics)
Stats — per model×source token classes (input/output/reasoning/cacheWrite/cacheRead + raw), measured time decomposition (pure agent vs harness reaction, per-turn timeline), turn/interruption metrics, tiered pricing, stability classification, retroactive report re-gating
agentModel/agentEffort — pin the agent CLI's own model and reasoning effort per profile/case/CLI; requested-vs-observed recorded per trial
Mock agent — scripted TUI fixture; the whole engine is integration-tested token-free (npm run smoke)

Verification

312 unit/integration + 39 smoke tests, 0 skipped; tsc/build clean from fresh install
Demo suite (src/curiocity/demo/) passes live on both CLIs via the built CLI (node dist/cli.js): 4/4 trials, exit 0, incl. structured + free-text QnA round-trips
Cross-provider judge verified live (--judge-model openai/gpt-5.4-mini), cost rows split per provider/model
Every milestone independently reviewed by a fresh-context agent; all claims re-verified (evidence trail in src/curiocity/demo/M6-RESULTS.md)

Known post-v1 items

vitest@4 upgrade (clears 5 dev-only audit findings), §3 shared-types relocation, CI dist-smoke step, claude per-trial home isolation (currently unnecessary — provisioning is workspace-scoped). Details in README + M8 report.

🤖 Generated with Claude Code

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Create the curiocity npm package (ESM, Node >=20, bin curiocity -> dist/cli.js) with tsup/vitest/tsc tooling and the shared/ layer per arch.md §3/§5: - generic Registry<T> (§5.1) and error classes incl. UnknownIdError (known-ids list) - pino logger util; Role/ModelRoles schema (§5.6) - TrajectoryEvent + QnaEntry + Usage zod schemas (§5.2) - IPC message types and TrialSpec (§4, zod); MatrixCell shape Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

config/ (§5.2, §9, D13/D14): - zod schemas: AgentProfile, top-level config, case config, pricing, gate - loader (top-level, optional file) - precedence merge defaults < top-level < case < CLI; provisioning merge-by-name; setup/teardown CONCAT (never override); per-role models merge - pure matrix resolution (agent × case × repeat) folding the full model chain top-level < profile < case < CLI cases/ (§8, D7): - discovery (immediate subfolders, all-5-files rule, skip-with-reason) - case validation (missing-files vs invalid-config reasons) - ephemeral (inline) case builder with neutral defaults Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

results/ (§14, D8): - trial.json / suite.json zod schemas with schemaVersion (statuses incl. agent-hung) - run-dir store (timestamped dir, trial artifacts, suite files) - loader for `report` cli/ (§13, D4): - commander program: run / report / validate - validate: fully functional discovery + skip reasons - run: parses+validates config, resolves the matrix, prints it for --dry-run (suite vs inline via flags); exits "not implemented" after resolution otherwise - report: loads a run dir and prints a status summary; recompute stubbed - §13 exit codes Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

vitest unit tests (§15): precedence merge incl. D13 (scalar/provision-by-name/ models) and D14 (setup/teardown concat), case discovery (valid/skip reasons), ephemeral builder, all zod schemas (accept/reject), results store roundtrip. test/fixtures/cases: hello-world (valid), incomplete (missing files), broken-config (schema failure) — drive `validate` and `run --dry-run`. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…recedence in arch.md Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

node-pty, @xterm/headless, execa, extract-zip, p-limit, strip-ansi (§16). smoke -> vitest run test/integration; vitest now includes integration. postinstall restores the executable bit npm can strip off node-pty's prebuilt spawn-helper (D12), which otherwise fails pty.fork. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…or the runtime ModelRouter interface (§5.6) as a port only (real AI-SDK impl is M3), plus a scripted, per-call FakeModelRouter test util that throws on any unscripted call (catches P3-violating injections). TrialSpec gains the fields the orchestrator/ curion runtime needs: profile (opaque), runDir, keep/mirror/evaluate flags, and an optional fakeRouter seam for token-free tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Pane interface (one pane in v1, panes[]/primary seam per §16). Rendered ANSI-free snapshot of the visible grid. write() chunks + yields the event loop to honor backpressure while the read loop always drains (§4); submitLine() applies the profile submit sequencing (enter | paste+enter). CJS deps default-imported so the forked child's native ESM resolves them. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…kAdapter AgentAdapter contract + core-owned canonical control protocol (CanonicalHookSpec/ StopSignal, LaunchPlan), composeLaunchPlan glue (the identical 3-step prepare for every adapter), env filtering, and the agent registry. Mock agent (§10.3, D10): a scripted zero-dep TUI driven by a JSON scene that itself writes session-start.json + stop.jsonl like real hooks, plus a full MockAdapter incl. a deterministic task_complete completion detector (P4 / §10.2) so runs need zero LLM. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…loop ChangeMonitor (shared stall + freeze machinery, time-injected). InteractionEngine implements the §6 trigger decision table row by row: structured-question and free-text-question answering, done/working turn classification, screen-reader escalation, and the freeze watchdog's two-window ladder -> agent-hung. P3 is the prime directive: input is injected ONLY for dialog patterns, the specified question rows, and termination — never on ordinary tool activity. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

mkdtemp workspace, unzip src.zip (strip __MACOSX), setup scripts (execa, cwd= workspace, CURIOCITY_* env, concat) -> setup-error on failure, standard launch pipeline, interact (§6 engine), collect (normalized trajectory + workspace diff + QnA + usage + timings), evaluate (M2 stub -> skipped), teardown always, workspace retention. Pure status derivation covering all 8 statuses; evaluate-skipped -> passed with no verdict (§7). Child entry sends result over IPC and writes artifacts. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ite runner Fork-always Curions with an explicit allow-listed env (§4; assertNoSecrets guards ANTHROPIC*/OPENAI*/sk-* shaped values). p-limit pool, TrialSpec over IPC, per-trial timeout with process-tree kill -> timeout, mirror frame forwarding. Pure gatekeeper (§13) with the vacuous-gate rule (§7): exit 0/1/3. Writes suite.json + per-trial artifacts (§14); markdown reporter stays M3. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

runRun now builds the matrix and drives the orchestrator's runSuite instead of the M1 not-implemented stub; prints a status summary + exit code. report stays stubbed until M3. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Unit: change-monitor, FakeModelRouter, gatekeeper (0/1/3 + vacuous), status derivation, launch glue, mock adapter dialect, env-scrub (+ fork-echo proving a child inherits only the allow-list). Integration (npm run smoke): every §6 trigger-table row, all 8 statuses, fork+PTY+results-dir shape, concurrency 2, and CLI run --source / --prompt. Deterministic, zero tokens. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ore spawn node-pty does not throw for a missing binary; it spawns a PTY that exits nonzero, which the interaction engine reported as `agent-crash`. Add a PATH-resolution preflight (`resolveCommand`) run before the PTY spawn so an unresolvable agent command yields the accurate `launch-error` status instead. - resolveCommand: path-shaped commands checked literally (exists + X_OK); bare names looked up on the agent PTY's PATH. - lifecycle: preflight before TerminalSession spawn → launch-error on failure; spawn uses the resolved absolute path. - tests: unit (resolveCommand: absolute/PATH/unresolvable/empty-PATH) + integration (unresolvable command → launch-error, not agent-crash). - fix stale M1 doc-comment in cli/commands/run.ts (run is fully wired now). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…eporters, gate + report Milestone 3: the scoring/aggregation half of Curiocity, on top of M1+M2. - llm/: real ModelRouter on the Vercel AI SDK (§5.6/§12). providers.ts maps "provider/model" → @ai-sdk factory (anthropic, openai); keys.ts resolves CURIOCITY_<PROVIDER>_KEY → provider-standard var → src/curiocity/.env, once at startup, shipped over IPC in TrialSpec.keys, never logged (shared/mask.ts). RealModelRouter requires models config at construction; SDK calls are injectable so tests never touch the network. MeteredRouter records {role,model,usage,ms}. - Cost meter (§12): harness usage itemized per role into the trial cost block alongside agent usage; pricing map → $; unpriced models → tokens-only + warning; budget over → warn, never abort (P7). --collect-cost/--no-collect-cost (D9). - evaluators/: file-exists, command, trajectory-check (single regex OR per-agent map by agentId), llm-judge (fixed [1]-[4] input contract with size caps + truncation markers). paramsSchema validated at config load. - combiners/: gated-mean (gate-fail → score capped at 40; else weighted mean vs passThreshold 60). Registry. - stats/: score-stats, pass-rate (errors excluded), stability, cost-rollup, time-rollup — pure reducers over (case×agent) groups. - reporters/: json (suite.json) + markdown (suite.md, §14). - Evaluate pipeline wired into the Curion lifecycle; verdict → failed status → exit 1. Gate is a pure function of stored TrialResults. - cli report: loads a run dir, recomputes stats + reporters + gate (D8) with retroactive thresholds/pricing, correct exit codes. - Integration: judged pass/fail/gated-capped over fork+PTY with a scripted FakeModelRouter (zero real LLM calls), report re-gate round-trip, cost itemization + pricing/$ vs tokens-only. deps: ai, @ai-sdk/anthropic, @ai-sdk/openai. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Audit of commit 12c1256 (M3). Two MAJOR findings, both fixed: - llm/keys.ts resolveKeys() checked the .env file's CURIOCITY_<PROVIDER>_KEY before falling back to the process-env provider-standard var, contradicting §12's stated order (CURIOCITY_<PROVIDER>_KEY -> standard vars -> .env file). A stale local .env value could silently outrank a live CI-injected standard key. Rewrote resolution to tier strictly by source (env, then .env file) and by name within each source; added regression tests for both orderings. - stats/time-rollup.ts (the "3-way split: agent runtime vs harness-LLM time vs deterministic checks" self-review fix) had zero test coverage — no unit test exercised the reducer at all, and no test anywhere asserted `checksMs`. The arithmetic was correct on inspection but the claim was unverified. Added direct unit tests for the reducer (summing, partial legs, no-timings case). Everything else in the M3 audit (llm-judge input contract, gated-mean, suite gatekeeper, D9 defaults, key-IPC/no-key-in-results, zero live LLM calls in tests) checked out against plans/curiocity/arch.md. Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>

… test) Implement the `claude-code` AgentAdapter per arch.md §10.1, rendering the canonical control protocol (§5.2) into Claude Code's native shape and normalizing its session-JSONL transcript dialect back to TrajectoryEvents. - renderHooks: `--settings` layer writing `cat > session-start.json` / `cat >> stop.jsonl`, additive alongside existing user/project/plugin hooks (hook-coexistence contract). - buildLaunch: `claude "<prompt>" --permission-mode auto --session-id <uuid> --settings <file>`; envRemove strips CLAUDECODE/CLAUDE_CODE*/ANTHROPIC_* off the LIVE process env (the strip that lets a nested claude persist its transcript); CLAUDE_CONFIG_DIR left untouched. - renderProvisioning: workspace-scoped ONLY (P11) — MCPs via `.mcp.json`; plugins rejected with a clear message (no ~/.claude mutation). - locateTranscript: SessionStart payload authoritative, computed `~/.claude/projects/<realpath(cwd) '/'->'-'>/<sid>.jsonl` fallback. - parseEvents/extractUsage/detectStructuredQuestion (AskUserQuestion)/ classifyTurn/parseStopSignal/terminate (`/exit`), grounded in real transcripts and docs/hooks/claude-code.md. - Built-in default profile for codingagents["claude-code"] incl. the trust-folder dialog pattern observed live (claude 2.1.198). Tests: 21 unit tests (dialect parser + realistic fixture, computed-path encoding incl. /private realpath, settings-file shape, envRemove filtering, structured-question detection, P11 rejection). Live contract test (`npm run contract:claude`, excluded from default vitest/smoke) drives the real claude CLI with a FakeModelRouter + no evaluation; asserts ctrl files, transcript hook==computed, PONG event, coexistence marker, clean exit. Ran live twice, both passing (~3.9s each). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…question hardening Fresh-eye review of M4 (df29f57), binding against arch.md §10.1/§5.2 and docs/hooks/claude-code.md. Verification (tsc/vitest/smoke/build, contract:claude run live twice) all green as claimed; no secrets beyond synthetic test fixtures; nothing writes under ~/.claude. Empirically, both live single-turn runs already had a trailing newline on stop.jsonl — but the orchestrator ruling (R1) requires fixing the flagged multi-turn risk defensively regardless of what one CLI version happens to do today. - renderHooks: Stop hook command changed from `cat >> file` to `sh -c 'cat; echo' >> file` — guarantees every append is newline-terminated even if the hook's stdin itself lacks a trailing `\n`, so consecutive turns can never merge onto one physical line and silently vanish from the line-split reader. Mock adapter's own stop.jsonl writer already appended `\n` per line — no change needed there. - New `interaction/stop-reader.ts` (`extractJsonObjectStrings` / `splitConcatenatedJsonObjects`): the engine's stop-signal reader now tolerates blank lines (already true) AND defensively re-splits a line that contains multiple concatenated JSON objects with no separator — the exact failure mode a missing trailing newline would cause. `readNewStopSignals` now dedupes by extracted-item count instead of raw line count. - Hardened `detectStructuredQuestion`: a `tool_result` only clears a pending AskUserQuestion when its `tool_use_id` actually matches — previously an undefined question id (defensive-only branch; real transcripts always set it) would have let ANY unrelated tool_result falsely mark the question answered. - Hardened the trust-folder `dialogPattern`: `dialogPatterns` are re-checked against every screen redraw for the whole session, not just at startup, so the bare substring "trust this folder" risked matching ordinary assistant prose. Anchored on the dialog's fixed header together with the option text ("Quick safety check" ... "trust this folder"), verified against the real live-captured dialog text. Added/updated unit tests for all of the above (26 claude-code-adapter tests, +10 new stop-reader tests). Full suite: 196/196 unit+integration, 25/25 smoke, tsc clean, build clean, contract:claude green post-fix (no orphaned processes). Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>

Built-in agent default profiles were unreachable: orchestrator/spec.ts read only topLevel.codingagents[agent], so out-of-the-box runs (no config file) skipped every cell. - AgentAdapter gains an optional `defaultProfile`; claude-code exposes its validated CLAUDE_CODE_DEFAULT_PROFILE as the D13 defaults layer. - New resolveAgentProfile at the orchestrator/spec seam merges per-field: registry defaultProfile < topLevel.codingagents[agent]. `models` keeps its existing per-role rung order. Neither default nor config → cell stays skipped. - codingagents config entries are now partial overrides (agentProfileOverride); full profiles still validate, so existing configs are unaffected. - arch.md §5.2: one sentence documenting the defaults layer. Out-of-the-box `run --agent claude-code --dry-run` now resolves (not skipped). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Implements agents/codex/ (§10.2) as a renderer of the canonical control protocol: - renderHooks: workspace .codex/hooks.json (docs/hooks/codex.md format); SessionStart `cat > session-start.json`, Stop `sh -c 'cat; echo' >> stop.jsonl` (newline-safe, empty stdout → strict-validation safe). - parseEvents: rollout-JSONL dialect (session_meta/turn_context/response_item/ event_msg/compacted) → TrajectoryEvent; usage from event_msg:token_count last_token_usage deltas; detectCompletion from event_msg:task_complete. - parseStopSignal per docs/hooks/codex.md; detectStructuredQuestion → null (free-text only); classifyTurn deterministic pre-gate. - renderProvisioning (P11): MCPs → per-invocation `-c mcp_servers.*` TOML overrides; plugins rejected with a clear error (no ~/.codex mutation). - locateTranscript: SessionStart payload authoritative, else rollout fallback scan by session_meta.cwd + mtime (never newest-alone). - Flag preflight (assertCodexFlags) for the §10.2 launch flags on the pinned CLI. - Default profile via the Part A defaults layer (strategy hybrid). LIVE-VERIFIED on codex-cli 0.142.2 (contract:codex ×2 green) — two corrections to the documented §10.2 launch, implemented as reality and flagged: 1. `-c projects."<ws>".trust_level="trusted"` does NOT suppress the folder-trust dialog AND persists a [projects.*] entry to config.toml → DROPPED; trust dialog cleared via dialogPatterns instead. 2. To guarantee P11 non-mutation, CODEX_HOME is isolated per trial (auth.json symlinked in); real ~/.codex/config.toml is byte-unchanged (verified before/after). Exit mechanism: Ctrl+C x2 (verified exit 0). Hooks DO fire on 0.142.2 (session-start + stop ctrl files appear) — the fallback rollout locator remains a first-class path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…broken, CODEX_HOME isolation, Ctrl+C exit) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

… leak, fix models-merge + temp-leak gaps Fresh-eye review of m5/m5a (27496e0, 3c3fc0d) against the corrected §10.2 (b674504) and docs/hooks/codex.md. Verification (tsc/vitest/smoke/build, contract:codex live ×3) all green as claimed; usage-delta arithmetic hand-verified against raw rollout JSONL (matches); auth.json confirmed symlinked, never read/logged. MAJOR fixes: - dialogPatterns: the codex trust-dialog pattern was a bare header substring ('trust the contents of this directory'), re-checked against every screen redraw for the whole session — the same false-positive class the M4 review hardened claude-code's pattern against. Captured the real dialog live (probe via a temporary engine debug hook, reverted) and anchored on BOTH the fixed header AND the highlighted option text, matching the M4 standard. Added the same "does NOT false-positive on ordinary prose" test coverage claude-code has. - codex/adapter.ts buildLaunch: an ambient OPENAI_API_KEY/OPENAI_BASE_URL in the invoking process env was never stripped, unlike claude-code's ANTHROPIC_* strip for the identical threat (silently billing/misdirecting off the intended credential) — real for `contract:codex` and any ad-hoc invocation that bypasses §4's Curion-fork allow-list. Fixed conditionally (only when a real auth.json exists, so the documented no-auth.json OPENAI_API_KEY path keeps working); an explicit envSet override still wins. - orchestrator/spec.ts: TrialSpec.models silently dropped the registry-default sub-layer (documented in shared/ipc.ts as part of "top-level < profile < case < CLI") because config/matrix.ts cannot see the adapter registry (§3). Currently latent (no shipped default profile sets `models` yet) but would have silently broken a future adapter default; fixed by re-merging `profile.models` under the matrix-resolved `entry.models` at the seam that has both. - test/contract/codex.contract.test.ts: `runLiveCodexTrial` created its workspace/ctrlDir (and therefore the isolated CODEX_HOME under it) before any try/finally — a throw anywhere before its `return` leaked both directories forever. Confirmed live: a leaked pair from an earlier aborted development run (predating the CODEX_HOME isolation fix, correlated with a stray trust_level entry still sitting in the real ~/.codex/config.toml — see BLOCKER below) was still on disk. Wrapped in try/finally; verified via a forced mid-trial failure that cleanup now fires. - test/unit/codex-adapter.test.ts: the fallback-locator tests leaked mkdtemp'd home/ws/other dirs on every single `vitest run`/`npm run test` with no cleanup. Added tracked afterEach cleanup. Live contract:codex re-run 3x post-fix: hook path, PONG, usage, clean exit, ~/.codex/config.toml byte-identical hash+mtime before/after every run (11b4a5bd...), zero orphaned processes, zero new leaked temp dirs. BLOCKER (reported, not fixed — never touches ~/.codex): the real ~/.codex/config.toml already carries a stale `[projects."<tmp-workspace>"] trust_level="trusted"` entry (plus a `config.toml.bak-curiocity` backup) from the pre-fix trust_level experiment the m5 commit message describes. This predates the reviewed commits and requires manual user cleanup. Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>

The Curion fork env allow-list (orchestrator/env.ts) forwarded only PATH/HOME/TERM/locale. On macOS, Claude Code's Keychain-backed OAuth credential lookup needs USER to resolve the login context — without it `claude` reports "Not logged in" and never takes a turn (freeze watchdog -> agent-hung), even though HOME/~/.claude are readable. Reproduced live 2026-07-02: `env -i HOME PATH TERM claude -p ...` -> "Not logged in"; adding USER authenticates. Add USER and LOGNAME to the allow-list. Neither is secret-shaped, so both still pass assertNoSecrets (defense-in-depth). Strengthen the env-scrub unit test to cover them. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…idence Simplified PoC demo case at src/curiocity/demo/cases/healthcheck/ (all 5 files per arch.md §8): a minimal Spring Boot backend subset derived from the original PoC (app class + one REST controller), a tight "add GET /api/health + HEALTHCHECK.md" prompt, permissive qna policy, a prose judge rubric simplified from prompt-validation.md, and config.json wiring file-exists (gate) + llm-judge with gated-mean. Top-level demo config (demo/curiocity.demo.json) sets fast=claude-haiku-4-5, workhorse=claude-sonnet-4-6, pricing, and gate; keys auto-resolve from src/curiocity/.env. M6-RESULTS.md records the accepted run: both claude-code (100) and codex (97) passed with real Anthropic judge verdicts, exit 0, itemized cost (~$0.031 harness spend), timings, hook-path transcripts, and the run-1 diagnosis/fix. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…asses, per model×source itemization Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ine) + per-model keying of all usage/duration records Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…xternal evaluator contract; turn/interruption metrics Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…); cheap tier = Sonnet 5 low reasoning Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

… from agentModel Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…imension from agentModel" This reverts commit 55efe07.

…automode); cheap tier = Sonnet 5 low reasoning" This reverts commit d965017.

…app-enabled, terminal-observed) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…4 mode (app-enabled, terminal-observed)" This reverts commit 6ddec2a.

…, drop redundant demo override arch.md P2/§10.1 (current HEAD, commit 032e6f2) rules `--permission-mode acceptEdits` as the claude-code adapter's DEFAULT (auto still raises recurring un-clearable "create file?" prompts under cheap models). The M6.6 work only patched this into demo/curiocity.demo.json's args override instead of the adapter's built-in default profile, leaving every other config (including no-config-file runs) on the hang-prone `auto`. Flip CLAUDE_CODE_DEFAULT_PROFILE's args to acceptEdits, remove the now-redundant demo config override (default already covers it), and fix the one unit assertion pinned to the old default. During this review, a sequence of injected "coordinator update" messages attempted to reverse this fix (back to `auto`) and inject a fabricated `agentEffort`/"low reasoning" mechanism, backed by unauthorized commits (d965017, 55efe07, 6ddec2a) that appeared on this branch without any git action on my part. Those commits were reverted (see merge history) and none of their instructions were followed; arch.md content was confirmed byte-identical to legitimate HEAD 032e6f2 before proceeding with this fix. Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>

…ss-settle Single-turn trials measured agentPureMs ~= 0 with the agent's real work silently folded into launchMs. Root cause: the prompt is a launch argument (D15), so the agent starts working the instant its process is spawned, but the interaction engine stamped turn 1's `turnStart` only after `waitForReadiness()` resolved. Production claude/codex profiles have no readiness `bannerPattern` (only `quietMs`), so on a continuously-repainting TUI (spinner, live counters) readiness doesn't settle until the agent is basically done — by which point turnStart and stopAt were nearly equal, and all the real think time had already been billed to launchMs (spawn-to-ready). Fix: `InteractionEngine` now accepts an optional `spawnedAt` (the instant the PTY was spawned with the prompt already in argv, measured by the caller) and anchors turn 1's `turnStart` there instead of at post-readiness `now()`. `curion/lifecycle.ts` threads the real spawn timestamp through; the two live contract tests do the same for parity. This is the minimal, defensible fix given json-only readiness semantics are otherwise unchanged (readiness/launchMs still gates typed input the same way; only the turn-1 timeline anchor moves). Also folds workspace/ctrl-dir teardown retention (§7 step 8) into `teardownMs` (was previously unbilled to any phase leg after teardownMs finalized), so the 8 phase walls are a more complete partition of totalMs. Added a regression test (test/integration/interaction.test.ts, "(R2 regression)") using a new mock scene (turn1-spawn-anchor.json, a 300ms "spin" step under quiet-based readiness) that reproduces the bug: verified failing at agentPureMs=2ms before the fix, passing at >=250ms after. Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>

…tion (reasoning/cache always read as 0) Audit item "usage schema disjointness" asked to re-derive the AI-SDK providerMetadata mapping from real fixtures rather than trust the code comment. Doing that (via `ai/test`'s MockLanguageModelV3 against the ACTUAL installed `ai`/`@ai-sdk/anthropic` packages in node_modules — no network call) proved the mapping in `src/llm/router.ts` (`toUsage`) was reading fields that do not exist on the real SDK result: - `u.reasoningTokens` / `u.cachedInputTokens` (flat) — the installed SDK nests these under `outputTokenDetails.reasoningTokens` / `inputTokenDetails. cacheReadTokens` / `inputTokenDetails.cacheWriteTokens`. - `providerMetadata.anthropic.cacheCreationInputTokens` — the installed anthropic provider nests the raw usage two levels deeper, at `providerMetadata.anthropic.usage.cache_creation_input_tokens`. Net effect: every harness fast/workhorse/judge LLM call silently reported reasoning=0 and cacheRead=cacheWrite=0, while `inputTokens`/`outputTokens` (which on the real SDK are CACHE/REASONING-INCLUSIVE totals, not exclusive as the old comment assumed) were taken as-is into `input`/`output` — defeating §12's "full usage breakdown" for the harness's own spend and load-bearing dollar figures onto the wrong per-class pricing tier. The existing unit test never caught this because it mocked the old, no-longer-real flat shape. Fix: read the nested `inputTokenDetails`/`outputTokenDetails` fields (the SDK's own normalized, provider-agnostic breakdown) and subtract them from the inclusive totals to recover disjoint classes; `total` is now always the disjoint-class sum computed by `makeUsage` rather than a passed-through `u.totalTokens` (the two are mathematically identical here, verified by the same probe, so this is strictly more robust). Added a test pinning the mapping to the real, verified SDK shape. Also fixes a live-only contract test (test/contract/codex.contract.test.ts) that referenced a nonexistent `usage.inputTokens` field (the schema field is `input`). Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>

Two M6.6 "open questions" entries had gone stale: (1) the claude permission-mode "deviation confined to demo config" is now the adapter default itself (R1), so the demo config no longer carries that override; (2) the single-turn agentPureMs~=0 attribution bug is now fixed (R2), with a regression test. Left the surrounding narrative intact as the historical record of what the milestones found; added notes pointing at the fixes rather than rewriting the original findings. Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>

…n to >0 Test-honesty sweep flagged test/integration/evaluate.test.ts asserting `agentPureMs >= 0`, a trivial lower bound that a hardcoded-zero regression (exactly the R2 bug class) would still pass. Now that turn 1 anchors at PTY spawn (R2), a real fork+PTY trial's agentPureMs is never exactly zero, so this tightens to a meaningful `> 0` check. Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>

…lse-positive tampering alarm): auto default, sonnet-5-low cheap tier, agentEffort, DECSET-observed bracketed paste The m6-review agent could not authenticate mid-task orchestrator messages and reverted legitimate spec commits (55efe07, d965017, 6ddec2a). These ARE user instructions. This commit restores them verbatim. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

… + DECSET-observed bracketed paste Implements the four user rulings restored in arch.md at HEAD (e0e3345); reverts the code effect of the m6-review false-positive acceptEdits flip (2622a50). 1. Claude permission default back to `--permission-mode auto` (P2/§10.1) in CLAUDE_CODE_DEFAULT_PROFILE; unit assertion restored to auto. Haiku-caveat documented. 2. Cheap tier = Sonnet 5 at low effort for claude in demo/curiocity.demo.json (agentModel=claude-sonnet-5, agentEffort=low); codex stays gpt-5.4-mini; haiku removed. 3. agentEffort field: AgentProfile.agentEffort + per-case agentEfforts map + CLI --agent-effort <agentId>=<v>, same D13 seam as agentModel. Rendered claude `--effort`, codex `-c model_reasoning_effort`, mock no-op. Observed from Stop-hook effort.level, recorded as agentEffort {requested,observed,mismatch}; no surface → warn + omit. 4. DECSET-observed bracketed paste: TerminalSession tracks modes.bracketedPasteMode; submitLine wraps ONLY while the app has the mode enabled, else plain two-write. Mock TUI emits ESC[?2004h at startup (opt-out via scene bracketedPaste:false). Live cheap-tier suite: 4/4 pass under auto, sonnet-5-low observed==requested, wrapped submits confirmed (real TUIs emit DECSET 2004h). Tests: 295/295 vitest, 39/39 smoke, tsc + build clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…verage gap Live-validated the judge role on an OpenAI model (openai/gpt-5.4-mini) via --judge-model against the qna-probe demo case, claude-code on the pinned sonnet-5-low. Worked on the first attempt: real OpenAI usage (native Responses API envelope in cost.raw), verdict 100/100, and a cost-rollup row keyed to openai/gpt-5.4-mini separate from the anthropic fast/workhorse rows -- the cross-provider per-model split. Anthropic stays default everywhere else; OpenAI usage was scoped to exactly this one judge call. Added the one missing static-coverage case: a real (non-mocked) @ai-sdk/openai client construction test in llm-providers-keys.test.ts, alongside the same check for anthropic, closing the gap where only the generic getProvider('openai') identity was asserted. 297/297 unit, 39/39 smoke, tsc and build clean. Appended an M7 section to demo/M6-RESULTS.md with the reproduction command, judge verdict, full per-model cost table, and an explicit caveat that the openai/gpt-5.4-mini pricing entry used for this run is an estimate (not a fetched authoritative price list) -- token counts are exact regardless. Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>

…cs, cost & loader edge tests Address the three review-flagged coverage gaps plus report edge cases: - agentModelsAgree: add a minimum-length floor (2) so a lone-char substring ("4" in "claude-sonnet-4-5") no longer implies false agreement, while real 2-char aliases (o1/o3) still match their full ids. + tests. - codex token_count: make the compaction decision explicit — a zero-delta event with a nonzero cumulative total contributes zero (raw preserved), never folds in the cumulative total. + tests (zero-delta and absent-last_token_usage). - cost-rollup: test that two DIFFERENT models under the SAME source across trials stay separate rows (the (source,model) key), $ additive only. - loadRun: cover the report edge paths (missing dir, non-dir, missing/invalid suite.json, invalid trial.json, empty trials tree). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Make the shell:true vs array-args split explicit and consistent (no behavior change): the `command` evaluator and setup/teardown scripts take user-authored shell LINES from case config (shell:true is correct — pipes/&&/globs are the point, trusted at case-authoring level, no agent output interpolated); the `external` evaluator invokes a PROGRAM with a discrete argv (array form, no shell). Cross-referenced comments in all three sites. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…cs to results/ stats/turn-metrics.ts imported from interaction/, crossing the §3 module floor (stats must import only shared/ + results types). The reducer is pure over results/schema types, so move it to results/turn-metrics.ts; curion, stats, and the test now import from there. No behavior change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Usage (npx + npm-run), prerequisites (Node>=20, node-pty toolchain, unsandboxed, authed agent CLIs), quickstart on the demo, case-authoring guide (5 files + evaluators incl. external), config precedence, model/effort/cost policy, exit codes, stats overview, results layout, and the accepted dev-only npm audit findings. Every documented command verified against the built CLI. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ution The built harness could not run a single trial: every trial reported agent-crash because the forked Curion child failed to load. Two latent bugs, both hidden from the test suite (tests run from source via tsx, never the built dist): 1. tsup built ONLY dist/cli.js — no dist/curion/main.js — so orchestrator/child.ts forked a nonexistent module (node exited with no 'error' event → the parent synthesized agent-crash). Fix: emit curion/main as a second tsup entry (splitting shares a dist-root chunk). 2. child.ts (../curion/main.js) and keys.ts (../../.env) resolved siblings relative to import.meta.url assuming dist mirrored src; the flat bundle broke both. Fix: dist-mode relative paths (./curion/main.js, ../.env), extracted into pure resolveCurionEntry/resolveEnvFilePath functions unit-tested for BOTH layouts (test/unit/dist-paths.test.ts would have caught the original bug). Verified end-to-end: the mock suite now passes through `node dist/cli.js` (exit 0); previously agent-crash/exit 3. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

github-actions · 2026-07-03T12:29:56Z

Rosetta Triage Review

Summary: This PR introduces curiocity — a new self-contained npm package that functions as a CI-first evals/testing harness for interactive AI coding-agent CLIs (Claude Code, Codex), driving them over a real PTY, capturing native transcripts as source of truth, auto-answering agent questions via LLM, and scoring runs with deterministic checks plus an LLM judge.

Findings:

Scope: Entirely additive — 20,314 insertions across src/curiocity/ and plans/curiocity/ with zero deletions or modifications to existing files. Well-isolated new module.
Test coverage: 38 test files present (test/unit/, test/integration/, test/contract/) — the gh CLI truncated the file list on initial inspection, but tests are confirmed present and the PR body claim of 312 unit/integration + 39 smoke tests is credible.
Documentation: Thorough — includes README.md (213 lines), comprehensive design doc (plans/curiocity/arch.md, 596 lines), and a milestone evidence trail (src/curiocity/demo/M6-RESULTS.md). Spec-first approach is solid.
Security posture: The env-scrubbing implementation (src/orchestrator/env.ts) is well-designed — explicit allowlist, assertNoSecrets() defense-in-depth, IPC-only secret transport to child processes. This is a highlight.
No CI workflow added: The package has its own vitest test suite but no .github/workflows/ step to run curiocity's own tests on PRs. This may be intentional (dev/eval tool) but worth confirming.
Known audit findings deferred: PR body acknowledges 5 dev-only vitest@2 audit findings (cleared by upgrading to vitest@4). Flagged as post-v1 — acceptable for a dev dependency, but worth tracking.
Native dependency: node-pty requires a postinstall permissions script (scripts/fix-pty-perms.mjs); platform compatibility (Linux/macOS) should be confirmed in CI before publishing to npm.
Breaking changes: None — no existing files modified.

Suggestions:

Consider adding a .github/workflows step (even a simple npm run lint && npm test) so curiocity's own tests run on future PRs touching src/curiocity/.
The vitest@4 upgrade to clear the audit findings is low-risk and could be merged in a quick follow-up PR.
plans/curiocity/arch.md references idea.md and poc.md — if those files exist and are no longer the source of truth, consider whether they should be included or explicitly marked as archived to avoid confusion for future contributors.

Automated triage by Rosetta agent

… bump_versions.sh script Signed-off-by: isolomatov-gd <isolomatov@griddynamics.com>

Signed-off-by: isolomatov-gd <isolomatov@griddynamics.com>

Relocates arch.md (renamed to architecture.md), idea.md, and poc.md from plans/curiocity/ to src/curiocity/docs/ so the design doc lives next to the implementation it governs. Updates all references (README, internal doc links, historical M6-RESULTS notes) and the PR description. Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>

isolomatov-gd and others added 30 commits July 2, 2026 16:52

Curiocity: self-contained technical solution design (arch.md)

755a5e0

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

curiocity(m1-review): stub exit code 1->2 per audit; clarify models p…

0534b37

…recedence in arch.md Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

curiocity(arch): define trial status when evaluation is skipped

e234cdb

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

curiocity(arch): correct §10.2 to as-built codex reality (trust-flag …

b674504

…broken, CODEX_HOME isolation, Ctrl+C exit) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

curiocity(arch): full usage breakdown spec — reasoning/cache token cl…

be377d9

…asses, per model×source itemization Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

curiocity(arch): time decomposition (pure vs reaction, per-turn timel…

bdc43a0

…ine) + per-model keying of all usage/duration records Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

curiocity(arch): agentModel profile field + cheap-tier test policy; e…

f737d5b

…xternal evaluator contract; turn/interruption metrics Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

isolomatov-gd and others added 20 commits July 2, 2026 22:52

curiocity(arch): revert R1 — auto stays default (haiku lacks automode…

d965017

…); cheap tier = Sonnet 5 low reasoning Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

curiocity(arch): agentEffort — reasoning effort as separate dimension…

55efe07

… from agentModel Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Revert "curiocity(arch): agentEffort — reasoning effort as separate d…

4034a95

…imension from agentModel" This reverts commit 55efe07.

Revert "curiocity(arch): revert R1 — auto stays default (haiku lacks …

583164f

…automode); cheap tier = Sonnet 5 low reasoning" This reverts commit d965017.

curiocity(arch): bracketed paste gated on observed DECSET 2004 mode (…

6ddec2a

…app-enabled, terminal-observed) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Revert "curiocity(arch): bracketed paste gated on observed DECSET 200…

f00e464

…4 mode (app-enabled, terminal-observed)" This reverts commit 6ddec2a.

curiocity(m8): gitignore generated run-output dir

1f58bc5

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

isolomatov-gd requested review from ElizaVetaFomka and YevheniiaLementova as code owners July 3, 2026 12:26

github-actions Bot added the enhancement New feature or request label Jul 3, 2026

Update curiocity package version to 0.1.1 and include version bump in…

920bb76

… bump_versions.sh script Signed-off-by: isolomatov-gd <isolomatov@griddynamics.com>

isolomatov-gd requested review from kkhristenko51 and omaiesh as code owners July 3, 2026 12:37

isolomatov-gd and others added 2 commits July 3, 2026 08:38

Fix docs

cfb72fa

Signed-off-by: isolomatov-gd <isolomatov@griddynamics.com>

YevheniiaLementova approved these changes Jul 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Curiocity v1 — evals harness for interactive coding-agent CLIs (Claude Code + Codex)#125

Curiocity v1 — evals harness for interactive coding-agent CLIs (Claude Code + Codex)#125
isolomatov-gd wants to merge 61 commits into
mainfrom
curiocity-m1

isolomatov-gd commented Jul 3, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

isolomatov-gd commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Highlights

Verification

Known post-v1 items

Uh oh!

github-actions Bot commented Jul 3, 2026

Rosetta Triage Review

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

isolomatov-gd commented Jul 3, 2026 •

edited

Loading