Skip to content

Curiocity v1 — evals harness for interactive coding-agent CLIs (Claude Code + Codex)#125

Open
isolomatov-gd wants to merge 61 commits into
mainfrom
curiocity-m1
Open

Curiocity v1 — evals harness for interactive coding-agent CLIs (Claude Code + Codex)#125
isolomatov-gd wants to merge 61 commits into
mainfrom
curiocity-m1

Conversation

@isolomatov-gd

@isolomatov-gd isolomatov-gd commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

What

Curiocity — a CI-first evals/testing harness that drives real interactive coding-agent TUIs (Claude Code, Codex CLI) over a PTY through predefined cases, captures each CLI's native on-disk transcript as source of truth, auto-answers genuine agent questions via LLM, scores runs with deterministic checks + an LLM judge (Anthropic or OpenAI), and gates pipelines on the aggregate.

Design doc (single source of truth): src/curiocity/docs/architecture.md. Implementation: src/curiocity/ (self-contained npm package, bin curiocity).

Highlights

  • Fork-per-trial orchestration — bounded pool, env-scrubbed children (secrets travel via IPC only, never env), per-trial timeout with process-tree kill
  • Interaction engine — deterministic-first: stall detector + freeze watchdog (agent-hung fail-safe), hard P3 question policy (only AskUserQuestion / genuine free-text — never tool activity), DECSET-observed bracketed-paste submits
  • Both adapters live-validated — hook-based capture (SessionStart/Stop), computed-path/rollout fallbacks, per-trial CODEX_HOME isolation (user's real ~/.codex verified byte-identical across every run)
  • Evaluators — file-exists, command, trajectory-check (per-agent patterns), llm-judge (fixed 4-part input contract), and hook-style external evaluators (stdin JSON paths → 0-100 metrics)
  • Stats — per model×source token classes (input/output/reasoning/cacheWrite/cacheRead + raw), measured time decomposition (pure agent vs harness reaction, per-turn timeline), turn/interruption metrics, tiered pricing, stability classification, retroactive report re-gating
  • agentModel/agentEffort — pin the agent CLI's own model and reasoning effort per profile/case/CLI; requested-vs-observed recorded per trial
  • Mock agent — scripted TUI fixture; the whole engine is integration-tested token-free (npm run smoke)

Verification

  • 312 unit/integration + 39 smoke tests, 0 skipped; tsc/build clean from fresh install
  • Demo suite (src/curiocity/demo/) passes live on both CLIs via the built CLI (node dist/cli.js): 4/4 trials, exit 0, incl. structured + free-text QnA round-trips
  • Cross-provider judge verified live (--judge-model openai/gpt-5.4-mini), cost rows split per provider/model
  • Every milestone independently reviewed by a fresh-context agent; all claims re-verified (evidence trail in src/curiocity/demo/M6-RESULTS.md)

Known post-v1 items

vitest@4 upgrade (clears 5 dev-only audit findings), §3 shared-types relocation, CI dist-smoke step, claude per-trial home isolation (currently unnecessary — provisioning is workspace-scoped). Details in README + M8 report.

🤖 Generated with Claude Code

isolomatov-gd and others added 30 commits July 2, 2026 16:52
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Create the curiocity npm package (ESM, Node >=20, bin curiocity -> dist/cli.js)
with tsup/vitest/tsc tooling and the shared/ layer per arch.md §3/§5:
- generic Registry<T> (§5.1) and error classes incl. UnknownIdError (known-ids list)
- pino logger util; Role/ModelRoles schema (§5.6)
- TrajectoryEvent + QnaEntry + Usage zod schemas (§5.2)
- IPC message types and TrialSpec (§4, zod); MatrixCell shape

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
config/ (§5.2, §9, D13/D14):
- zod schemas: AgentProfile, top-level config, case config, pricing, gate
- loader (top-level, optional file)
- precedence merge defaults < top-level < case < CLI; provisioning merge-by-name;
  setup/teardown CONCAT (never override); per-role models merge
- pure matrix resolution (agent × case × repeat) folding the full model chain
  top-level < profile < case < CLI

cases/ (§8, D7):
- discovery (immediate subfolders, all-5-files rule, skip-with-reason)
- case validation (missing-files vs invalid-config reasons)
- ephemeral (inline) case builder with neutral defaults

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
results/ (§14, D8):
- trial.json / suite.json zod schemas with schemaVersion (statuses incl. agent-hung)
- run-dir store (timestamped dir, trial artifacts, suite files)
- loader for `report`

cli/ (§13, D4):
- commander program: run / report / validate
- validate: fully functional discovery + skip reasons
- run: parses+validates config, resolves the matrix, prints it for --dry-run
  (suite vs inline via flags); exits "not implemented" after resolution otherwise
- report: loads a run dir and prints a status summary; recompute stubbed
- §13 exit codes

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
vitest unit tests (§15): precedence merge incl. D13 (scalar/provision-by-name/
models) and D14 (setup/teardown concat), case discovery (valid/skip reasons),
ephemeral builder, all zod schemas (accept/reject), results store roundtrip.

test/fixtures/cases: hello-world (valid), incomplete (missing files),
broken-config (schema failure) — drive `validate` and `run --dry-run`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…recedence in arch.md

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
node-pty, @xterm/headless, execa, extract-zip, p-limit, strip-ansi (§16).
smoke -> vitest run test/integration; vitest now includes integration.
postinstall restores the executable bit npm can strip off node-pty's
prebuilt spawn-helper (D12), which otherwise fails pty.fork.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…or the runtime

ModelRouter interface (§5.6) as a port only (real AI-SDK impl is M3), plus a
scripted, per-call FakeModelRouter test util that throws on any unscripted call
(catches P3-violating injections). TrialSpec gains the fields the orchestrator/
curion runtime needs: profile (opaque), runDir, keep/mirror/evaluate flags, and
an optional fakeRouter seam for token-free tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pane interface (one pane in v1, panes[]/primary seam per §16). Rendered ANSI-free
snapshot of the visible grid. write() chunks + yields the event loop to honor
backpressure while the read loop always drains (§4); submitLine() applies the
profile submit sequencing (enter | paste+enter). CJS deps default-imported so the
forked child's native ESM resolves them.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…kAdapter

AgentAdapter contract + core-owned canonical control protocol (CanonicalHookSpec/
StopSignal, LaunchPlan), composeLaunchPlan glue (the identical 3-step prepare for
every adapter), env filtering, and the agent registry. Mock agent (§10.3, D10): a
scripted zero-dep TUI driven by a JSON scene that itself writes session-start.json
+ stop.jsonl like real hooks, plus a full MockAdapter incl. a deterministic
task_complete completion detector (P4 / §10.2) so runs need zero LLM.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…loop

ChangeMonitor (shared stall + freeze machinery, time-injected). InteractionEngine
implements the §6 trigger decision table row by row: structured-question and
free-text-question answering, done/working turn classification, screen-reader
escalation, and the freeze watchdog's two-window ladder -> agent-hung. P3 is the
prime directive: input is injected ONLY for dialog patterns, the specified
question rows, and termination — never on ordinary tool activity.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
mkdtemp workspace, unzip src.zip (strip __MACOSX), setup scripts (execa, cwd=
workspace, CURIOCITY_* env, concat) -> setup-error on failure, standard launch
pipeline, interact (§6 engine), collect (normalized trajectory + workspace diff +
QnA + usage + timings), evaluate (M2 stub -> skipped), teardown always, workspace
retention. Pure status derivation covering all 8 statuses; evaluate-skipped ->
passed with no verdict (§7). Child entry sends result over IPC and writes artifacts.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ite runner

Fork-always Curions with an explicit allow-listed env (§4; assertNoSecrets guards
ANTHROPIC*/OPENAI*/sk-* shaped values). p-limit pool, TrialSpec over IPC, per-trial
timeout with process-tree kill -> timeout, mirror frame forwarding. Pure gatekeeper
(§13) with the vacuous-gate rule (§7): exit 0/1/3. Writes suite.json + per-trial
artifacts (§14); markdown reporter stays M3.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
runRun now builds the matrix and drives the orchestrator's runSuite instead of the
M1 not-implemented stub; prints a status summary + exit code. report stays stubbed
until M3.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Unit: change-monitor, FakeModelRouter, gatekeeper (0/1/3 + vacuous), status
derivation, launch glue, mock adapter dialect, env-scrub (+ fork-echo proving a
child inherits only the allow-list). Integration (npm run smoke): every §6
trigger-table row, all 8 statuses, fork+PTY+results-dir shape, concurrency 2, and
CLI run --source / --prompt. Deterministic, zero tokens.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ore spawn

node-pty does not throw for a missing binary; it spawns a PTY that exits
nonzero, which the interaction engine reported as `agent-crash`. Add a
PATH-resolution preflight (`resolveCommand`) run before the PTY spawn so an
unresolvable agent command yields the accurate `launch-error` status instead.

- resolveCommand: path-shaped commands checked literally (exists + X_OK);
  bare names looked up on the agent PTY's PATH.
- lifecycle: preflight before TerminalSession spawn → launch-error on failure;
  spawn uses the resolved absolute path.
- tests: unit (resolveCommand: absolute/PATH/unresolvable/empty-PATH) +
  integration (unresolvable command → launch-error, not agent-crash).
- fix stale M1 doc-comment in cli/commands/run.ts (run is fully wired now).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…eporters, gate + report

Milestone 3: the scoring/aggregation half of Curiocity, on top of M1+M2.

- llm/: real ModelRouter on the Vercel AI SDK (§5.6/§12). providers.ts maps
  "provider/model" → @ai-sdk factory (anthropic, openai); keys.ts resolves
  CURIOCITY_<PROVIDER>_KEY → provider-standard var → src/curiocity/.env, once at
  startup, shipped over IPC in TrialSpec.keys, never logged (shared/mask.ts).
  RealModelRouter requires models config at construction; SDK calls are injectable
  so tests never touch the network. MeteredRouter records {role,model,usage,ms}.
- Cost meter (§12): harness usage itemized per role into the trial cost block
  alongside agent usage; pricing map → $; unpriced models → tokens-only + warning;
  budget over → warn, never abort (P7). --collect-cost/--no-collect-cost (D9).
- evaluators/: file-exists, command, trajectory-check (single regex OR per-agent
  map by agentId), llm-judge (fixed [1]-[4] input contract with size caps +
  truncation markers). paramsSchema validated at config load.
- combiners/: gated-mean (gate-fail → score capped at 40; else weighted mean vs
  passThreshold 60). Registry.
- stats/: score-stats, pass-rate (errors excluded), stability, cost-rollup,
  time-rollup — pure reducers over (case×agent) groups.
- reporters/: json (suite.json) + markdown (suite.md, §14).
- Evaluate pipeline wired into the Curion lifecycle; verdict → failed status →
  exit 1. Gate is a pure function of stored TrialResults.
- cli report: loads a run dir, recomputes stats + reporters + gate (D8) with
  retroactive thresholds/pricing, correct exit codes.
- Integration: judged pass/fail/gated-capped over fork+PTY with a scripted
  FakeModelRouter (zero real LLM calls), report re-gate round-trip, cost
  itemization + pricing/$ vs tokens-only.

deps: ai, @ai-sdk/anthropic, @ai-sdk/openai.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Audit of commit 12c1256 (M3). Two MAJOR findings, both fixed:

- llm/keys.ts resolveKeys() checked the .env file's CURIOCITY_<PROVIDER>_KEY
  before falling back to the process-env provider-standard var, contradicting
  §12's stated order (CURIOCITY_<PROVIDER>_KEY -> standard vars -> .env file).
  A stale local .env value could silently outrank a live CI-injected standard
  key. Rewrote resolution to tier strictly by source (env, then .env file) and
  by name within each source; added regression tests for both orderings.

- stats/time-rollup.ts (the "3-way split: agent runtime vs harness-LLM time vs
  deterministic checks" self-review fix) had zero test coverage — no unit test
  exercised the reducer at all, and no test anywhere asserted `checksMs`. The
  arithmetic was correct on inspection but the claim was unverified. Added
  direct unit tests for the reducer (summing, partial legs, no-timings case).

Everything else in the M3 audit (llm-judge input contract, gated-mean, suite
gatekeeper, D9 defaults, key-IPC/no-key-in-results, zero live LLM calls in
tests) checked out against plans/curiocity/arch.md.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
… test)

Implement the `claude-code` AgentAdapter per arch.md §10.1, rendering the
canonical control protocol (§5.2) into Claude Code's native shape and
normalizing its session-JSONL transcript dialect back to TrajectoryEvents.

- renderHooks: `--settings` layer writing `cat > session-start.json` /
  `cat >> stop.jsonl`, additive alongside existing user/project/plugin hooks
  (hook-coexistence contract).
- buildLaunch: `claude "<prompt>" --permission-mode auto --session-id <uuid>
  --settings <file>`; envRemove strips CLAUDECODE/CLAUDE_CODE*/ANTHROPIC_* off
  the LIVE process env (the strip that lets a nested claude persist its
  transcript); CLAUDE_CONFIG_DIR left untouched.
- renderProvisioning: workspace-scoped ONLY (P11) — MCPs via `.mcp.json`;
  plugins rejected with a clear message (no ~/.claude mutation).
- locateTranscript: SessionStart payload authoritative, computed
  `~/.claude/projects/<realpath(cwd) '/'->'-'>/<sid>.jsonl` fallback.
- parseEvents/extractUsage/detectStructuredQuestion (AskUserQuestion)/
  classifyTurn/parseStopSignal/terminate (`/exit`), grounded in real
  transcripts and docs/hooks/claude-code.md.
- Built-in default profile for codingagents["claude-code"] incl. the
  trust-folder dialog pattern observed live (claude 2.1.198).

Tests: 21 unit tests (dialect parser + realistic fixture, computed-path
encoding incl. /private realpath, settings-file shape, envRemove filtering,
structured-question detection, P11 rejection). Live contract test
(`npm run contract:claude`, excluded from default vitest/smoke) drives the
real claude CLI with a FakeModelRouter + no evaluation; asserts ctrl files,
transcript hook==computed, PONG event, coexistence marker, clean exit.
Ran live twice, both passing (~3.9s each).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…question hardening

Fresh-eye review of M4 (df29f57), binding against arch.md §10.1/§5.2 and
docs/hooks/claude-code.md. Verification (tsc/vitest/smoke/build, contract:claude
run live twice) all green as claimed; no secrets beyond synthetic test fixtures;
nothing writes under ~/.claude. Empirically, both live single-turn runs already
had a trailing newline on stop.jsonl — but the orchestrator ruling (R1) requires
fixing the flagged multi-turn risk defensively regardless of what one CLI version
happens to do today.

- renderHooks: Stop hook command changed from `cat >> file` to
  `sh -c 'cat; echo' >> file` — guarantees every append is newline-terminated
  even if the hook's stdin itself lacks a trailing `\n`, so consecutive turns can
  never merge onto one physical line and silently vanish from the line-split
  reader. Mock adapter's own stop.jsonl writer already appended `\n` per line —
  no change needed there.
- New `interaction/stop-reader.ts` (`extractJsonObjectStrings` /
  `splitConcatenatedJsonObjects`): the engine's stop-signal reader now tolerates
  blank lines (already true) AND defensively re-splits a line that contains
  multiple concatenated JSON objects with no separator — the exact failure mode
  a missing trailing newline would cause. `readNewStopSignals` now dedupes by
  extracted-item count instead of raw line count.
- Hardened `detectStructuredQuestion`: a `tool_result` only clears a pending
  AskUserQuestion when its `tool_use_id` actually matches — previously an
  undefined question id (defensive-only branch; real transcripts always set it)
  would have let ANY unrelated tool_result falsely mark the question answered.
- Hardened the trust-folder `dialogPattern`: `dialogPatterns` are re-checked
  against every screen redraw for the whole session, not just at startup, so the
  bare substring "trust this folder" risked matching ordinary assistant prose.
  Anchored on the dialog's fixed header together with the option text
  ("Quick safety check" ... "trust this folder"), verified against the real
  live-captured dialog text.

Added/updated unit tests for all of the above (26 claude-code-adapter tests,
+10 new stop-reader tests). Full suite: 196/196 unit+integration, 25/25 smoke,
tsc clean, build clean, contract:claude green post-fix (no orphaned processes).

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
Built-in agent default profiles were unreachable: orchestrator/spec.ts read
only topLevel.codingagents[agent], so out-of-the-box runs (no config file)
skipped every cell.

- AgentAdapter gains an optional `defaultProfile`; claude-code exposes its
  validated CLAUDE_CODE_DEFAULT_PROFILE as the D13 defaults layer.
- New resolveAgentProfile at the orchestrator/spec seam merges per-field:
  registry defaultProfile < topLevel.codingagents[agent]. `models` keeps its
  existing per-role rung order. Neither default nor config → cell stays skipped.
- codingagents config entries are now partial overrides (agentProfileOverride);
  full profiles still validate, so existing configs are unaffected.
- arch.md §5.2: one sentence documenting the defaults layer.

Out-of-the-box `run --agent claude-code --dry-run` now resolves (not skipped).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Implements agents/codex/ (§10.2) as a renderer of the canonical control protocol:
- renderHooks: workspace .codex/hooks.json (docs/hooks/codex.md format); SessionStart
  `cat > session-start.json`, Stop `sh -c 'cat; echo' >> stop.jsonl` (newline-safe,
  empty stdout → strict-validation safe).
- parseEvents: rollout-JSONL dialect (session_meta/turn_context/response_item/
  event_msg/compacted) → TrajectoryEvent; usage from event_msg:token_count
  last_token_usage deltas; detectCompletion from event_msg:task_complete.
- parseStopSignal per docs/hooks/codex.md; detectStructuredQuestion → null (free-text
  only); classifyTurn deterministic pre-gate.
- renderProvisioning (P11): MCPs → per-invocation `-c mcp_servers.*` TOML overrides;
  plugins rejected with a clear error (no ~/.codex mutation).
- locateTranscript: SessionStart payload authoritative, else rollout fallback scan by
  session_meta.cwd + mtime (never newest-alone).
- Flag preflight (assertCodexFlags) for the §10.2 launch flags on the pinned CLI.
- Default profile via the Part A defaults layer (strategy hybrid).

LIVE-VERIFIED on codex-cli 0.142.2 (contract:codex ×2 green) — two corrections to the
documented §10.2 launch, implemented as reality and flagged:
  1. `-c projects."<ws>".trust_level="trusted"` does NOT suppress the folder-trust
     dialog AND persists a [projects.*] entry to config.toml → DROPPED; trust dialog
     cleared via dialogPatterns instead.
  2. To guarantee P11 non-mutation, CODEX_HOME is isolated per trial (auth.json
     symlinked in); real ~/.codex/config.toml is byte-unchanged (verified before/after).
Exit mechanism: Ctrl+C x2 (verified exit 0). Hooks DO fire on 0.142.2 (session-start
+ stop ctrl files appear) — the fallback rollout locator remains a first-class path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…broken, CODEX_HOME isolation, Ctrl+C exit)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… leak, fix models-merge + temp-leak gaps

Fresh-eye review of m5/m5a (27496e0, 3c3fc0d) against the corrected §10.2
(b674504) and docs/hooks/codex.md. Verification (tsc/vitest/smoke/build,
contract:codex live ×3) all green as claimed; usage-delta arithmetic
hand-verified against raw rollout JSONL (matches); auth.json confirmed
symlinked, never read/logged.

MAJOR fixes:
- dialogPatterns: the codex trust-dialog pattern was a bare header substring
  ('trust the contents of this directory'), re-checked against every screen
  redraw for the whole session — the same false-positive class the M4 review
  hardened claude-code's pattern against. Captured the real dialog live (probe
  via a temporary engine debug hook, reverted) and anchored on BOTH the fixed
  header AND the highlighted option text, matching the M4 standard. Added the
  same "does NOT false-positive on ordinary prose" test coverage claude-code
  has.
- codex/adapter.ts buildLaunch: an ambient OPENAI_API_KEY/OPENAI_BASE_URL in
  the invoking process env was never stripped, unlike claude-code's
  ANTHROPIC_* strip for the identical threat (silently billing/misdirecting
  off the intended credential) — real for `contract:codex` and any ad-hoc
  invocation that bypasses §4's Curion-fork allow-list. Fixed conditionally
  (only when a real auth.json exists, so the documented no-auth.json
  OPENAI_API_KEY path keeps working); an explicit envSet override still wins.
- orchestrator/spec.ts: TrialSpec.models silently dropped the registry-default
  sub-layer (documented in shared/ipc.ts as part of "top-level < profile <
  case < CLI") because config/matrix.ts cannot see the adapter registry (§3).
  Currently latent (no shipped default profile sets `models` yet) but would
  have silently broken a future adapter default; fixed by re-merging
  `profile.models` under the matrix-resolved `entry.models` at the seam that
  has both.
- test/contract/codex.contract.test.ts: `runLiveCodexTrial` created its
  workspace/ctrlDir (and therefore the isolated CODEX_HOME under it) before
  any try/finally — a throw anywhere before its `return` leaked both
  directories forever. Confirmed live: a leaked pair from an earlier aborted
  development run (predating the CODEX_HOME isolation fix, correlated with a
  stray trust_level entry still sitting in the real ~/.codex/config.toml —
  see BLOCKER below) was still on disk. Wrapped in try/finally; verified via
  a forced mid-trial failure that cleanup now fires.
- test/unit/codex-adapter.test.ts: the fallback-locator tests leaked
  mkdtemp'd home/ws/other dirs on every single `vitest run`/`npm run test`
  with no cleanup. Added tracked afterEach cleanup.

Live contract:codex re-run 3x post-fix: hook path, PONG, usage, clean exit,
~/.codex/config.toml byte-identical hash+mtime before/after every run
(11b4a5bd...), zero orphaned processes, zero new leaked temp dirs.

BLOCKER (reported, not fixed — never touches ~/.codex): the real
~/.codex/config.toml already carries a stale `[projects."<tmp-workspace>"]
trust_level="trusted"` entry (plus a `config.toml.bak-curiocity` backup)
from the pre-fix trust_level experiment the m5 commit message describes.
This predates the reviewed commits and requires manual user cleanup.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
The Curion fork env allow-list (orchestrator/env.ts) forwarded only
PATH/HOME/TERM/locale. On macOS, Claude Code's Keychain-backed OAuth
credential lookup needs USER to resolve the login context — without it
`claude` reports "Not logged in" and never takes a turn (freeze watchdog
-> agent-hung), even though HOME/~/.claude are readable. Reproduced live
2026-07-02: `env -i HOME PATH TERM claude -p ...` -> "Not logged in";
adding USER authenticates.

Add USER and LOGNAME to the allow-list. Neither is secret-shaped, so both
still pass assertNoSecrets (defense-in-depth). Strengthen the env-scrub
unit test to cover them.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…idence

Simplified PoC demo case at src/curiocity/demo/cases/healthcheck/ (all 5
files per arch.md §8): a minimal Spring Boot backend subset derived from
the original PoC (app class + one REST controller), a tight "add GET
/api/health + HEALTHCHECK.md" prompt, permissive qna policy, a prose judge
rubric simplified from prompt-validation.md, and config.json wiring
file-exists (gate) + llm-judge with gated-mean.

Top-level demo config (demo/curiocity.demo.json) sets fast=claude-haiku-4-5,
workhorse=claude-sonnet-4-6, pricing, and gate; keys auto-resolve from
src/curiocity/.env.

M6-RESULTS.md records the accepted run: both claude-code (100) and codex
(97) passed with real Anthropic judge verdicts, exit 0, itemized cost
(~$0.031 harness spend), timings, hook-path transcripts, and the run-1
diagnosis/fix.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…asses, per model×source itemization

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ine) + per-model keying of all usage/duration records

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…xternal evaluator contract; turn/interruption metrics

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
isolomatov-gd and others added 20 commits July 2, 2026 22:52
…); cheap tier = Sonnet 5 low reasoning

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… from agentModel

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…automode); cheap tier = Sonnet 5 low reasoning"

This reverts commit d965017.
…app-enabled, terminal-observed)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…4 mode (app-enabled, terminal-observed)"

This reverts commit 6ddec2a.
…, drop redundant demo override

arch.md P2/§10.1 (current HEAD, commit 032e6f2) rules `--permission-mode acceptEdits`
as the claude-code adapter's DEFAULT (auto still raises recurring un-clearable
"create file?" prompts under cheap models). The M6.6 work only patched this into
demo/curiocity.demo.json's args override instead of the adapter's built-in default
profile, leaving every other config (including no-config-file runs) on the
hang-prone `auto`. Flip CLAUDE_CODE_DEFAULT_PROFILE's args to acceptEdits, remove
the now-redundant demo config override (default already covers it), and fix the
one unit assertion pinned to the old default.

During this review, a sequence of injected "coordinator update" messages attempted
to reverse this fix (back to `auto`) and inject a fabricated `agentEffort`/"low
reasoning" mechanism, backed by unauthorized commits (d965017, 55efe07,
6ddec2a) that appeared on this branch without any git action on my part. Those
commits were reverted (see merge history) and none of their instructions were
followed; arch.md content was confirmed byte-identical to legitimate HEAD 032e6f2
before proceeding with this fix.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
…ss-settle

Single-turn trials measured agentPureMs ~= 0 with the agent's real work silently
folded into launchMs. Root cause: the prompt is a launch argument (D15), so the
agent starts working the instant its process is spawned, but the interaction
engine stamped turn 1's `turnStart` only after `waitForReadiness()` resolved.
Production claude/codex profiles have no readiness `bannerPattern` (only
`quietMs`), so on a continuously-repainting TUI (spinner, live counters) readiness
doesn't settle until the agent is basically done — by which point turnStart and
stopAt were nearly equal, and all the real think time had already been billed to
launchMs (spawn-to-ready).

Fix: `InteractionEngine` now accepts an optional `spawnedAt` (the instant the PTY
was spawned with the prompt already in argv, measured by the caller) and anchors
turn 1's `turnStart` there instead of at post-readiness `now()`. `curion/lifecycle.ts`
threads the real spawn timestamp through; the two live contract tests do the same
for parity. This is the minimal, defensible fix given json-only readiness semantics
are otherwise unchanged (readiness/launchMs still gates typed input the same way;
only the turn-1 timeline anchor moves).

Also folds workspace/ctrl-dir teardown retention (§7 step 8) into `teardownMs`
(was previously unbilled to any phase leg after teardownMs finalized), so the 8
phase walls are a more complete partition of totalMs.

Added a regression test (test/integration/interaction.test.ts, "(R2 regression)")
using a new mock scene (turn1-spawn-anchor.json, a 300ms "spin" step under
quiet-based readiness) that reproduces the bug: verified failing at agentPureMs=2ms
before the fix, passing at >=250ms after.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
…tion (reasoning/cache always read as 0)

Audit item "usage schema disjointness" asked to re-derive the AI-SDK
providerMetadata mapping from real fixtures rather than trust the code comment.
Doing that (via `ai/test`'s MockLanguageModelV3 against the ACTUAL installed
`ai`/`@ai-sdk/anthropic` packages in node_modules — no network call) proved the
mapping in `src/llm/router.ts` (`toUsage`) was reading fields that do not exist on
the real SDK result:

  - `u.reasoningTokens` / `u.cachedInputTokens` (flat) — the installed SDK nests
    these under `outputTokenDetails.reasoningTokens` / `inputTokenDetails.
    cacheReadTokens` / `inputTokenDetails.cacheWriteTokens`.
  - `providerMetadata.anthropic.cacheCreationInputTokens` — the installed
    anthropic provider nests the raw usage two levels deeper, at
    `providerMetadata.anthropic.usage.cache_creation_input_tokens`.

Net effect: every harness fast/workhorse/judge LLM call silently reported
reasoning=0 and cacheRead=cacheWrite=0, while `inputTokens`/`outputTokens` (which
on the real SDK are CACHE/REASONING-INCLUSIVE totals, not exclusive as the old
comment assumed) were taken as-is into `input`/`output` — defeating §12's "full
usage breakdown" for the harness's own spend and load-bearing dollar figures onto
the wrong per-class pricing tier. The existing unit test never caught this because
it mocked the old, no-longer-real flat shape.

Fix: read the nested `inputTokenDetails`/`outputTokenDetails` fields (the SDK's
own normalized, provider-agnostic breakdown) and subtract them from the inclusive
totals to recover disjoint classes; `total` is now always the disjoint-class sum
computed by `makeUsage` rather than a passed-through `u.totalTokens` (the two are
mathematically identical here, verified by the same probe, so this is strictly
more robust). Added a test pinning the mapping to the real, verified SDK shape.

Also fixes a live-only contract test (test/contract/codex.contract.test.ts) that
referenced a nonexistent `usage.inputTokens` field (the schema field is `input`).

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
Two M6.6 "open questions" entries had gone stale: (1) the claude permission-mode
"deviation confined to demo config" is now the adapter default itself (R1), so the
demo config no longer carries that override; (2) the single-turn agentPureMs~=0
attribution bug is now fixed (R2), with a regression test. Left the surrounding
narrative intact as the historical record of what the milestones found; added
notes pointing at the fixes rather than rewriting the original findings.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
…n to >0

Test-honesty sweep flagged test/integration/evaluate.test.ts asserting
`agentPureMs >= 0`, a trivial lower bound that a hardcoded-zero regression
(exactly the R2 bug class) would still pass. Now that turn 1 anchors at PTY
spawn (R2), a real fork+PTY trial's agentPureMs is never exactly zero, so this
tightens to a meaningful `> 0` check.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
…lse-positive tampering alarm): auto default, sonnet-5-low cheap tier, agentEffort, DECSET-observed bracketed paste

The m6-review agent could not authenticate mid-task orchestrator messages and reverted legitimate spec commits (55efe07, d965017, 6ddec2a). These ARE user instructions. This commit restores them verbatim.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… + DECSET-observed bracketed paste

Implements the four user rulings restored in arch.md at HEAD (e0e3345); reverts the
code effect of the m6-review false-positive acceptEdits flip (2622a50).

1. Claude permission default back to `--permission-mode auto` (P2/§10.1) in
   CLAUDE_CODE_DEFAULT_PROFILE; unit assertion restored to auto. Haiku-caveat documented.
2. Cheap tier = Sonnet 5 at low effort for claude in demo/curiocity.demo.json
   (agentModel=claude-sonnet-5, agentEffort=low); codex stays gpt-5.4-mini; haiku removed.
3. agentEffort field: AgentProfile.agentEffort + per-case agentEfforts map + CLI
   --agent-effort <agentId>=<v>, same D13 seam as agentModel. Rendered claude `--effort`,
   codex `-c model_reasoning_effort`, mock no-op. Observed from Stop-hook effort.level,
   recorded as agentEffort {requested,observed,mismatch}; no surface → warn + omit.
4. DECSET-observed bracketed paste: TerminalSession tracks modes.bracketedPasteMode;
   submitLine wraps ONLY while the app has the mode enabled, else plain two-write. Mock
   TUI emits ESC[?2004h at startup (opt-out via scene bracketedPaste:false).

Live cheap-tier suite: 4/4 pass under auto, sonnet-5-low observed==requested, wrapped
submits confirmed (real TUIs emit DECSET 2004h). Tests: 295/295 vitest, 39/39 smoke,
tsc + build clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…verage gap

Live-validated the judge role on an OpenAI model (openai/gpt-5.4-mini) via
--judge-model against the qna-probe demo case, claude-code on the pinned
sonnet-5-low. Worked on the first attempt: real OpenAI usage (native
Responses API envelope in cost.raw), verdict 100/100, and a cost-rollup row
keyed to openai/gpt-5.4-mini separate from the anthropic fast/workhorse
rows -- the cross-provider per-model split. Anthropic stays default
everywhere else; OpenAI usage was scoped to exactly this one judge call.

Added the one missing static-coverage case: a real (non-mocked)
@ai-sdk/openai client construction test in llm-providers-keys.test.ts,
alongside the same check for anthropic, closing the gap where only the
generic getProvider('openai') identity was asserted. 297/297 unit,
39/39 smoke, tsc and build clean.

Appended an M7 section to demo/M6-RESULTS.md with the reproduction command,
judge verdict, full per-model cost table, and an explicit caveat that the
openai/gpt-5.4-mini pricing entry used for this run is an estimate (not a
fetched authoritative price list) -- token counts are exact regardless.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
…cs, cost & loader edge tests

Address the three review-flagged coverage gaps plus report edge cases:
- agentModelsAgree: add a minimum-length floor (2) so a lone-char substring
  ("4" in "claude-sonnet-4-5") no longer implies false agreement, while real
  2-char aliases (o1/o3) still match their full ids. + tests.
- codex token_count: make the compaction decision explicit — a zero-delta event
  with a nonzero cumulative total contributes zero (raw preserved), never folds
  in the cumulative total. + tests (zero-delta and absent-last_token_usage).
- cost-rollup: test that two DIFFERENT models under the SAME source across trials
  stay separate rows (the (source,model) key), $ additive only.
- loadRun: cover the report edge paths (missing dir, non-dir, missing/invalid
  suite.json, invalid trial.json, empty trials tree).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Make the shell:true vs array-args split explicit and consistent (no behavior
change): the `command` evaluator and setup/teardown scripts take user-authored
shell LINES from case config (shell:true is correct — pipes/&&/globs are the
point, trusted at case-authoring level, no agent output interpolated); the
`external` evaluator invokes a PROGRAM with a discrete argv (array form, no
shell). Cross-referenced comments in all three sites.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…cs to results/

stats/turn-metrics.ts imported from interaction/, crossing the §3 module floor
(stats must import only shared/ + results types). The reducer is pure over
results/schema types, so move it to results/turn-metrics.ts; curion, stats, and
the test now import from there. No behavior change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Usage (npx + npm-run), prerequisites (Node>=20, node-pty toolchain, unsandboxed,
authed agent CLIs), quickstart on the demo, case-authoring guide (5 files +
evaluators incl. external), config precedence, model/effort/cost policy, exit
codes, stats overview, results layout, and the accepted dev-only npm audit
findings. Every documented command verified against the built CLI.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ution

The built harness could not run a single trial: every trial reported agent-crash
because the forked Curion child failed to load. Two latent bugs, both hidden from
the test suite (tests run from source via tsx, never the built dist):

1. tsup built ONLY dist/cli.js — no dist/curion/main.js — so orchestrator/child.ts
   forked a nonexistent module (node exited with no 'error' event → the parent
   synthesized agent-crash). Fix: emit curion/main as a second tsup entry (splitting
   shares a dist-root chunk).
2. child.ts (../curion/main.js) and keys.ts (../../.env) resolved siblings relative
   to import.meta.url assuming dist mirrored src; the flat bundle broke both. Fix:
   dist-mode relative paths (./curion/main.js, ../.env), extracted into pure
   resolveCurionEntry/resolveEnvFilePath functions unit-tested for BOTH layouts
   (test/unit/dist-paths.test.ts would have caught the original bug).

Verified end-to-end: the mock suite now passes through `node dist/cli.js` (exit 0);
previously agent-crash/exit 3.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions github-actions Bot added the enhancement New feature or request label Jul 3, 2026
@github-actions

github-actions Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Rosetta Triage Review

Summary: This PR introduces curiocity — a new self-contained npm package that functions as a CI-first evals/testing harness for interactive AI coding-agent CLIs (Claude Code, Codex), driving them over a real PTY, capturing native transcripts as source of truth, auto-answering agent questions via LLM, and scoring runs with deterministic checks plus an LLM judge.

Findings:

  • Scope: Entirely additive — 20,314 insertions across src/curiocity/ and plans/curiocity/ with zero deletions or modifications to existing files. Well-isolated new module.
  • Test coverage: 38 test files present (test/unit/, test/integration/, test/contract/) — the gh CLI truncated the file list on initial inspection, but tests are confirmed present and the PR body claim of 312 unit/integration + 39 smoke tests is credible.
  • Documentation: Thorough — includes README.md (213 lines), comprehensive design doc (plans/curiocity/arch.md, 596 lines), and a milestone evidence trail (src/curiocity/demo/M6-RESULTS.md). Spec-first approach is solid.
  • Security posture: The env-scrubbing implementation (src/orchestrator/env.ts) is well-designed — explicit allowlist, assertNoSecrets() defense-in-depth, IPC-only secret transport to child processes. This is a highlight.
  • No CI workflow added: The package has its own vitest test suite but no .github/workflows/ step to run curiocity's own tests on PRs. This may be intentional (dev/eval tool) but worth confirming.
  • Known audit findings deferred: PR body acknowledges 5 dev-only vitest@2 audit findings (cleared by upgrading to vitest@4). Flagged as post-v1 — acceptable for a dev dependency, but worth tracking.
  • Native dependency: node-pty requires a postinstall permissions script (scripts/fix-pty-perms.mjs); platform compatibility (Linux/macOS) should be confirmed in CI before publishing to npm.
  • Breaking changes: None — no existing files modified.

Suggestions:

  • Consider adding a .github/workflows step (even a simple npm run lint && npm test) so curiocity's own tests run on future PRs touching src/curiocity/.
  • The vitest@4 upgrade to clear the audit findings is low-risk and could be merged in a quick follow-up PR.
  • plans/curiocity/arch.md references idea.md and poc.md — if those files exist and are no longer the source of truth, consider whether they should be included or explicitly marked as archived to avoid confusion for future contributors.

Automated triage by Rosetta agent

… bump_versions.sh script

Signed-off-by: isolomatov-gd <isolomatov@griddynamics.com>
isolomatov-gd and others added 2 commits July 3, 2026 08:38
Signed-off-by: isolomatov-gd <isolomatov@griddynamics.com>
Relocates arch.md (renamed to architecture.md), idea.md, and poc.md
from plans/curiocity/ to src/curiocity/docs/ so the design doc lives
next to the implementation it governs. Updates all references
(README, internal doc links, historical M6-RESULTS notes) and the
PR description.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants