Multi-harness CI eval rig: harness adapters + local-Docker backend + with/without-Lightcone A/B (LCR-131) #154
Open
cailmdaley wants to merge 9 commits into
…dbox auto_stop_interval=0 disabled Daytona's auto-stop on the assumption that teardown() always runs. Any path where it doesn't (unhandled exception, killed process, cancelled CI job) leaks a sandbox that runs forever and burns the compute budget (the LCR-131 blocker). Set a 30-minute idle backstop: an active trial polls every ~10s so it never idles out, while an orphaned sandbox stops itself.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The trial loop hardwired Claude Code via EvalSandbox.exec_claude. Extract a Harness ABC (install / credentials / prepare / invoke) with AgentResult as the normalized return type, and move the Claude logic into ClaudeHarness. The sandbox now exposes exec_async_poll and delegates exec_claude to the harness; ClaudeResult and _parse_claude_output remain as back-compat re-exports. This freezes the interface the codex and pi adapters plug into. Grading is untouched (it reads the filesystem, not the agent). Behaviour-identical — 49/49 eval tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CodexHarness (codex exec --json --dangerously-bypass-approvals-and-sandbox) and PiHarness (pi -p --mode json, with native --skill/--no-skills gating the with/without-skills A/B). Both register in the harness registry; all three (claude, codex, pi) instantiate against the frozen interface. JSON output shapes for codex/pi are parsed tolerantly and get confirmed against real transcripts on the first local run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
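The tolerant parsing mentioned above could look roughly like the sketch below. It is an illustration under assumptions, not the PR's parser: the function names `parse_jsonl_tolerant` and `last_text` and the candidate keys (`text`, `message`, `content`) are hypothetical, since the real codex/pi output shapes are only confirmed against transcripts on the first local run.

```python
import json
from typing import Any


def parse_jsonl_tolerant(stdout: str) -> list[dict[str, Any]]:
    """Collect every line that parses as a JSON object; skip everything else."""
    events: list[dict[str, Any]] = []
    for line in stdout.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # banner text, progress noise, truncated lines
        if isinstance(obj, dict):
            events.append(obj)  # ignore bare arrays, numbers, strings
    return events


def last_text(events: list[dict[str, Any]],
              keys: tuple[str, ...] = ("text", "message", "content")) -> str:
    """Return the last non-empty string found under any candidate key.

    The key names are assumptions for illustration only.
    """
    for event in reversed(events):
        for key in keys:
            value = event.get(key)
            if isinstance(value, str) and value:
                return value
    return ""
```

Skipping unparseable lines rather than raising means a harness that prints a banner or a half-written final line still yields a usable result, at the cost of silently dropping malformed events.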
run_eval now schedules tasks × harnesses × skill_variants × trials. run_trial
selects the backend (daytona | local_docker), instantiates the harness, runs
prepare() (strips the Lightcone scaffold for the bare arm) then invoke(), and
stamps harness/with_skills onto the trial. EvalRunConfig gains harnesses (with
optional per-harness model), skill_variants, backend, and max_turns/
trial_timeout overrides; TrialResult gains harness/with_skills. report adds a
task × harness × {with,without} matrix with the with−without Δ lift — the A/B
headline. Local demo + smoke configs added. Tests updated to patch get_harness.
49/49 green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
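The scheduling and the with−without Δ lift described in this commit can be sketched as follows. This is a minimal sketch, not the PR's code: `TrialSpec`, `schedule`, and `lift_matrix` are hypothetical names, and averaging scores per arm is an assumption about how trials are aggregated.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import product
from statistics import mean


@dataclass(frozen=True)
class TrialSpec:
    # Hypothetical record; the PR stamps harness/with_skills onto TrialResult.
    task: str
    harness: str
    with_skills: bool
    trial: int


def schedule(tasks, harnesses, skill_variants, trials):
    """Expand tasks x harnesses x skill_variants x trials into trial specs."""
    return [
        TrialSpec(task, harness, with_skills, i)
        for task, harness, with_skills, i in product(
            tasks, harnesses, skill_variants, range(trials)
        )
    ]


def lift_matrix(results):
    """Map (task, harness) -> (with_mean, without_mean, delta).

    `results` is an iterable of (TrialSpec, score) pairs. Cells where
    either arm has no trials are omitted, because the delta is undefined.
    """
    buckets = defaultdict(lambda: {True: [], False: []})
    for spec, score in results:
        buckets[(spec.task, spec.harness)][spec.with_skills].append(score)

    matrix = {}
    for key, arms in buckets.items():
        if not arms[True] or not arms[False]:
            continue
        with_mean, without_mean = mean(arms[True]), mean(arms[False])
        matrix[key] = (with_mean, without_mean, with_mean - without_mean)
    return matrix
```

For example, `schedule(["snae"], ["claude", "codex", "pi"], [True, False], 1)` yields six trial specs, and if every one of them scores 1.0 the delta in each cell is 0.0.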
Sandbox ABC + ExecuteResult (backends/base.py) and LocalDockerSandbox: builds a cached per-harness image (python:3.12-slim + Node 22 via NodeSource + claude/codex/pi) and runs each trial in a throwaway container. Auth is copied in from the host (~/.claude.json, ~/.codex/auth.json, ~/.pi/agent/*) and chowned to evaluser — host 0600 files won't bind-mount readable. exec is a blocking docker exec; setup() mirrors EvalSandbox (wheel install, lc init, seed overlay, prompt staging, git seed). Node 22 (not Debian's 20) because pi requires >=22.19. run_trial passes eval-metadata env explicitly since this backend forwards only caller-supplied env_vars. Verified in-container: claude 2.1.177, codex-cli 0.139.0, pi 0.79.3. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ke timeout
From the first local-Docker smoke (see first-smoke-findings fiber):
- add jq to the image (claude's session-start hook calls it)
- codex off the spark model: spark needs API-key auth and is rejected on a ChatGPT-account login; use codex's default model
- local-smoke trial_timeout 900->180: pi/codex have no max-turns flag, so a short timeout is what bounds them (pi otherwise runs to the full timeout)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- LocalDockerSandbox no longer copies host ~/.claude.json into trial containers: claude authenticates via CLAUDE_CODE_OAUTH_TOKEN (forwarded env) on the image's onboarding file. The host file lacks the token on macOS (keychain) and would drag MCP/project state into every container. Verified in-container.
- local-demo: remove the turn cap (agents run to completion), add a 30-min trial_timeout safety ceiling, concurrency 3.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The seed validated against an older/different spec and failed `astra validate`
(top-level `description`, `recipe.inputs` rejected) — so `spec_valid` was a
constant 0. Rewritten to current conventions via the /lightcone skill +
astra.md reference, preserving the science: Union2.1 -> flat LCDM MAP fit,
same three outputs (best_fit, hubble_diagram, residuals) with recipe commands,
all decisions. `description` -> narrative.summary (+ required narrative
sections); `recipe.inputs` -> Output.inputs data flow; added required id: snae;
decisions wired into recipe commands as {decisions.x} flags. Validates clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
❌ Eval Results
Graders: No grader results (Full output)
Deploying with

| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
|---|---|---|---|---|
| ✅ Deployment successful! View logs | lightcone-cli | b7fd9e4 | Commit Preview URL, Branch Preview URL | Jun 14 2026, 01:19 AM |
…5.4-mini codex Spark was retired; instead codex and pi both run gpt-5.4-mini and claude runs haiku, for a comparable cheap tier (and an effectively harder task than the prior defaults). pi reaches gpt-5.4-mini via GitHub Copilot using its native provider/model string (github-copilot/gpt-5.4-mini); pi's auth carries the copilot provider, so it resolves in-container.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
What this is
LCR-131: extend the per-PR eval (`lc eval`, the supernova/`snae` rig) so a skill/plugin change gets graded across multiple agent harnesses, and so we can measure whether the Lightcone layer itself helps an agent. This PR delivers the rig and a first end-to-end run.

It's a local-Docker demo, not yet wired into CI (GitHub Actions + org-owned keys deferred; see open questions).
What's built
The one place Claude Code was hardwired (`EvalSandbox.exec_claude`) is now a clean seam:

- `harnesses/`: a `Harness` ABC (`install` · `credentials` · `prepare` · `invoke`) + a normalized `AgentResult`, with claude / codex / pi adapters and a registry. Grading is untouched: it reads the materialized filesystem (`lc status`, `astra validate`), never the agent, so it stays harness-agnostic.
- `backends/`: a `Sandbox` ABC + `LocalDockerSandbox`, which runs one throwaway container per trial from a cached per-harness image (python 3.12 + Node 22 + claude/codex/pi). Auth is copied in from the host (claude via `CLAUDE_CODE_OAUTH_TOKEN`, codex via `~/.codex`, pi via `~/.pi`). The existing Daytona backend stays for the eventual CI path.
- `run_eval` schedules `tasks × harnesses × skill_variants × trials`; the report grows a task × harness × {with, without}-Lightcone scorecard with a Δ-lift column (the with−without delta). Per-harness model is configurable.
- The `auto_stop_interval` backstop (the budget-leak bug named in the issue: a skipped teardown could otherwise leak a forever-running sandbox); and the `snae` seed `astra.yaml` conformed to astra-spec 0.0.10 (it had drifted to fields the current schema rejects and failed `astra validate`).

49/49 eval unit tests green.
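The harness seam described above can be pictured with the sketch below. It is a hedged illustration, not the PR's implementation: only the names `Harness`, `AgentResult`, `get_harness`, the four method names, and the `BUILD_COMPLETE` marker come from this PR. The method signatures, the `AgentResult` fields, the `register` decorator, and the `EchoHarness` stand-in are all assumptions made for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class AgentResult:
    # Hypothetical fields; the PR only says AgentResult is the normalized return type.
    exit_code: int
    output: str
    completed: bool


class Harness(ABC):
    """Interface an agent adapter implements (signatures are assumed)."""

    name: str

    @abstractmethod
    def install(self) -> list[str]:
        """Shell commands that install the agent CLI into the image."""

    @abstractmethod
    def credentials(self) -> dict[str, str]:
        """Environment variables the agent needs to authenticate."""

    @abstractmethod
    def prepare(self, workdir: str, with_skills: bool) -> None:
        """Set up the workspace for the with-skills or bare arm."""

    @abstractmethod
    def invoke(self, prompt: str) -> AgentResult:
        """Run the agent and normalize its output."""


_REGISTRY: dict[str, type[Harness]] = {}


def register(cls: type[Harness]) -> type[Harness]:
    """Class decorator that adds an adapter to the registry under its name."""
    _REGISTRY[cls.name] = cls
    return cls


def get_harness(name: str) -> Harness:
    """Instantiate a registered adapter, failing loudly on unknown names."""
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise ValueError(f"unknown harness {name!r}; known: {sorted(_REGISTRY)}") from None


@register
class EchoHarness(Harness):
    """Stand-in adapter that echoes the prompt; it calls no real agent CLI."""

    name = "echo"

    def install(self) -> list[str]:
        return []

    def credentials(self) -> dict[str, str]:
        return {}

    def prepare(self, workdir: str, with_skills: bool) -> None:
        self.with_skills = with_skills

    def invoke(self, prompt: str) -> AgentResult:
        return AgentResult(0, prompt, "BUILD_COMPLETE" in prompt)
```

Because the trial loop only sees `Harness` and `AgentResult`, an adapter that detects completion differently (as pi does with the `BUILD_COMPLETE` marker) can handle that inside its own `invoke` without touching grading.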
First results
`{claude, codex, pi} × {with, without skills}` on `snae` (fit Union2.1 SNe Ia to flat ΛCDM via MAP), real agents in local Docker, on a deliberately comparable, cheap model tier (claude `haiku`, codex `gpt-5.4-mini`, pi `gpt-5.4-mini` via GitHub Copilot):

The rig works end to end across all three harnesses: 6/6 trials, both arms, fully built the analysis (wrote the fit + plot scripts, ran them through `lc run`, materialized all three outputs) and passed `spec_valid`. claude authenticates via an OAuth token, codex via ChatGPT-account auth, pi via a GitHub Copilot subscription, all inside throwaway local containers.

The score-lift is +0.00: `snae` saturates, even on this weaker/cheaper tier. Every arm hits 1.00. And the efficiency signal we hoped might carry the lift does not robustly replicate: durations are close and mixed (claude with/without 328 / 370 s, codex 363 / 442 s, pi 596 / 559 s, with pi reversed), which is noise at n=1, not a clean with-beats-without ordering. (An earlier run on the harnesses' default models showed a dramatic pi efficiency gap, but that was a weak default model floundering without guidance, not a stable effect.)

Conclusion: `snae` can't discriminate the Lightcone layer's value; it's too easy. The rig is sound; the task set is the work.

* pi reads 1.00 / `pass 0%` because it builds everything correctly but doesn't emit the `BUILD_COMPLETE` marker the way Claude Code does. Its result is real; the completion flag is a per-harness detection nuance.

Open questions (this is where feedback helps most)
🤖 Generated with Claude Code
— Claude, on behalf of Cail