Multi-harness CI eval rig: harness adapters + local-Docker backend + with/without-Lightcone A/B (LCR-131) by cailmdaley · Pull Request #154 · LightconeResearch/lightcone-cli

cailmdaley · 2026-06-14T00:42:28Z

What this is

LCR-131 — extend the per-PR eval (lc eval, the supernova/snae rig) so a skill/plugin change gets graded across multiple agent harnesses, and so we can measure whether the Lightcone layer itself helps an agent. This PR delivers the rig and a first end-to-end run.

It's a local-Docker demo, not yet wired into CI (GitHub Actions + org-owned keys deferred — see open questions).

What's built

The one place Claude Code was hardwired (EvalSandbox.exec_claude) is now a clean seam:

- harnesses/: a Harness ABC (install · credentials · prepare · invoke) plus a normalized AgentResult, with claude / codex / pi adapters and a registry. Grading is untouched: it reads the materialized filesystem (lc status, astra validate), never the agent, so it stays harness-agnostic.
- backends/: a Sandbox ABC plus LocalDockerSandbox, which runs one throwaway container per trial from a cached per-harness image (python 3.12 + Node 22 + claude/codex/pi). Auth is copied in from the host (claude via CLAUDE_CODE_OAUTH_TOKEN, codex via ~/.codex, pi via ~/.pi). The existing Daytona backend stays for the eventual CI path.
- Matrix run + report: run_eval schedules tasks × harnesses × skill_variants × trials; the report grows a task × harness × {with, without}-Lightcone scorecard with a Δ-lift column (the with−without delta). Per-harness model is configurable.
- Fixes: a Daytona auto_stop_interval backstop (the budget-leak bug named in the issue, where a skipped teardown could otherwise leak a forever-running sandbox); and the snae seed astra.yaml conformed to astra-spec 0.0.10 (it had drifted to fields the current schema rejects and failed astra validate).
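
To make the seam concrete, here is a minimal sketch of what a four-method harness interface with a registry could look like. Only the names Harness, AgentResult, get_harness and the four method names come from this PR; the method signatures, the AgentResult fields, the register decorator, and the EchoHarness stand-in are assumptions made for illustration and may not match the code in harnesses/.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class AgentResult:
    """Normalized outcome of one agent run (field names are illustrative)."""
    text: str
    completed: bool
    cost_usd: float | None = None
    turns: int | None = None


class Harness(ABC):
    """The four-step seam: install, credentials, prepare, invoke."""

    name: str

    @abstractmethod
    def install(self) -> list[str]:
        """Shell commands that install the agent CLI into the image."""

    @abstractmethod
    def credentials(self) -> dict[str, str]:
        """Environment variables to forward into the sandbox."""

    @abstractmethod
    def prepare(self, with_skills: bool) -> list[str]:
        """Commands to run before the agent starts (e.g. strip the scaffold)."""

    @abstractmethod
    def invoke(self, prompt: str) -> AgentResult:
        """Run the agent and normalize its output."""


_REGISTRY: dict[str, type[Harness]] = {}


def register(cls):
    """Class decorator (hypothetical) that adds an adapter to the registry."""
    _REGISTRY[cls.name] = cls
    return cls


def get_harness(name: str) -> Harness:
    """Instantiate a registered adapter by name."""
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise ValueError(f"unknown harness {name!r}; known: {sorted(_REGISTRY)}") from None


@register
class EchoHarness(Harness):
    """Stand-in adapter that only exercises the interface; not a real agent."""
    name = "echo"

    def install(self):
        return []

    def credentials(self):
        return {}

    def prepare(self, with_skills):
        # Placeholder command; the real bare-arm preparation is not shown here.
        return [] if with_skills else ["echo strip-lightcone-scaffold"]

    def invoke(self, prompt):
        return AgentResult(text=prompt, completed="BUILD_COMPLETE" in prompt)
```

Because grading reads the filesystem rather than the AgentResult, an adapter only has to get the agent running and report back in this one shape.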

49/49 eval unit tests green.

First results

{claude, codex, pi} × {with, without skills} on snae (fit Union2.1 SNe Ia to flat ΛCDM via MAP), real agents in local Docker, on a deliberately comparable, cheap model tier — claude haiku, codex gpt-5.4-mini, pi gpt-5.4-mini (via GitHub Copilot):

        Eval Matrix: score by harness × Lightcone layer
  Task   Harness   with skills        without skills     Δ lift
  snae   claude    1.00  pass 100%    1.00  pass 100%    +0.00   ($0.42 / $0.45)
  snae   codex     1.00  pass 100%    1.00  pass 100%    +0.00
  snae   pi        1.00  pass*  0%    1.00  pass*  0%    +0.00
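
The Δ lift column is the mean with-skills score minus the mean without-skills score for each task × harness cell. A minimal sketch of that aggregation follows, assuming each trial is a dict with task, harness, with_skills and score keys; the real TrialResult carries harness and with_skills, but the dict shape and the lift_table helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def lift_table(trials):
    """Return {(task, harness): (with_mean, without_mean, delta)}.

    delta is None when either arm has no trials, so a one-armed run
    shows a gap instead of a misleading number.
    """
    cells = defaultdict(list)
    for t in trials:
        cells[(t["task"], t["harness"], t["with_skills"])].append(t["score"])

    rows = {}
    for task, harness in sorted({(k[0], k[1]) for k in cells}):
        w = cells.get((task, harness, True))
        wo = cells.get((task, harness, False))
        with_mean = mean(w) if w else None
        without_mean = mean(wo) if wo else None
        if with_mean is None or without_mean is None:
            delta = None
        else:
            delta = with_mean - without_mean
        rows[(task, harness)] = (with_mean, without_mean, delta)
    return rows


# The six trials from the table above: every arm scored 1.00.
trials = [
    {"task": "snae", "harness": h, "with_skills": s, "score": 1.0}
    for h in ("claude", "codex", "pi")
    for s in (True, False)
]
for (task, harness), (w, wo, d) in lift_table(trials).items():
    print(f"{task} {harness}: with={w:.2f} without={wo:.2f} lift={d:+.2f}")
```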

The rig works end to end across all three harnesses: all 6 trials (3 harnesses × 2 arms, one trial per cell) fully built the analysis (wrote the fit + plot scripts, ran them through lc run, materialized all three outputs) and passed spec_valid. claude authenticates via an OAuth token, codex via ChatGPT-account auth, and pi via a GitHub Copilot subscription, all inside throwaway local containers.

The score-lift is +0.00 — snae saturates, even on this weaker/cheaper tier. Every arm hits 1.00. And the efficiency signal we hoped might carry the lift does not robustly replicate: durations are close and mixed (claude with/without 328 / 370 s, codex 363 / 442 s, pi 596 / 559 s — pi reversed) — noise at n=1, not a clean with-beats-without ordering. (An earlier run on the harnesses' default models showed a dramatic pi efficiency gap, but that was a weak default model floundering without guidance, not a stable effect.)

Conclusion: snae can't discriminate the Lightcone layer's value — it's too easy. The rig is sound; the task set is the work.

* pi reads 1.00 / pass 0% because it builds everything correctly but doesn't emit the BUILD_COMPLETE marker the way Claude Code does — its result is real; the completion flag is a per-harness detection nuance.

Open questions (this is where feedback helps most)

- Designing discriminating tasks: now the central question. What analyses are hard enough that the Lightcone layer changes the outcome, not just the wall-clock? (more outputs, trickier spec features, deliberate failure-diagnosis paths.)
- Trials per cell + what to report: n=1 is too noisy; we likely want pass@k over several trials, and possibly cost/turns alongside score.
- Model tier: this run standardized on haiku / gpt-5.4-mini / gpt-5.4-mini (comparable, cheap). Is that the tier we certify on, or a spread?
- CI surface: local Docker today; GitHub Actions vs Daytona for the actual per-PR check, and org-owned credentials rather than personal ones.
- pi completion detection: small adapter refinement so a correct pi build registers as complete.
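
For the trials-per-cell question, one candidate is the standard unbiased pass@k estimator, 1 − C(n−c, k) / C(n, k), where n is the number of trials in a cell and c the number that passed. The sketch below assumes that definition; how the current report computes its pass@k figure is not stated in this PR, and the pass_at_k helper is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k trials drawn from n passes,
    given that c of the n trials passed."""
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # comb(n - c, k) is 0 whenever fewer than k trials failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With n=1 the estimator can only be 0 or 1, which is why a single
# trial per cell carries so little information.
print(pass_at_k(1, 1, 1), pass_at_k(1, 0, 1))
print(round(pass_at_k(4, 2, 2), 4))
```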

🤖 Generated with Claude Code

— Claude, on behalf of Cail

…dbox auto_stop_interval=0 disabled Daytona's auto-stop on the assumption that teardown() always runs. Any path where it doesn't — unhandled exception, killed process, cancelled CI job — leaks a sandbox that runs forever and burns the compute budget (the LCR-131 blocker). Set a 30-minute idle backstop: an active trial polls every ~10s so it never idles out, while an orphaned sandbox stops itself. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The trial loop hardwired Claude Code via EvalSandbox.exec_claude. Extract a Harness ABC (install / credentials / prepare / invoke) with AgentResult as the normalized return type, and move the Claude logic into ClaudeHarness. The sandbox now exposes exec_async_poll and delegates exec_claude to the harness; ClaudeResult and _parse_claude_output remain as back-compat re-exports. This freezes the interface the codex and pi adapters plug into. Grading is untouched (it reads the filesystem, not the agent). Behaviour-identical — 49/49 eval tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

CodexHarness (codex exec --json --dangerously-bypass-approvals-and-sandbox) and PiHarness (pi -p --mode json, with native --skill/--no-skills gating the with/without-skills A/B). Both register in the harness registry; all three (claude, codex, pi) instantiate against the frozen interface. JSON output shapes for codex/pi are parsed tolerantly and get confirmed against real transcripts on the first local run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
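
One way to read "parsed tolerantly" is to treat the agent's stdout as a stream that may mix JSON events with ordinary log lines, keep whatever decodes, and ignore the rest. The sketch below assumes one JSON object per line; the helper names and the candidate keys text, message and content are hypothetical and are not the confirmed codex or pi output schema.

```python
import json


def parse_json_lines(stdout: str) -> list[dict]:
    """Collect every line that decodes to a JSON object; skip the rest."""
    events = []
    for line in stdout.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # banner, progress output, blank line
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or malformed event
        if isinstance(obj, dict):
            events.append(obj)
    return events


def last_text(events, keys=("text", "message", "content")):
    """Return the last non-empty string found under any candidate key."""
    for event in reversed(events):
        for key in keys:
            value = event.get(key)
            if isinstance(value, str) and value:
                return value
    return ""


sample = "\n".join([
    "starting agent...",
    '{"type": "step", "text": "writing fit.py"}',
    '{"type": "step", "text": ',  # cut off mid-event
    '{"type": "final", "message": "BUILD_COMPLETE"}',
])
events = parse_json_lines(sample)
print(len(events), last_text(events))
```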

run_eval now schedules tasks × harnesses × skill_variants × trials. run_trial selects the backend (daytona | local_docker), instantiates the harness, runs prepare() (strips the Lightcone scaffold for the bare arm) then invoke(), and stamps harness/with_skills onto the trial. EvalRunConfig gains harnesses (with optional per-harness model), skill_variants, backend, and max_turns/trial_timeout overrides; TrialResult gains harness/with_skills. report adds a task × harness × {with,without} matrix with the with−without Δ lift, the A/B headline. Local demo + smoke configs added. Tests updated to patch get_harness. 49/49 green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
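
The matrix is a plain Cartesian product, which can be sketched as below. TrialSpec and schedule are hypothetical names. The with-skills id format mirrors the trial id visible in the CI log (build-snae-claude-skills-0); the "bare" suffix for the other arm is a guess.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class TrialSpec:
    task: str
    harness: str
    with_skills: bool
    trial: int

    @property
    def trial_id(self) -> str:
        # "bare" is an assumed label for the without-skills arm.
        arm = "skills" if self.with_skills else "bare"
        return f"build-{self.task}-{self.harness}-{arm}-{self.trial}"


def schedule(tasks, harnesses, skill_variants, trials):
    """Expand tasks × harnesses × skill_variants × trials into trial specs."""
    return [
        TrialSpec(task, harness, with_skills, index)
        for task, harness, with_skills, index in product(
            tasks, harnesses, skill_variants, range(trials)
        )
    ]


specs = schedule(["snae"], ["claude", "codex", "pi"], [True, False], trials=1)
for spec in specs:
    print(spec.trial_id)
```

With one task, three harnesses, two arms and one trial per cell this yields the six trials reported in the first results.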

Sandbox ABC + ExecuteResult (backends/base.py) and LocalDockerSandbox: builds a cached per-harness image (python:3.12-slim + Node 22 via NodeSource + claude/codex/pi) and runs each trial in a throwaway container. Auth is copied in from the host (~/.claude.json, ~/.codex/auth.json, ~/.pi/agent/*) and chowned to evaluser — host 0600 files won't bind-mount readable. exec is a blocking docker exec; setup() mirrors EvalSandbox (wheel install, lc init, seed overlay, prompt staging, git seed). Node 22 (not Debian's 20) because pi requires >=22.19. run_trial passes eval-metadata env explicitly since this backend forwards only caller-supplied env_vars. Verified in-container: claude 2.1.177, codex-cli 0.139.0, pi 0.79.3. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
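
The container lifecycle described here maps onto three docker CLI calls: start a long-lived throwaway container, copy auth files in, and exec commands as the unprivileged user. The sketch below only builds the argument lists, so it runs without Docker installed; the function names and the sleep-infinity keep-alive are assumptions, not the actual LocalDockerSandbox code.

```python
def run_argv(image: str, name: str) -> list[str]:
    """Start a detached container that is removed when it stops."""
    return ["docker", "run", "--detach", "--rm", "--name", name,
            image, "sleep", "infinity"]


def copy_argv(host_path: str, name: str, container_path: str) -> list[str]:
    """Copy a host file in, instead of bind-mounting a 0600 file."""
    return ["docker", "cp", host_path, f"{name}:{container_path}"]


def exec_argv(name, command, env=None, user="evaluser"):
    """Run a shell command in the container with only the given env vars."""
    argv = ["docker", "exec", "--user", user]
    for key, value in sorted((env or {}).items()):
        argv += ["--env", f"{key}={value}"]
    argv += [name, "bash", "-lc", command]
    return argv


# Copied files arrive owned by root, so a root exec chowns them to evaluser.
print(run_argv("eval-claude:latest", "trial-0"))
print(exec_argv("trial-0", "chown -R evaluser /home/evaluser", user="root"))
print(exec_argv("trial-0", "lc status", env={"EVAL_TASK": "snae"}))
```

Each list can be handed to subprocess.run(argv, check=True). Passing env explicitly matches the note above that this backend forwards only caller-supplied variables.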

…ke timeout From the first local-Docker smoke (see first-smoke-findings fiber):
- add jq to the image (claude's session-start hook calls it)
- codex off the spark model: spark needs API-key auth and is rejected on a ChatGPT-account login; use codex's default model
- local-smoke trial_timeout 900->180: pi/codex have no max-turns flag, so a short timeout is what bounds them (pi otherwise runs to the full timeout)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- LocalDockerSandbox no longer copies host ~/.claude.json into trial containers: claude authenticates via CLAUDE_CODE_OAUTH_TOKEN (forwarded env) on the image's onboarding file. The host file lacks the token on macOS (keychain) and would drag MCP/project state into every container. Verified in-container.
- local-demo: remove the turn cap (agents run to completion), add a 30-min trial_timeout safety ceiling, concurrency 3.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The seed validated against an older/different spec and failed `astra validate` (top-level `description`, `recipe.inputs` rejected) — so `spec_valid` was a constant 0. Rewritten to current conventions via the /lightcone skill + astra.md reference, preserving the science: Union2.1 -> flat LCDM MAP fit, same three outputs (best_fit, hubble_diagram, residuals) with recipe commands, all decisions. `description` -> narrative.summary (+ required narrative sections); `recipe.inputs` -> Output.inputs data flow; added required id: snae; decisions wired into recipe commands as {decisions.x} flags. Validates clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

github-actions · 2026-06-14T00:42:53Z

❌ Eval Results

Metric	Value
Score	0.00
Build complete	❌
Cost	$0.00
Turns	0
Duration	0s
lightcone-cli	`0.3.7.dev47+g9d1c15fa5` (`9d1c15fa`)
Results	Download

Graders

No grader results

Full output

01:18:17 lightcone.eval.build Building lightcone-cli wheel from /home/runner/work/lightcone-cli/lightcone-cli ...
01:18:23 lightcone.eval.build Built lightcone_cli-0.3.7.dev47+g9d1c15fa5-py3-none-any.whl (commit 9d1c15fa)
01:18:25 lightcone.eval.harness Trial build-snae-claude-skills-0 failed: Failed to create sandbox: Invalid credentials
Traceback (most recent call last):
  File "/home/runner/work/lightcone-cli/lightcone-cli/src/lightcone/eval/harness.py", line 149, in run_trial
    sandbox.create()
  File "/home/runner/work/lightcone-cli/lightcone-cli/src/lightcone/eval/sandbox.py", line 158, in create
    self._sandbox = self._daytona.create(
                    ^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/lightcone-cli/lightcone-cli/.venv/lib/python3.12/site-packages/daytona_sdk/_utils/errors.py", line 206, in sync_wrapper
    process_n_raise_exception(e)
  File "/home/runner/work/lightcone-cli/lightcone-cli/.venv/lib/python3.12/site-packages/daytona_sdk/_utils/errors.py", line 139, in process_n_raise_exception
    raise create_daytona_error(
daytona_sdk.common.errors.DaytonaAuthenticationError: Failed to create sandbox: Invalid credentials
  snae trial 0: score=0.00 error: Failed to create sandbox: Invalid credentials

lightcone-cli: 0.3.7.dev47+g9d1c15fa5 (HEAD 9d1c15fa)
ASTRA: 0.2.9

  Eval Results: Scores  
┏━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Task ┃     Score     ┃
┡━━━━━━╇━━━━━━━━━━━━━━━┩
│ snae │ 0.00 +/- 0.00 │
│      │  pass@k: 0%   │
│      │   1 errors    │
└──────┴───────────────┘

   Eval Results: Cost &   
         Duration         
┏━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Task ┃ Cost / Duration ┃
┡━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ snae │      $0.00      │
│      │       0s        │
└──────┴─────────────────┘

Total: 1 trials, $0.00, 0s

   Eval Matrix: score by harness × Lightcone layer   
┏━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ Task ┃ Harness ┃   with skills   ┃ without skills ┃
┡━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ snae │ claude  │ 0.00  pass@k 0% │       —        │
│      │         │      1 err      │                │
└──────┴─────────┴─────────────────┴────────────────┘

Results saved to: eval-results/build-9d1c15fa/results.json

cloudflare-workers-and-pages · 2026-06-14T00:43:06Z

Deploying with Cloudflare Workers

Status	Name	Latest Commit	Preview URL	Updated (UTC)
✅ Deployment successful! View logs	lightcone-cli	`b7fd9e4`	Commit Preview URL Branch Preview URL	Jun 14 2026, 01:19 AM

…5.4-mini codex Spark was retired; instead codex and pi both run gpt-5.4-mini and claude runs haiku, for a comparable cheap tier (and an effectively harder task than the prior defaults). pi reaches gpt-5.4-mini via GitHub Copilot using its native provider/model string (github-copilot/gpt-5.4-mini); pi's auth carries the copilot provider, so it resolves in-container. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

cailmdaley and others added 8 commits June 14, 2026 00:29

cailmdaley wants to merge 9 commits into daley/integration from cailmdaley/lcr-131-multi-harness-ci-test-rig-for-skillsplugins-the-unblocker
