Multi-harness CI eval rig: harness adapters + local-Docker backend + with/without-Lightcone A/B (LCR-131) #154

Open
cailmdaley wants to merge 9 commits into daley/integration from cailmdaley/lcr-131-multi-harness-ci-test-rig-for-skillsplugins-the-unblocker

cailmdaley commented Jun 14, 2026

What this is

LCR-131 — extend the per-PR eval (lc eval, the supernova/snae rig) so a skill/plugin change gets graded across multiple agent harnesses, and so we can measure whether the Lightcone layer itself helps an agent. This PR delivers the rig and a first end-to-end run.

It's a local-Docker demo, not yet wired into CI (GitHub Actions + org-owned keys deferred — see open questions).

What's built

The one place Claude Code was hardwired (EvalSandbox.exec_claude) is now a clean seam:

  • harnesses/ — a Harness ABC (install · credentials · prepare · invoke) + a normalized AgentResult, with claude / codex / pi adapters and a registry. Grading is untouched — it reads the materialized filesystem (lc status, astra validate), never the agent — so it stays harness-agnostic.
  • backends/ — a Sandbox ABC + LocalDockerSandbox: one throwaway container per trial from a cached per-harness image (python 3.12 + Node 22 + claude/codex/pi). Auth is copied in from the host (claude via CLAUDE_CODE_OAUTH_TOKEN, codex via ~/.codex, pi via ~/.pi). The existing Daytona backend stays for the eventual CI path.
  • Matrix run + report: run_eval schedules tasks × harnesses × skill_variants × trials; the report grows a task × harness × {with, without}-Lightcone scorecard with a Δ-lift column (the with−without delta). Per-harness model is configurable.
  • Fixes: a Daytona auto_stop_interval backstop (the budget-leak bug named in the issue — a skipped teardown could otherwise leak a forever-running sandbox); and the snae seed astra.yaml conformed to astra-spec 0.0.10 (it had drifted to fields the current schema rejects and failed astra validate).
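
To make the seam concrete, here is a minimal sketch of what a harness interface with the four hooks named above (install, credentials, prepare, invoke), a normalized result type, and a registry could look like. The method signatures, the AgentResult fields, the `register` decorator, the body of `get_harness`, and the `EchoHarness` stand-in are all assumptions for illustration, not the code in this PR.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class AgentResult:
    """Normalized outcome of one agent run (field names are illustrative)."""
    text: str
    is_error: bool = False
    cost_usd: float | None = None
    num_turns: int | None = None


class Harness(ABC):
    """One adapter per agent CLI. Grading never touches this class."""
    name: str

    @abstractmethod
    def install(self) -> list[str]:
        """Shell commands that install the agent CLI into the image."""

    @abstractmethod
    def credentials(self) -> dict[str, str]:
        """Environment variables to forward into the trial container."""

    @abstractmethod
    def prepare(self, with_skills: bool) -> list[str]:
        """Shell commands to run before the agent starts."""

    @abstractmethod
    def invoke(self, prompt: str) -> AgentResult:
        """Run the agent on the prompt and normalize its output."""


_REGISTRY: dict[str, type[Harness]] = {}


def register(cls: type[Harness]) -> type[Harness]:
    """Class decorator that adds an adapter to the registry under its name."""
    _REGISTRY[cls.name] = cls
    return cls


def get_harness(name: str) -> Harness:
    """Instantiate a registered adapter, or fail with the known names."""
    if name not in _REGISTRY:
        raise ValueError(f"unknown harness {name!r}; known: {sorted(_REGISTRY)}")
    return _REGISTRY[name]()


@register
class EchoHarness(Harness):
    """Hypothetical stand-in adapter that echoes the prompt; not a real agent."""
    name = "echo"

    def install(self) -> list[str]:
        return []

    def credentials(self) -> dict[str, str]:
        return {}

    def prepare(self, with_skills: bool) -> list[str]:
        # The bare arm would strip the scaffold here; this is a placeholder.
        return [] if with_skills else ["echo 'stripping scaffold'"]

    def invoke(self, prompt: str) -> AgentResult:
        return AgentResult(text=f"echo: {prompt}")
```

With this shape, the trial loop only ever calls `get_harness(name)` and the four hooks, so adding a harness means adding one registered subclass.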

49/49 eval unit tests green.

First results

{claude, codex, pi} × {with, without skills} on snae (fit Union2.1 SNe Ia to flat ΛCDM via MAP), real agents in local Docker, on a deliberately comparable, cheap model tier — claude haiku, codex gpt-5.4-mini, pi gpt-5.4-mini (via GitHub Copilot):

        Eval Matrix: score by harness × Lightcone layer
  Task   Harness   with skills        without skills     Δ lift
  snae   claude    1.00  pass 100%    1.00  pass 100%    +0.00   ($0.42 / $0.45)
  snae   codex     1.00  pass 100%    1.00  pass 100%    +0.00
  snae   pi        1.00  pass*  0%    1.00  pass*  0%    +0.00

The rig works end to end across all three harnesses: all 6 trials (3 harnesses × 2 arms) fully built the analysis (wrote the fit + plot scripts, ran them through lc run, materialized all three outputs) and passed spec_valid. claude authenticates via an OAuth token, codex via ChatGPT-account auth, pi via a GitHub Copilot subscription, all inside throwaway local containers.

The score-lift is +0.00 — snae saturates, even on this weaker/cheaper tier. Every arm hits 1.00. And the efficiency signal we hoped might carry the lift does not robustly replicate: durations are close and mixed (claude with/without 328 / 370 s, codex 363 / 442 s, pi 596 / 559 s — pi reversed) — noise at n=1, not a clean with-beats-without ordering. (An earlier run on the harnesses' default models showed a dramatic pi efficiency gap, but that was a weak default model floundering without guidance, not a stable effect.)
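
The Δ lift is the with-minus-without difference per (task, harness) cell. The sketch below recomputes it from the scores and durations quoted above; the dictionary layout and the `lift` function name are assumptions for illustration, not the report's actual code.

```python
# Scores and wall-clock durations (seconds) as quoted in this description.
cells = {
    "claude": {"with": (1.00, 328), "without": (1.00, 370)},
    "codex": {"with": (1.00, 363), "without": (1.00, 442)},
    "pi": {"with": (1.00, 596), "without": (1.00, 559)},
}


def lift(cell):
    """Return (score delta, duration delta), each computed as with minus without."""
    (s_with, t_with), (s_without, t_without) = cell["with"], cell["without"]
    return s_with - s_without, t_with - t_without


for harness, cell in cells.items():
    d_score, d_time = lift(cell)
    print(f"{harness:<7} score lift {d_score:+.2f}   duration delta {d_time:+d}s")
```

A negative duration delta means the with-skills arm finished sooner. That holds for claude (−42 s) and codex (−79 s) but flips for pi (+37 s), which is the mixed ordering described above.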

Conclusion: snae can't discriminate the Lightcone layer's value — it's too easy. The rig is sound; the task set is the work.

* pi reads 1.00 / pass 0% because it builds everything correctly but doesn't emit the BUILD_COMPLETE marker the way Claude Code does — its result is real; the completion flag is a per-harness detection nuance.

Open questions (this is where feedback helps most)

  1. Designing discriminating tasks — now the central question. What analyses are hard enough that the Lightcone layer changes the outcome, not just the wall-clock? (more outputs, trickier spec features, deliberate failure-diagnosis paths.)
  2. Trials per cell + what to report — n=1 is too noisy; we likely want pass@k over several trials, and possibly cost/turns alongside score.
  3. Model tier — this run standardized on haiku / gpt-5.4-mini / gpt-5.4-mini (comparable, cheap). Is that the tier we certify on, or a spread?
  4. CI surface — local Docker today; GitHub Actions vs Daytona for the actual per-PR check, and org-owned credentials rather than personal ones.
  5. pi completion detection — small adapter refinement so a correct pi build registers as complete.
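
For question 2, one common option is the unbiased pass@k estimator used in code-generation benchmarks: with n trials of which c pass, pass@k = 1 − C(n−c, k) / C(n, k). The sketch below implements that formula; whether the report's existing pass@k figure uses this estimator is not stated here, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k trials drawn from n (c passing) passes."""
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k trials failed, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 5 trials, 2 passes.
print(pass_at_k(5, 2, 1))  # chance a single trial passes
print(pass_at_k(5, 2, 3))  # chance at least one of three passes
```

At n=1 the estimator can only return 0 or 1, which is one way to see why a single trial per cell cannot separate the two arms.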

🤖 Generated with Claude Code

— Claude, on behalf of Cail

cailmdaley and others added 8 commits June 14, 2026 00:29
…dbox

auto_stop_interval=0 disabled Daytona's auto-stop on the assumption that
teardown() always runs. Any path where it doesn't — unhandled exception,
killed process, cancelled CI job — leaks a sandbox that runs forever and
burns the compute budget (the LCR-131 blocker). Set a 30-minute idle
backstop: an active trial polls every ~10s so it never idles out, while
an orphaned sandbox stops itself.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
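
The reasoning in this commit message can be checked with a toy model: a sandbox that stops itself after a fixed idle interval survives as long as something polls it more often than that interval, and stops on its own once polling ends. The class below is a hypothetical stand-in driven by a simulated clock, not the Daytona SDK, and it assumes (as the message does) that each poll counts as activity.

```python
class ToySandbox:
    """Simulated sandbox with an idle auto-stop; all times are in seconds."""

    def __init__(self, auto_stop_interval_s: float):
        self.auto_stop_interval_s = auto_stop_interval_s
        self.last_activity = 0.0
        self.running = True

    def tick(self, now: float) -> None:
        """Advance the clock; stop once idle for the full interval (0 = never)."""
        idle = now - self.last_activity
        if self.auto_stop_interval_s and idle >= self.auto_stop_interval_s:
            self.running = False

    def poll(self, now: float) -> None:
        """Activity from the trial loop resets the idle timer."""
        self.tick(now)
        if self.running:
            self.last_activity = now


def simulate(interval_s: float, poll_until: float, horizon: float) -> bool:
    """Poll every 10 s until poll_until, then go silent; return state at horizon."""
    box = ToySandbox(interval_s)
    t = 0.0
    while t < horizon:
        t += 10.0
        if t <= poll_until:
            box.poll(t)
        else:
            box.tick(t)
    return box.running


print(simulate(30 * 60, poll_until=3600, horizon=3600))    # active trial
print(simulate(30 * 60, poll_until=600, horizon=3 * 3600))  # orphaned sandbox
print(simulate(0, poll_until=600, horizon=3 * 3600))        # old setting, no backstop
```

In this model the active trial is still running after an hour, the orphan stops 30 minutes after its last poll, and the interval-0 case never stops, which is the leak the commit closes.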
The trial loop hardwired Claude Code via EvalSandbox.exec_claude. Extract a
Harness ABC (install / credentials / prepare / invoke) with AgentResult as the
normalized return type, and move the Claude logic into ClaudeHarness. The
sandbox now exposes exec_async_poll and delegates exec_claude to the harness;
ClaudeResult and _parse_claude_output remain as back-compat re-exports.

This freezes the interface the codex and pi adapters plug into. Grading is
untouched (it reads the filesystem, not the agent). Behaviour-identical —
49/49 eval tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CodexHarness (codex exec --json --dangerously-bypass-approvals-and-sandbox)
and PiHarness (pi -p --mode json, with native --skill/--no-skills gating the
with/without-skills A/B). Both register in the harness registry; all three
(claude, codex, pi) instantiate against the frozen interface. JSON output
shapes for codex/pi are parsed tolerantly and get confirmed against real
transcripts on the first local run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
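
One reading of "parsed tolerantly" is: treat the output as JSON Lines, skip anything that is not a JSON object, and look for the final message under several candidate keys. The sketch below illustrates that idea only; the event shapes and key names are hypothetical placeholders, not the actual codex or pi output schema.

```python
import json


def parse_events(stdout: str) -> list[dict]:
    """Collect JSON objects from line-oriented output, ignoring everything else."""
    events = []
    for line in stdout.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # banner text, progress output, truncated lines
        if isinstance(obj, dict):
            events.append(obj)
    return events


def last_text(events: list[dict], keys=("text", "message", "content")) -> str:
    """Return the most recent non-empty string found under any candidate key."""
    for event in reversed(events):
        for key in keys:
            value = event.get(key)
            if isinstance(value, str) and value:
                return value
    return ""


# Hypothetical transcript mixing noise, valid events, and a truncated line.
raw = 'starting...\n{"type": "step", "text": "working"}\n[1, 2]\n{"type": "done", "message": "finished"}\n{broken'
events = parse_events(raw)
print(len(events), last_text(events))
```

Falling back across several keys keeps the adapter working while the real shapes are being confirmed against transcripts, at the cost of possibly picking up an intermediate message.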
run_eval now schedules tasks × harnesses × skill_variants × trials. run_trial
selects the backend (daytona | local_docker), instantiates the harness, runs
prepare() (strips the Lightcone scaffold for the bare arm) then invoke(), and
stamps harness/with_skills onto the trial. EvalRunConfig gains harnesses (with
optional per-harness model), skill_variants, backend, and max_turns/
trial_timeout overrides; TrialResult gains harness/with_skills. report adds a
task × harness × {with,without} matrix with the with−without Δ lift — the A/B
headline. Local demo + smoke configs added. Tests updated to patch get_harness.
49/49 green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
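
The schedule described in this commit is a Cartesian product over four axes. A minimal sketch, assuming a flat list of trial specs; the `TrialSpec` type and `schedule` function are illustrative names, not the actual `run_eval` internals.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class TrialSpec:
    """One cell of the matrix, repeated once per trial index."""
    task: str
    harness: str
    with_skills: bool
    trial: int


def schedule(tasks, harnesses, skill_variants, n_trials):
    """Expand tasks x harnesses x skill_variants x trials into trial specs."""
    return [
        TrialSpec(task, harness, with_skills, trial)
        for task, harness, with_skills, trial in product(
            tasks, harnesses, skill_variants, range(n_trials)
        )
    ]


specs = schedule(["snae"], ["claude", "codex", "pi"], [True, False], 1)
print(len(specs))
```

For one task, three harnesses, two skill variants and one trial each, this yields the six trials reported in the first results.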
Sandbox ABC + ExecuteResult (backends/base.py) and LocalDockerSandbox: builds
a cached per-harness image (python:3.12-slim + Node 22 via NodeSource +
claude/codex/pi) and runs each trial in a throwaway container. Auth is copied
in from the host (~/.claude.json, ~/.codex/auth.json, ~/.pi/agent/*) and
chowned to evaluser — host 0600 files won't bind-mount readable. exec is a
blocking docker exec; setup() mirrors EvalSandbox (wheel install, lc init,
seed overlay, prompt staging, git seed). Node 22 (not Debian's 20) because pi
requires >=22.19. run_trial passes eval-metadata env explicitly since this
backend forwards only caller-supplied env_vars.

Verified in-container: claude 2.1.177, codex-cli 0.139.0, pi 0.79.3.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
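
A blocking `docker exec` that forwards only caller-supplied environment variables can be sketched as below. The `ExecuteResult` fields, the function names, and the choice of `sh -c` are assumptions for illustration; only the `docker exec` flags (`--user`, `--env`) and the `evaluser` account name come from Docker's CLI and this commit message respectively.

```python
from __future__ import annotations

import subprocess
from dataclasses import dataclass


@dataclass
class ExecuteResult:
    """Result of one command in the container (field names are illustrative)."""
    exit_code: int
    output: str


def docker_exec_argv(
    container: str,
    command: str,
    env_vars: dict[str, str] | None = None,
    user: str = "evaluser",
) -> list[str]:
    """Build the argv for `docker exec`, passing only the given env vars."""
    argv = ["docker", "exec", "--user", user]
    for key, value in (env_vars or {}).items():
        argv += ["--env", f"{key}={value}"]
    argv += [container, "sh", "-c", command]
    return argv


def docker_exec(
    container: str,
    command: str,
    env_vars: dict[str, str] | None = None,
    timeout: float | None = None,
) -> ExecuteResult:
    """Run the command and block until it exits (or the timeout fires)."""
    proc = subprocess.run(
        docker_exec_argv(container, command, env_vars),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return ExecuteResult(proc.returncode, proc.stdout + proc.stderr)


print(docker_exec_argv("trial-0", "lc status", {"EVAL_TRIAL": "0"}))
```

Because nothing from the host environment is inherited implicitly, any metadata the trial needs has to appear in `env_vars`, which matches the note above about `run_trial` passing eval-metadata env explicitly.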
…ke timeout

From the first local-Docker smoke (see first-smoke-findings fiber):
- add jq to the image (claude's session-start hook calls it)
- codex off the spark model: spark needs API-key auth and is rejected on a
  ChatGPT-account login; use codex's default model
- local-smoke trial_timeout 900->180: pi/codex have no max-turns flag, so a
  short timeout is what bounds them (pi otherwise runs to the full timeout)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
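
When an agent CLI has no turn limit, wall-clock is the only bound. The sketch below shows the general pattern with the standard library: run the command with a timeout and treat expiry as a bounded, non-crashing outcome. The `run_bounded` name and return convention are assumptions for illustration, not the rig's code.

```python
import subprocess
import sys


def run_bounded(argv, timeout_s):
    """Run argv; return (exit_code, stdout), with exit_code None on timeout."""
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the child before re-raising. Partial output
        # may be None or bytes depending on platform, so normalize it.
        partial = exc.stdout or ""
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        return None, partial
    return proc.returncode, proc.stdout


# A quick command finishes; a long sleep is cut off by the timeout.
print(run_bounded([sys.executable, "-c", "print('ok')"], 30))
print(run_bounded([sys.executable, "-c", "import time; time.sleep(30)"], 0.5)[0])
```

Returning `None` rather than raising lets the caller record the trial as timed out and still grade whatever was materialized on disk.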
- LocalDockerSandbox no longer copies host ~/.claude.json into trial containers:
  claude authenticates via CLAUDE_CODE_OAUTH_TOKEN (forwarded env) on the
  image's onboarding file. The host file lacks the token on macOS (keychain)
  and would drag MCP/project state into every container. Verified in-container.
- local-demo: remove the turn cap (agents run to completion), add a 30-min
  trial_timeout safety ceiling, concurrency 3.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The seed validated against an older/different spec and failed `astra validate`
(top-level `description`, `recipe.inputs` rejected) — so `spec_valid` was a
constant 0. Rewritten to current conventions via the /lightcone skill +
astra.md reference, preserving the science: Union2.1 -> flat LCDM MAP fit,
same three outputs (best_fit, hubble_diagram, residuals) with recipe commands,
all decisions. `description` -> narrative.summary (+ required narrative
sections); `recipe.inputs` -> Output.inputs data flow; added required id: snae;
decisions wired into recipe commands as {decisions.x} flags. Validates clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
github-actions Bot commented Jun 14, 2026

❌ Eval Results

Metric          Value
Score           0.00
Build complete
Cost            $0.00
Turns           0
Duration        0s
lightcone-cli   0.3.7.dev47+g9d1c15fa5 (9d1c15fa)
Results         Download

Graders

No grader results

Full output
01:18:17 lightcone.eval.build Building lightcone-cli wheel from /home/runner/work/lightcone-cli/lightcone-cli ...
01:18:23 lightcone.eval.build Built lightcone_cli-0.3.7.dev47+g9d1c15fa5-py3-none-any.whl (commit 9d1c15fa)
01:18:25 lightcone.eval.harness Trial build-snae-claude-skills-0 failed: Failed to create sandbox: Invalid credentials
Traceback (most recent call last):
  File "/home/runner/work/lightcone-cli/lightcone-cli/src/lightcone/eval/harness.py", line 149, in run_trial
    sandbox.create()
  File "/home/runner/work/lightcone-cli/lightcone-cli/src/lightcone/eval/sandbox.py", line 158, in create
    self._sandbox = self._daytona.create(
                    ^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/lightcone-cli/lightcone-cli/.venv/lib/python3.12/site-packages/daytona_sdk/_utils/errors.py", line 206, in sync_wrapper
    process_n_raise_exception(e)
  File "/home/runner/work/lightcone-cli/lightcone-cli/.venv/lib/python3.12/site-packages/daytona_sdk/_utils/errors.py", line 139, in process_n_raise_exception
    raise create_daytona_error(
daytona_sdk.common.errors.DaytonaAuthenticationError: Failed to create sandbox: Invalid credentials
  snae trial 0: score=0.00 error: Failed to create sandbox: Invalid credentials

lightcone-cli: 0.3.7.dev47+g9d1c15fa5 (HEAD 9d1c15fa)
ASTRA: 0.2.9

  Eval Results: Scores  
┏━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Task ┃     Score     ┃
┡━━━━━━╇━━━━━━━━━━━━━━━┩
│ snae │ 0.00 +/- 0.00 │
│      │  pass@k: 0%   │
│      │   1 errors    │
└──────┴───────────────┘

   Eval Results: Cost &   
         Duration         
┏━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Task ┃ Cost / Duration ┃
┡━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ snae │      $0.00      │
│      │       0s        │
└──────┴─────────────────┘

Total: 1 trials, $0.00, 0s

   Eval Matrix: score by harness × Lightcone layer   
┏━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ Task ┃ Harness ┃   with skills   ┃ without skills ┃
┡━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ snae │ claude  │ 0.00  pass@k 0% │       —        │
│      │         │      1 err      │                │
└──────┴─────────┴─────────────────┴────────────────┘

Results saved to: eval-results/build-9d1c15fa/results.json

cloudflare-workers-and-pages Bot commented Jun 14, 2026

Deploying with Cloudflare Workers

Status: ✅ Deployment successful!
Name: lightcone-cli
Latest commit: b7fd9e4
Updated (UTC): Jun 14 2026, 01:19 AM

…5.4-mini

codex Spark was retired; instead codex and pi both run gpt-5.4-mini and claude
runs haiku, for a comparable cheap tier (and an effectively harder task than
the prior defaults). pi reaches gpt-5.4-mini via GitHub Copilot using its
native provider/model string (github-copilot/gpt-5.4-mini); pi's auth carries
the copilot provider, so it resolves in-container.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>