Add "beat the robot" challenge mode to the playground #12

Merged
rsasaki0109 merged 1 commit into main from playground-challenge-mode on Jun 4, 2026

@rsasaki0109

Owner

What

A flashy, sticky follow-on to the Phase-3 edit cell: a "Score 10 seeds" button in the pick_and_retry scenario that scores your edited agent vs. the shipped one across 10 seeds and shows a baseline-vs-you scoreboard (success rate, reward, steps, retries, grasp misses) with a 🏆 win / lose verdict.

Why across many seeds (not one)

Single-seed scoring rewards overfitting — drop the retry schedule and seed 3 solves in 2 steps. Scoring across seeds doesn't: an agent that ignores its belief and stabs a fixed spot falls to a 0% success rate. So the challenge turns "edit the brain" into something you want to share a score from, and teaches why the retry/belief logic exists.
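To make the overfitting argument concrete, here is a toy sketch. It is not the real pick_and_retry environment: the five-spot world, `target_for`, and both agents are hypothetical stand-ins invented for illustration, and the numbers it produces are not the ones reported in this PR.

```python
from typing import Callable, Iterable, Tuple

N_SPOTS = 5  # hypothetical: number of places the object can be


def target_for(seed: int) -> int:
    """Toy stand-in for seeded randomness: the seed fixes where the object is."""
    return seed % N_SPOTS


def fixed_spot_agent(seed: int) -> Tuple[bool, int]:
    """Stabs spot 3 once and never retries. Returns (success, steps)."""
    return (target_for(seed) == 3, 1)


def retrying_agent(seed: int) -> Tuple[bool, int]:
    """Tries spots 0, 1, 2, ... until the grasp succeeds. Returns (success, steps)."""
    for steps, spot in enumerate(range(N_SPOTS), start=1):
        if spot == target_for(seed):
            return (True, steps)
    return (False, N_SPOTS)


def success_rate(agent: Callable[[int], Tuple[bool, int]], seeds: Iterable[int]) -> float:
    """Fraction of seeds on which the agent succeeds."""
    seeds = list(seeds)
    return sum(agent(s)[0] for s in seeds) / len(seeds)


# On seed 3 alone the fixed-spot agent looks better: 1 step instead of 4.
print(fixed_spot_agent(3), retrying_agent(3))
# Across 10 seeds the ranking flips.
print(success_rate(fixed_spot_agent, range(10)), success_rate(retrying_agent, range(10)))
```

In this toy the shortcut wins on the one seed it was tuned for and loses badly once the score is averaged over seeds, which is the behaviour the challenge is meant to expose.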

Changes

  • pir/viz/playground_trace.py: score_pick_and_retry(run_agent, agent_factory, seeds) aggregates trace summaries. run_agent is passed in, keeping the module decoupled from examples/.
  • docs/playground.{html,js,css} — challenge panel (pick_and_retry only); a driver that scores baseline + edited agent and returns plain JSON; scoreboard with per-metric "better" highlighting and a verdict banner.
  • tests/test_playground_trace.py — scorer shape; the shipped agent is 100% robust over the seed set; a belief-ignoring agent scores strictly worse.
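A rough sketch of what an aggregator with this signature could look like. Only the argument list `(run_agent, agent_factory, seeds)` comes from this PR. The function name `score_sketch`, the `run_agent(agent, seed)` call shape, the summary dict keys, and the use of one fresh agent per seed are all assumptions, not the merged implementation.

```python
from statistics import mean
from typing import Any, Callable, Dict, Iterable

# Metric names mirror the scoreboard columns; the dict keys are assumed.
METRICS = ("reward", "steps", "retries", "grasp_misses")


def score_sketch(
    run_agent: Callable[[Any, int], Dict[str, Any]],
    agent_factory: Callable[[], Any],
    seeds: Iterable[int],
) -> Dict[str, float]:
    """Run a fresh agent on each seed and average the per-run summaries."""
    # Assumption: run_agent(agent, seed) returns one summary dict per run.
    summaries = [run_agent(agent_factory(), seed) for seed in seeds]
    if not summaries:
        raise ValueError("need at least one seed")
    score: Dict[str, float] = {
        "seeds": len(summaries),
        "success_rate": mean(1.0 if s["success"] else 0.0 for s in summaries),
    }
    for name in METRICS:
        score[name] = mean(s[name] for s in summaries)
    return score


def fake_run_agent(agent: Callable[[int], bool], seed: int) -> Dict[str, Any]:
    """Hypothetical stand-in for the real runner; returns a canned summary."""
    ok = agent(seed)
    return {"success": ok, "reward": 1.0 if ok else -1.0,
            "steps": 3, "retries": 1, "grasp_misses": 0}


# A toy "agent" that only succeeds on even seeds.
score = score_sketch(fake_run_agent, lambda: (lambda seed: seed % 2 == 0), range(10))
print(score)
```

Because the runner is an argument, the aggregator can be tested with a stub like `fake_run_agent` without importing anything from examples/.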

Verification

  • Full suite green (124 tests).
  • Verified the exact driver in an unpacked-bundle sim: shipped agent 100% / reward 0.65 vs a belief-ignoring agent 0% / reward -1.2.
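For illustration, the comparison behind the scoreboard can be sketched as below, using the two results quoted above. The direction of each metric and the verdict rule (success rate first, then reward) are assumptions made for this sketch; the actual rule lives in docs/playground.js and may differ.

```python
from typing import Dict

# Assumed direction per metric: True means a higher value is better.
HIGHER_IS_BETTER = {
    "success_rate": True,
    "reward": True,
    "steps": False,
    "retries": False,
    "grasp_misses": False,
}


def better_side(metric: str, baseline: float, you: float) -> str:
    """Return which column gets the 'better' highlight for one metric."""
    if baseline == you:
        return "tie"
    you_better = you > baseline if HIGHER_IS_BETTER[metric] else you < baseline
    return "you" if you_better else "baseline"


def verdict(baseline: Dict[str, float], you: Dict[str, float]) -> str:
    """Hypothetical rule: compare success rate first, then reward."""
    for metric in ("success_rate", "reward"):
        side = better_side(metric, baseline[metric], you[metric])
        if side != "tie":
            return "win" if side == "you" else "lose"
    return "tie"


shipped = {"success_rate": 1.0, "reward": 0.65}
belief_ignoring = {"success_rate": 0.0, "reward": -1.2}
print(verdict(shipped, belief_ignoring))  # prints "lose"
```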

Still pending (browser, your side)

Pick "Pick and retry (real Python)", edit the agent, click Score 10 seeds, and confirm the scoreboard + verdict render.

🤖 Generated with Claude Code

Add a "Score 10 seeds" button to the pick_and_retry scenario that scores the
edited agent against the shipped one across 10 seeds and shows a baseline-vs-you
scoreboard (success rate, reward, steps, retries, grasp misses) with a win/lose
verdict.

Scoring across many seeds is the point: a single-seed score rewards overfitting
(dropping the retry schedule solves seed 3 in 2 steps), but robustness across
seeds does not — an agent that ignores its belief drops to a 0% success rate.
This turns "edit the brain" into a sticky, shareable challenge and teaches why
the retry/belief logic exists.

- pir/viz/playground_trace.py: score_pick_and_retry(run_agent, agent_factory,
  seeds) aggregates trace summaries; run_agent is passed in to keep the module
  decoupled from examples/.
- docs/playground.{html,js,css}: challenge panel (pick_and_retry only), driver
  that scores baseline + edited agent and returns plain JSON, scoreboard with
  per-metric "better" highlighting and verdict.
- tests/test_playground_trace.py: scorer shape, shipped agent is 100% robust,
  and a belief-ignoring agent scores strictly worse.

Verified the exact driver in an unpacked-bundle sim: shipped agent 100% /
reward 0.65 vs a belief-ignoring agent 0% / reward -1.2. Full suite green
(124 tests).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rsasaki0109 rsasaki0109 merged commit 949dc5c into main Jun 4, 2026
3 checks passed
@rsasaki0109 rsasaki0109 deleted the playground-challenge-mode branch June 4, 2026 22:46