Add "beat the robot" challenge mode to the playground #12
Merged
Add a "Score 10 seeds" button to the pick_and_retry scenario that scores the
edited agent against the shipped one across 10 seeds and shows a baseline-vs-you
scoreboard (success rate, reward, steps, retries, grasp misses) with a win/lose
verdict.
Scoring across many seeds is the point. A single-seed score rewards overfitting
(dropping the retry schedule solves seed 3 in 2 steps); a multi-seed score does
not, because an agent that ignores its belief drops to a 0% success rate.
This turns "edit the brain" into a sticky, shareable challenge and teaches why
the retry/belief logic exists.
- pir/viz/playground_trace.py: score_pick_and_retry(run_agent, agent_factory,
seeds) aggregates trace summaries; run_agent is passed in to keep the module
decoupled from examples/.
- docs/playground.{html,js,css}: challenge panel (pick_and_retry only), driver
that scores baseline + edited agent and returns plain JSON, scoreboard with
per-metric "better" highlighting and verdict.
- tests/test_playground_trace.py: scorer shape, shipped agent is 100% robust,
and a belief-ignoring agent scores strictly worse.
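
A minimal sketch of the aggregation shape described above, not the shipped implementation. The function name score_across_seeds, the run_agent(agent, seed) call signature, and the summary keys (success, reward, steps, retries, grasp_misses) are all assumptions made for illustration; only the three-argument shape (run_agent, agent_factory, seeds) comes from this PR.

```python
from statistics import mean
from typing import Any, Callable, Dict, Iterable

# Hypothetical summary shape: one dict per episode trace.
Summary = Dict[str, Any]


def score_across_seeds(
    run_agent: Callable[[Any, int], Summary],
    agent_factory: Callable[[], Any],
    seeds: Iterable[int],
) -> Dict[str, float]:
    """Run one episode per seed and average the per-episode summaries."""
    # Build a fresh agent per seed so no state leaks between episodes.
    summaries = [run_agent(agent_factory(), seed) for seed in seeds]
    if not summaries:
        raise ValueError("seeds must not be empty")
    return {
        "episodes": len(summaries),
        "success_rate": mean(1.0 if s["success"] else 0.0 for s in summaries),
        "mean_reward": mean(s["reward"] for s in summaries),
        "mean_steps": mean(s["steps"] for s in summaries),
        "mean_retries": mean(s["retries"] for s in summaries),
        "mean_grasp_misses": mean(s["grasp_misses"] for s in summaries),
    }
```

Taking run_agent as a parameter is what lets a scorer like this be tested with a stub runner and keeps it free of imports from examples/.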
Verified the exact driver in an unpacked-bundle sim: shipped agent 100% /
reward 0.65 vs a belief-ignoring agent 0% / reward -1.2. Full suite green
(124 tests).
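
For illustration, the comparison behind a scoreboard like this could look as follows. This is a sketch under assumptions: the metric names, the "higher is better" directions, and the verdict rule (success rate first, reward as tie-breaker) are hypothetical and are not taken from the playground driver. The sample values are the success rate and reward figures quoted above.

```python
# Hypothetical direction table: True means a larger value is better.
HIGHER_IS_BETTER = {
    "success_rate": True,
    "mean_reward": True,
    "mean_steps": False,
    "mean_retries": False,
    "mean_grasp_misses": False,
}


def better_side(metric: str, baseline: float, yours: float) -> str:
    """Return which column to highlight for one metric."""
    if baseline == yours:
        return "tie"
    if HIGHER_IS_BETTER[metric]:
        yours_wins = yours > baseline
    else:
        yours_wins = yours < baseline
    return "you" if yours_wins else "baseline"


def verdict(baseline: dict, yours: dict) -> str:
    """Assumed rule: compare success rate first, then reward."""
    key_baseline = (baseline["success_rate"], baseline["mean_reward"])
    key_yours = (yours["success_rate"], yours["mean_reward"])
    return "win" if key_yours > key_baseline else "lose"


shipped = {"success_rate": 1.0, "mean_reward": 0.65}
belief_ignoring = {"success_rate": 0.0, "mean_reward": -1.2}
print(verdict(shipped, belief_ignoring))
```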
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What
A flashy, sticky follow-on to the Phase-3 edit cell: a "Score 10 seeds" button in the pick_and_retry scenario that scores your edited agent vs. the shipped one across 10 seeds and shows a baseline-vs-you scoreboard (success rate, reward, steps, retries, grasp misses) with a 🏆 win / lose verdict.
Why across many seeds (not one)
Single-seed scoring rewards overfitting: drop the retry schedule and seed 3 solves in 2 steps. Scoring across seeds doesn't: an agent that ignores its belief and stabs a fixed spot falls to a 0% success rate. So the challenge turns "edit the brain" into something worth sharing a score from, and teaches why the retry/belief logic exists.
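
A toy illustration of the overfitting argument. This is not the pick_and_retry environment: the slot world, target_for, and both agents are hypothetical stand-ins chosen so the effect is easy to see.

```python
N_SLOTS = 5


def target_for(seed: int) -> int:
    # Toy world: the object's slot depends only on the seed.
    return seed % N_SLOTS


def fixed_spot_agent(spot: int):
    # Overfit agent: always stabs the same slot, never retries elsewhere.
    return lambda attempt: spot


def retrying_agent():
    # Robust agent: tries a different slot on each attempt.
    return lambda attempt: attempt % N_SLOTS


def run(agent, seed: int, max_attempts: int = N_SLOTS) -> bool:
    """Return True if the agent hits the target within max_attempts."""
    target = target_for(seed)
    return any(agent(attempt) == target for attempt in range(max_attempts))


def success_rate(agent, seeds) -> float:
    seeds = list(seeds)
    return sum(run(agent, s) for s in seeds) / len(seeds)


overfit = fixed_spot_agent(target_for(3))  # tuned to seed 3 only
print(success_rate(overfit, [3]))                 # 1.0 on the seed it was tuned for
print(success_rate(overfit, range(10)))           # 0.2 across ten seeds
print(success_rate(retrying_agent(), range(10)))  # 1.0 across ten seeds
```

Scored on seed 3 alone, the fixed-spot agent looks perfect and finishes in a single attempt; scored on ten seeds, it succeeds only where the target happens to match its hard-coded slot.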
Changes
- pir/viz/playground_trace.py: score_pick_and_retry(run_agent, agent_factory, seeds) aggregates trace summaries. run_agent is passed in, keeping the module decoupled from examples/.
- docs/playground.{html,js,css}: challenge panel (pick_and_retry only); a driver that scores baseline + edited agent and returns plain JSON; scoreboard with per-metric "better" highlighting and a verdict banner.
- tests/test_playground_trace.py: scorer shape; the shipped agent is 100% robust over the seed set; a belief-ignoring agent scores strictly worse.
Verification
Still pending (browser, your side)
Pick "Pick and retry (real Python)", edit the agent, click Score 10 seeds, and confirm the scoreboard + verdict render.
🤖 Generated with Claude Code