Add "beat the robot" challenge mode to the playground by rsasaki0109 · Pull Request #12 · rsasaki0109/PythonInteractiveRobotics

rsasaki0109 · 2026-06-04T22:41:21Z

What

A flashy, sticky follow-on to the Phase-3 edit cell: a "Score 10 seeds" button in the pick_and_retry scenario that scores your edited agent vs. the shipped one across 10 seeds and shows a baseline-vs-you scoreboard (success rate, reward, steps, retries, grasp misses) with a 🏆 win / lose verdict.

Why across many seeds (not one)

Single-seed scoring rewards overfitting: drop the retry schedule and seed 3 solves in 2 steps. Scoring across seeds does not: an agent that ignores its belief and stabs a fixed spot falls to a 0% success rate. So the challenge turns "edit the brain" into something whose score you want to share, and teaches why the retry/belief logic exists.
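The overfitting argument can be illustrated with a toy model. This is a sketch only, not the playground's pick_and_retry environment: the five-slot world, `object_slot`, `fixed_spot_agent`, `sweeping_agent`, and `run_episode` are all hypothetical names made up for illustration, and the numbers it prints are not the PR's numbers.

```python
from typing import Callable, List

NUM_SLOTS = 5  # toy world: the object sits in one of five slots


def object_slot(seed: int) -> int:
    """Toy stand-in for seeded randomness: the seed fixes where the object is."""
    return seed % NUM_SLOTS


def fixed_spot_agent(_attempt: int) -> int:
    """Overfit policy: always grasp slot 3, whatever happened before."""
    return 3


def sweeping_agent(attempt: int) -> int:
    """Retry policy: after each miss, move on to the next untried slot."""
    return attempt % NUM_SLOTS


def run_episode(agent: Callable[[int], int], seed: int,
                max_steps: int = NUM_SLOTS) -> dict:
    """Let the agent grasp up to max_steps times; report success and steps used."""
    target = object_slot(seed)
    for attempt in range(max_steps):
        if agent(attempt) == target:
            return {"success": True, "steps": attempt + 1}
    return {"success": False, "steps": max_steps}


def success_rate(agent: Callable[[int], int], seeds: List[int]) -> float:
    """Fraction of seeds on which the agent grasps the object."""
    results = [run_episode(agent, seed) for seed in seeds]
    return sum(r["success"] for r in results) / len(results)


seeds = list(range(10))
print(run_episode(fixed_spot_agent, 3))  # fast on the one seed it was tuned for
print(run_episode(sweeping_agent, 3))    # slower on that same seed
print(success_rate(fixed_spot_agent, seeds), success_rate(sweeping_agent, seeds))
```

On seed 3 alone the fixed-spot policy looks better because it needs fewer steps, but across the ten seeds it succeeds only where the object happens to sit in slot 3, while the sweeping policy succeeds everywhere.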

Changes

- pir/viz/playground_trace.py: score_pick_and_retry(run_agent, agent_factory, seeds) aggregates trace summaries. run_agent is passed in, keeping the module decoupled from examples/.
- docs/playground.{html,js,css}: challenge panel (pick_and_retry only); a driver that scores baseline + edited agent and returns plain JSON; scoreboard with per-metric "better" highlighting and a verdict banner.
- tests/test_playground_trace.py: scorer shape; the shipped agent is 100% robust over the seed set; a belief-ignoring agent scores strictly worse.
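A minimal sketch of what a scorer with the `(run_agent, agent_factory, seeds)` argument order might look like. It is not the code in pir/viz/playground_trace.py: the function name `score_agent_sketch`, the assumption that `run_agent(agent, seed)` returns one summary dict per episode, and the summary keys (`success`, `reward`, `steps`, `retries`, `grasp_misses`) are all guesses based on the scoreboard metrics listed in this PR.

```python
from statistics import mean
from typing import Any, Callable, Dict, Iterable

# Assumed summary shape: one dict per episode holding the scoreboard metrics.
Summary = Dict[str, Any]


def score_agent_sketch(
    run_agent: Callable[[Any, int], Summary],
    agent_factory: Callable[[], Any],
    seeds: Iterable[int],
) -> Dict[str, float]:
    """Run a fresh agent on every seed and average the per-episode summaries."""
    # A new agent per seed keeps state from one episode out of the next.
    summaries = [run_agent(agent_factory(), seed) for seed in seeds]
    if not summaries:
        raise ValueError("at least one seed is required")
    return {
        "episodes": len(summaries),
        "success_rate": mean([1.0 if s["success"] else 0.0 for s in summaries]),
        "reward": mean([s["reward"] for s in summaries]),
        "steps": mean([s["steps"] for s in summaries]),
        "retries": mean([s["retries"] for s in summaries]),
        "grasp_misses": mean([s["grasp_misses"] for s in summaries]),
    }


def fake_run_agent(agent: Any, seed: int) -> Summary:
    """Hypothetical runner for the demo: succeeds on even seeds only."""
    success = seed % 2 == 0
    return {
        "success": success,
        "reward": 1.0 if success else -1.0,
        "steps": 3,
        "retries": seed,
        "grasp_misses": 0,
    }


scores = score_agent_sketch(fake_run_agent, lambda: object(), range(10))
print(scores)
```

Because the runner is injected, the scorer itself never imports anything from examples/, which is the decoupling the first bullet describes; a test can hand it a stub like `fake_run_agent` instead of a real simulation.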

Verification

- Full suite green (124 tests).
- Verified the exact driver in an unpacked-bundle sim: shipped agent 100% / reward 0.65 vs a belief-ignoring agent 0% / reward -1.2.
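The figures above can be fed into a sketch of the baseline-vs-you comparison. The scoreboard itself lives in docs/playground.js, so this Python version is only an illustration: the "better" direction of each metric and the verdict rule (success rate decides, reward breaks ties, a draw counts as a loss) are assumptions, and `compare_scores` and `verdict` are hypothetical names.

```python
from typing import Dict

# Assumed direction of "better" for each scoreboard metric.
HIGHER_IS_BETTER = {
    "success_rate": True,
    "reward": True,
    "steps": False,
    "retries": False,
    "grasp_misses": False,
}


def compare_scores(baseline: Dict[str, float], yours: Dict[str, float]) -> Dict[str, str]:
    """Name the better side ('baseline', 'you', or 'tie') for each shared metric."""
    winners = {}
    for metric, higher in HIGHER_IS_BETTER.items():
        if metric not in baseline or metric not in yours:
            continue  # skip metrics that one side did not report
        b, y = baseline[metric], yours[metric]
        if b == y:
            winners[metric] = "tie"
        elif (y > b) == higher:
            winners[metric] = "you"
        else:
            winners[metric] = "baseline"
    return winners


def verdict(baseline: Dict[str, float], yours: Dict[str, float]) -> str:
    """Hypothetical rule: success rate decides, reward breaks ties, a draw loses."""
    key_baseline = (baseline["success_rate"], baseline["reward"])
    key_yours = (yours["success_rate"], yours["reward"])
    return "win" if key_yours > key_baseline else "lose"


# Numbers reported in the verification note; other metrics were not reported.
shipped = {"success_rate": 1.0, "reward": 0.65}
belief_ignoring = {"success_rate": 0.0, "reward": -1.2}

print(compare_scores(shipped, belief_ignoring))
print(verdict(shipped, belief_ignoring))
```

With the belief-ignoring agent in the "you" column, both reported metrics favour the baseline and the verdict is a loss.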

Still pending (browser, your side)

Pick "Pick and retry (real Python)", edit the agent, click Score 10 seeds, and confirm the scoreboard + verdict render.

🤖 Generated with Claude Code

Add a "Score 10 seeds" button to the pick_and_retry scenario that scores the edited agent against the shipped one across 10 seeds and shows a baseline-vs-you scoreboard (success rate, reward, steps, retries, grasp misses) with a win/lose verdict.

Scoring across many seeds is the point: a single-seed score rewards overfitting (dropping the retry schedule solves seed 3 in 2 steps), but robustness across seeds does not; an agent that ignores its belief drops to a 0% success rate. This turns "edit the brain" into a sticky, shareable challenge and teaches why the retry/belief logic exists.

- pir/viz/playground_trace.py: score_pick_and_retry(run_agent, agent_factory, seeds) aggregates trace summaries; run_agent is passed in to keep the module decoupled from examples/.
- docs/playground.{html,js,css}: challenge panel (pick_and_retry only), driver that scores baseline + edited agent and returns plain JSON, scoreboard with per-metric "better" highlighting and verdict.
- tests/test_playground_trace.py: scorer shape, shipped agent is 100% robust, and a belief-ignoring agent scores strictly worse.

Verified the exact driver in an unpacked-bundle sim: shipped agent 100% / reward 0.65 vs a belief-ignoring agent 0% / reward -1.2. Full suite green (124 tests).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

rsasaki0109 merged commit 949dc5c into main Jun 4, 2026
3 checks passed

rsasaki0109 deleted the playground-challenge-mode branch June 4, 2026 22:46
