From f558821e1f767a235b462de224696618fb371c28 Mon Sep 17 00:00:00 2001
From: rsasaki0109
Date: Fri, 5 Jun 2026 07:41:07 +0900
Subject: [PATCH] Add "beat the robot" challenge mode to the playground
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add a "Score 10 seeds" button to the pick_and_retry scenario that scores
the edited agent against the shipped one across 10 seeds and shows a
baseline-vs-you scoreboard (success rate, reward, steps, retries, grasp
misses) with a win/lose verdict.

Scoring across many seeds is the point: a single-seed score rewards
overfitting (dropping the retry schedule solves seed 3 in 2 steps), but
a score aggregated across seeds does not: an agent that ignores its
belief drops to a 0% success rate. This turns "edit the brain" into a
sticky, shareable challenge and teaches why the retry/belief logic
exists.

- pir/viz/playground_trace.py: score_pick_and_retry(run_agent,
  agent_factory, seeds) aggregates trace summaries; run_agent is passed
  in to keep the module decoupled from examples/.
- docs/playground.{html,js,css}: challenge panel (pick_and_retry only),
  driver that scores baseline + edited agent and returns plain JSON,
  scoreboard with per-metric "better" highlighting and verdict.
- tests/test_playground_trace.py: scorer shape, shipped agent is 100%
  robust, and a belief-ignoring agent scores strictly worse.

Verified the exact driver in an unpacked-bundle sim: shipped agent
100% / reward 0.65 vs a belief-ignoring agent 0% / reward -1.2.
Full suite green (124 tests).
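
For readers skimming the message, here is a rough, standalone sketch of
why multi-seed scoring punishes overfitting. It is not the code in
pir/viz/playground_trace.py: the function name score_across_seeds, the
summary keys ("success", "reward", "steps"), and the toy_run stand-in
are all assumptions made for illustration only.

```python
from statistics import mean
from typing import Callable, Dict, Iterable, List


def score_across_seeds(
    run_agent: Callable[[object, int], Dict[str, float]],
    agent_factory: Callable[[], object],
    seeds: Iterable[int],
) -> Dict[str, float]:
    """Run a fresh agent per seed and average the per-run summaries.

    Hypothetical helper: the real scorer may use different keys and metrics.
    """
    summaries: List[Dict[str, float]] = [
        run_agent(agent_factory(), seed) for seed in seeds
    ]
    if not summaries:
        raise ValueError("need at least one seed")
    return {
        "runs": len(summaries),
        "success_rate": mean(1.0 if s["success"] else 0.0 for s in summaries),
        "mean_reward": mean(s["reward"] for s in summaries),
        "mean_steps": mean(s["steps"] for s in summaries),
    }


def toy_run(agent, seed):
    # Stand-in for the real simulation: the "agent" is just a predicate
    # that says whether it happens to succeed on this seed.
    ok = agent(seed)
    return {"success": ok, "reward": 1.0 if ok else -1.0, "steps": 2 if ok else 10}


def overfit_factory():
    # An agent tuned to a single seed: it only succeeds on seed 3.
    return lambda seed: seed == 3


def robust_factory():
    # An agent that succeeds regardless of seed.
    return lambda seed: True


single = score_across_seeds(toy_run, overfit_factory, [3])
many = score_across_seeds(toy_run, overfit_factory, range(10))
baseline = score_across_seeds(toy_run, robust_factory, range(10))
```

On seed 3 alone the overfit agent looks perfect; across seeds 0-9 it
succeeds once, while the robust agent succeeds every time. Passing
run_agent in as a parameter mirrors the decoupling described above.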

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 docs/playground.css                 |  81 +++++++++++++++
 docs/playground.html                |   8 ++
 docs/playground.js                  | 155 ++++++++++++++++++++++++++++
 docs/pyodide/pir_bundle.zip         | Bin 32024 -> 32601 bytes
 docs/pyodide_playground_strategy.md |  11 ++
 pir/viz/playground_trace.py         |  43 ++++++++
 tests/test_playground_trace.py      |  42 ++++++++
 7 files changed, 340 insertions(+)

diff --git a/docs/playground.css b/docs/playground.css
index 15a6093..0d396e8 100644
--- a/docs/playground.css
+++ b/docs/playground.css
@@ -198,6 +198,87 @@ button:disabled {
   width: 100%;
 }
 
+.challenge-panel {
+  border: 1px solid #d7dcd6;
+  border-radius: 10px;
+  margin-top: 12px;
+  overflow: hidden;
+}
+
+.challenge-head {
+  align-items: center;
+  background: #eef1ec;
+  display: flex;
+  flex-wrap: wrap;
+  gap: 8px;
+  justify-content: space-between;
+  padding: 8px 12px;
+}
+
+.challenge-head span {
+  color: var(--ink-soft, #5a6b67);
+  font-size: 0.82rem;
+}
+
+.challenge-board {
+  padding: 12px;
+}
+
+.challenge-empty {
+  color: var(--ink-soft, #5a6b67);
+  font-size: 0.85rem;
+}
+
+.challenge-verdict {
+  border-radius: 8px;
+  font-weight: 800;
+  margin-bottom: 10px;
+  padding: 8px 12px;
+}
+
+.challenge-verdict.win {
+  background: #e3f3dc;
+  color: #2f6b1f;
+}
+
+.challenge-verdict.lose {
+  background: #f7e6df;
+  color: #9a3d22;
+}
+
+.challenge-verdict.tie {
+  background: #eef1ec;
+  color: #4a5b57;
+}
+
+.challenge-table {
+  border-collapse: collapse;
+  font-size: 0.85rem;
+  width: 100%;
+}
+
+.challenge-table th,
+.challenge-table td {
+  border-bottom: 1px solid #e4e8e2;
+  padding: 5px 8px;
+  text-align: right;
+}
+
+.challenge-table th:first-child,
+.challenge-table td:first-child {
+  text-align: left;
+}
+
+.challenge-table thead th {
+  color: var(--ink-soft, #5a6b67);
+  font-weight: 700;
+}
+
+.challenge-table td.better {
+  background: #e3f3dc;
+  font-weight: 800;
+}
+
 .status-strip {
   display: grid;
   gap: 10px;
diff --git a/docs/playground.html b/docs/playground.html
index 0d3a6fa..d52e18e 100644
--- a/docs/playground.html
+++ b/docs/playground.html
@@ -124,6 +124,14 @@

Playground

+ +