feat(gepa): offline GEPA optimization for readable, comprehensible flowcharts#189
Open
ivanmkc wants to merge 24 commits into
Offline GEPA loop that evolves the brainstorm + generate prompts for termchart flowcharts, optimizing primarily for first-user comprehension (a fresh reader-LLM answers auto-generated reader-questions, judged for board-supported correctness) with geometry/readability as a secondary guardrail. Claude direct via the Anthropic SDK; standalone gepa; structured-content reader surface; smoke-first.
10 TDD tasks: scaffold+config, geometry bridge (tsx), Anthropic LLM wrappers, board serializer, dataset+question freeze, seed prompts+pipeline, three-part metric, GEPAAdapter, run+report, README. Verified gepa API (adapter protocol, EvaluationBatch, optimize signature, reflection callable) and Anthropic SDK patterns against source.
get_client() auto-selects AnthropicVertex when CLAUDE_CODE_USE_VERTEX or GEPA_USE_VERTEX is set (project/region from env, ADC); falls back to the direct API otherwise. Verified end-to-end: smoke run on Vertex (project adk-coding-agents, region global) completed, seed 0.83 -> best 0.85 on the ci-cd val task.
GEPA mutates the brainstorm/generate prompts and routinely inserts literal JSON
braces (e.g. {"direction","nodes",...}). pipeline.py used str.format(), which
treats those as fields and raises KeyError mid-run (crashed the full run at ~10/150
rollouts). Replace with a placeholder-only replace() that leaves other braces alone.
Regression test added.
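A minimal sketch of the placeholder-only substitution described above, assuming the prompts use `{name}`-style placeholders. The helper name `fill_prompt` and the sample template are hypothetical, not taken from pipeline.py. It substitutes in a single pass, so braces inside a substituted value are also left alone.

```python
import re


def fill_prompt(template: str, values: dict[str, str]) -> str:
    """Replace only the known {name} placeholders; leave every other brace untouched."""
    if not values:
        return template
    # One alternation over the exact placeholder tokens, e.g. \{topic\}|\{notes\}
    pattern = re.compile("|".join(re.escape("{" + name + "}") for name in values))
    # m.group(0) is "{name}"; strip the braces to look up the value.
    return pattern.sub(lambda m: values[m.group(0)[1:-1]], template)


# A prompt of the kind a mutation step might produce: one real placeholder
# plus literal JSON braces (hypothetical example text).
mutated = 'Topic: {topic}\nReturn JSON shaped like {"direction": "TB", "nodes": []}'

try:
    mutated.format(topic="ci-cd")  # str.format treats the JSON braces as fields
    format_failed = False
except (KeyError, ValueError, IndexError):
    format_failed = True

filled = fill_prompt(mutated, {"topic": "ci-cd"})
```

The single regex pass is a design choice of this sketch; chained `str.replace` calls would also work as long as no substituted value contains another placeholder token.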
… harder topics)
Diagnosis: comprehension was pinned at ~1.0 (lenient judge), leaving GEPA no gradient.
Fixes:
- judge scores STRICTLY and board-grounded (0/0.5/1; credit only specifics shown on the board, never world knowledge)
- question generator demands specific, detail-probing questions a sketchy board fails
- n_questions 5 -> 7
- +8 harder multi-branch topics (saga, oauth-pkce, raft, k8s-sched, tcp, 3ds, blue-green, rate-limiter) -> 20 total
Result: comprehension now spreads 0.71-1.00 (was 0.88-1.00); totals 0.69-0.86.
Render each board in a real browser (persistent viewer + Chromium service) and add:
- rendered geometry (Playwright DOM): node-pair overlaps, off-canvas nodes, min on-screen font
- visual comprehension + visual quality: one Claude-vision call (Opus 4.8 via Vertex) reads the screenshot, answers the frozen questions from pixels, and rates legibility/crowding
Metric now blends text+visual comprehension and heuristic+rendered geometry plus visual quality:
total = w_comp*mean(text,visual) + w_geom*mean(heuristic,rendered) + w_vq*visual_quality (defaults 0.5/0.3/0.2).
Verified end-to-end on Vertex (render -> DOM metrics -> vision). 34 tests.
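The blended total from this commit message can be written out as a short sketch. The function and argument names are hypothetical, and all scores are assumed to lie in [0, 1]; only the formula and the 0.5/0.3/0.2 defaults come from the text above.

```python
from statistics import fmean


def blended_total(
    text_comp: float,
    visual_comp: float,
    heuristic_geom: float,
    rendered_geom: float,
    visual_quality: float,
    w_comp: float = 0.5,
    w_geom: float = 0.3,
    w_vq: float = 0.2,
) -> float:
    """total = w_comp*mean(text,visual) + w_geom*mean(heuristic,rendered) + w_vq*visual_quality."""
    comprehension = fmean([text_comp, visual_comp])    # two estimates of comprehension
    geometry = fmean([heuristic_geom, rendered_geom])  # two estimates of geometry
    return w_comp * comprehension + w_geom * geometry + w_vq * visual_quality


# Illustrative (made-up) scores for one board.
example_total = blended_total(1.0, 0.8, 0.9, 0.7, 0.6)
```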
A --smoke or small --train run was generating questions for all 20 topics before starting; freeze just train∪val so partial runs aren't paying for unused topics.
…lback)
The harder question prompt occasionally makes the model emit a valid JSON array
followed by trailing prose, which crashed freeze_questions with JSONDecodeError
('Extra data') at the start of a run. Parse the first array via raw_decode (ignore
trailing), retry once on a clean miss, and fall back to generic questions so no
topic is ever question-less. Regression tests added.
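A sketch of the tolerant parse, retry, and fallback, using the standard library's `json.JSONDecoder.raw_decode`. The function names, the fallback questions, and the `generate` callable are hypothetical stand-ins, and "clean miss" is assumed here to mean that no array could be parsed.

```python
import json
from typing import Callable, Optional

# Hypothetical generic questions used when parsing keeps failing.
FALLBACK_QUESTIONS = [
    "What is the first step shown on the board?",
    "What is the final outcome shown on the board?",
]


def parse_first_array(text: str) -> Optional[list]:
    """Return the first JSON array found in text, ignoring anything after it."""
    decoder = json.JSONDecoder()
    start = text.find("[")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # reports where it ended, so trailing prose is not an error.
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, list):
            return value
        start = text.find("[", start + 1)
    return None


def questions_for(topic: str, generate: Callable[[str], str], retries: int = 1) -> list:
    """Parse tolerantly, retry on a miss, then fall back to generic questions."""
    for _ in range(retries + 1):
        parsed = parse_first_array(generate(topic))
        if parsed:
            return parsed
    return list(FALLBACK_QUESTIONS)
```

For example, `parse_first_array('["q1", "q2"]\nHope this helps!')` returns the two-element list, where `json.loads` on the same string raises `JSONDecodeError`.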
A one-line startup diagnostic (which module file is loaded + whether the tolerant parser is present) — caught a stale-bytecode issue where a run executed pre-fix code despite fixed source.
- Aggregate the three axes (comprehension/geometry/visual_quality) with an eps-floored WEIGHTED HARMONIC MEAN instead of a linear sum: a weak axis can't be bought back by a strong one (anti-compensation), while the eps floor keeps a single 0 from collapsing the score and re-saturating the metric. Within-axis (text+visual, heuristic+rendered) stays arithmetic mean (denoising two estimates of one thing).
- Adapter now returns per-objective scores; run uses frontier_type=hybrid so GEPA keeps the Pareto front over both val instances and objectives instead of only the collapsed scalar.
Verified on Vertex: base valset 0.415, GEPA iterates with the hybrid frontier, no errors. 37 tests.
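To make the anti-compensation effect concrete, here is a minimal sketch of an eps-floored weighted harmonic mean next to a linear sum. The function names, the `eps` value of 0.05, and the reuse of the 0.5/0.3/0.2 weights are assumptions for illustration; the actual values in the harness may differ.

```python
# Assumed weights (the 0.5/0.3/0.2 defaults quoted for the earlier linear metric).
WEIGHTS = {"comprehension": 0.5, "geometry": 0.3, "visual_quality": 0.2}


def weighted_harmonic_mean(scores: dict, weights: dict, eps: float = 0.05) -> float:
    """sum(w) / sum(w_i / max(x_i, eps)); eps keeps a zero axis from zeroing the total."""
    total_weight = sum(weights[axis] for axis in scores)
    denominator = sum(weights[axis] / max(scores[axis], eps) for axis in scores)
    return total_weight / denominator


def linear_sum(scores: dict, weights: dict) -> float:
    """Plain weighted sum, shown only for comparison."""
    return sum(weights[axis] * scores[axis] for axis in scores)


# Made-up scores: two strong axes and one weak axis.
lopsided = {"comprehension": 0.9, "geometry": 0.9, "visual_quality": 0.1}
harmonic_score = weighted_harmonic_mean(lopsided, WEIGHTS)
linear_score = linear_sum(lopsided, WEIGHTS)

# One axis exactly zero: the floor keeps the aggregate above zero.
with_zero = {"comprehension": 0.9, "geometry": 0.9, "visual_quality": 0.0}
floored_score = weighted_harmonic_mean(with_zero, WEIGHTS)
```

With these made-up numbers the linear sum is 0.74 while the harmonic aggregate is roughly 0.35, so the weak visual_quality axis dominates the result instead of being averaged away.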
…cross-eval
- corpus_gen: ~180 diverse use-cases across 20 domains (tolerant parse, dedupe, fallback)
- run.py: --topics / --val cap / reuse shared frozen_questions (comparable experiments)
- agg toggle (harmonic|linear) for ablation
- overnight.sh: robust orchestrator with shared frozen questions and 3 timeout-bounded + failure-isolated experiments (WHM+hybrid opus, linear+instance ablation, sonnet-gen ablation), then crosseval
- crosseval: re-score seed + each best on a held-out set under one canonical metric -> SUMMARY.md
…iment
Orphaned render services (from kill -9 skipping the graceful handler) hold the fixed render/viewer ports and make the next experiment fail with 'render service exited before ready'. cleanup() now also kills stray headless_shell/chrome processes and runs once at startup.
What
An offline GEPA prompt-optimization harness that evolves the brainstorm + generate prompts for termchart flowcharts. Lives entirely under scripts/experiments/gepa-flowchart/; no shipped package code is modified.

Spec: docs/superpowers/specs/2026-06-18-gepa-flowchart-optimization-design.md
Plan: docs/superpowers/plans/2026-06-18-gepa-flowchart-optimization.md

Objective (what it optimizes for)
Each rollout runs topic → brainstorm → generate flow JSON → validate, scored by a three-part metric:
- Validity (validateContent): invalid → 0, short-circuits before any LLM call.
- geometryReport (edges-over-nodes, overlaps, crossings, low-readability, density) via a tsx bridge.
- total = 0.6·comprehension + 0.4·geometry, gated by validity (all weights env-configurable).

Architecture
- gepa_flowchart/: config, llm (Anthropic SDK, Claude-direct), geometry_bridge (+ validate_flow.ts), render, dataset (12 topics + question freezing), seed_prompts, pipeline, metric, adapter (GEPAAdapter), report, run (CLI).
- gepa 0.1.1; reflection via a Claude (prompt) -> str callable.
- claude-opus-4-8 (env knob to claude-sonnet-4-6), reader claude-sonnet-4-6, judge + reflection claude-opus-4-8. No temperature/top_p/top_k.

Testing
Built TDD across 10 tasks (subagent-driven: fresh implementer + reviewer per task, plus a final whole-branch review). 24 unit tests pass, 1 live-smoke skipped (gated on ANTHROPIC_API_KEY). Two review-driven fixes landed (gallery-example path; judge-JSON degrades safely). Final review: no Critical/Important findings.