feat(gepa): offline GEPA optimization for readable, comprehensible flowcharts #189

Open
ivanmkc wants to merge 24 commits into master from feat/gepa-flowchart-optimization


ivanmkc commented Jun 18, 2026


What

An offline GEPA prompt-optimization harness that evolves the brainstorm + generate prompts for termchart flowcharts. Lives entirely under scripts/experiments/gepa-flowchart/ — no shipped package code is modified.

Spec: docs/superpowers/specs/2026-06-18-gepa-flowchart-optimization-design.md
Plan: docs/superpowers/plans/2026-06-18-gepa-flowchart-optimization.md

Objective (what it optimizes for)

Each rollout runs topic → brainstorm → generate flow JSON → validate, scored by a three-part metric:

  1. Structural validity — hard gate (reuses the repo's validateContent); invalid → 0, short-circuits before any LLM call.
  2. Geometry/readability (secondary, weight 0.4) — reuses the existing TS geometryReport (edges-over-nodes, overlaps, crossings, low-readability, density) via a tsx bridge.
  3. Comprehension / first-user-experience (PRIMARY, weight 0.6) — a fresh reader-LLM (sees the board for the first time) answers auto-generated, run-frozen reader-questions; a judge scores each for board-supported correctness and flags questions the board can't answer as missing detail/context. That gap feedback is what drives GEPA's reflection toward richer boards.

total = 0.6·comprehension + 0.4·geometry, gated by validity (all weights env-configurable).
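As a rough illustration of the gating described above, the following sketch combines the two weighted parts only after the validity gate passes. The function name and the use of callables are assumptions made for illustration; the harness's actual metric module may be organized differently.

```python
from typing import Callable


def score_rollout(
    is_valid: bool,
    comprehension_fn: Callable[[], float],
    geometry_fn: Callable[[], float],
    w_comp: float = 0.6,
    w_geom: float = 0.4,
) -> float:
    """Hypothetical sketch: validity is a hard gate, the rest is a weighted sum.

    The scorers are passed as callables so that an invalid board returns 0
    before any (potentially expensive) scoring call is made.
    """
    if not is_valid:
        return 0.0  # short-circuit: no reader or judge call happens
    return w_comp * comprehension_fn() + w_geom * geometry_fn()
```

Passing the scorers lazily is one way to guarantee the short-circuit; an eager version that takes plain floats would compute the same total for valid boards.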

Architecture

  • Python harness (gepa_flowchart/): config, llm (Anthropic SDK, Claude-direct), geometry_bridge (+ validate_flow.ts), render, dataset (12 topics + question freezing), seed_prompts, pipeline, metric, adapter (GEPAAdapter), report, run (CLI).
  • Standalone gepa 0.1.1; reflection via a Claude (prompt)->str callable.
  • Models: generation claude-opus-4-8 (env knob to claude-sonnet-4-6), reader claude-sonnet-4-6, judge + reflection claude-opus-4-8. No temperature/top_p/top_k.

Testing

Built TDD across 10 tasks (subagent-driven: fresh implementer + reviewer per task, plus a final whole-branch review). 24 unit tests pass, 1 live-smoke skipped (gated on ANTHROPIC_API_KEY). Two review-driven fixes landed (gallery-example path; judge-JSON degrades safely). Final review: no Critical/Important findings.

Note: this PR builds and unit-tests the harness; the live optimization loop has not been run here (no API key in the build env). Run it with python -m gepa_flowchart.run --smoke once ANTHROPIC_API_KEY is set. Independent of the viewer PRs (#185#187).

Offline GEPA loop that evolves the brainstorm + generate prompts for termchart
flowcharts, optimizing primarily for first-user comprehension (a fresh reader-LLM
answers auto-generated reader-questions, judged for board-supported correctness)
with geometry/readability as a secondary guardrail. Claude direct via the
Anthropic SDK; standalone gepa; structured-content reader surface; smoke-first.

10 TDD tasks: scaffold+config, geometry bridge (tsx), Anthropic LLM wrappers,
board serializer, dataset+question freeze, seed prompts+pipeline, three-part
metric, GEPAAdapter, run+report, README. Verified gepa API (adapter protocol,
EvaluationBatch, optimize signature, reflection callable) and Anthropic SDK
patterns against source.

get_client() auto-selects AnthropicVertex when CLAUDE_CODE_USE_VERTEX or
GEPA_USE_VERTEX is set (project/region from env, ADC); falls back to the direct
API otherwise. Verified end-to-end: smoke run on Vertex (project adk-coding-agents,
region global) completed, seed 0.83 -> best 0.85 on the ci-cd val task.
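
The selection rule in get_client() might look roughly like the sketch below. It only decides which backend to use and does not import the Anthropic SDK; `select_backend` is a hypothetical helper, and treating an empty value as "not set" is an assumption.

```python
import os
from typing import Mapping, Optional


def select_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical helper: return "vertex" or "direct" based on env flags.

    Assumption: a flag counts as set only when its value is non-empty.
    """
    env = os.environ if env is None else env
    if env.get("CLAUDE_CODE_USE_VERTEX") or env.get("GEPA_USE_VERTEX"):
        return "vertex"
    return "direct"
```

A real get_client() would presumably construct the SDK's AnthropicVertex client in the first case and the plain Anthropic client in the second; which env variables carry the project and region is not stated here, so they are left out.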

GEPA mutates the brainstorm/generate prompts and routinely inserts literal JSON
braces (e.g. {"direction","nodes",...}). pipeline.py used str.format(), which
treats those as fields and raises KeyError mid-run (crashed the full run at ~10/150
rollouts). Replace with a placeholder-only replace() that leaves other braces alone.
Regression test added.
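
A minimal sketch of placeholder-only substitution, assuming placeholders are written as {name}. `fill_prompt` and `TEMPLATE` are illustrative names, not the actual pipeline.py code, and the single-pass regex is one possible implementation of the idea.

```python
import re
from typing import Mapping


def fill_prompt(template: str, values: Mapping[str, str]) -> str:
    """Replace only the known {name} placeholders; leave other braces alone.

    A single regex pass is used so that text inserted for one placeholder
    is never re-scanned for another.
    """
    if not values:
        return template
    pattern = re.compile("|".join(re.escape("{" + k + "}") for k in values))
    # m.group(0) is e.g. "{topic}"; strip the braces to look up the value.
    return pattern.sub(lambda m: values[m.group(0)[1:-1]], template)


# A mutated prompt that contains literal JSON braces next to a placeholder.
TEMPLATE = 'Topic: {topic}\nReturn JSON like {"direction": "TD", "nodes": []}'
```

With this template, `TEMPLATE.format(topic="ci-cd")` raises KeyError because `"direction"` is parsed as a field name, while `fill_prompt(TEMPLATE, {"topic": "ci-cd"})` substitutes the topic and keeps the JSON example intact.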
… harder topics)

Diagnosis: comprehension was pinned at ~1.0 (lenient judge), leaving GEPA no
gradient. Fixes:
- judge scores STRICTLY and board-grounded (0/0.5/1; credit only specifics shown
  on the board, never world knowledge)
- question generator demands specific, detail-probing questions a sketchy board fails
- n_questions 5 -> 7
- +8 harder multi-branch topics (saga, oauth-pkce, raft, k8s-sched, tcp, 3ds,
  blue-green, rate-limiter) -> 20 total
Result: comprehension now spreads 0.71-1.00 (was 0.88-1.00); totals 0.69-0.86.

Render each board in a real browser (persistent viewer + Chromium service) and add:
- rendered geometry (Playwright DOM): node-pair overlaps, off-canvas nodes, min on-screen font
- visual comprehension + visual quality: one Claude-vision call (Opus 4.8 via Vertex) reads the
  screenshot, answers the frozen questions from pixels, and rates legibility/crowding
Metric now blends text+visual comprehension and heuristic+rendered geometry plus visual quality:
total = w_comp*mean(text,visual) + w_geom*mean(heuristic,rendered) + w_vq*visual_quality
(defaults 0.5/0.3/0.2). Verified end-to-end on Vertex (render -> DOM metrics -> vision). 34 tests.
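
The blended formula above can be written out as a small function. This is a sketch under the stated defaults; the parameter names are illustrative and all inputs are assumed to lie in [0, 1].

```python
from statistics import fmean


def blended_total(
    text_comp: float,
    visual_comp: float,
    heuristic_geom: float,
    rendered_geom: float,
    visual_quality: float,
    w_comp: float = 0.5,
    w_geom: float = 0.3,
    w_vq: float = 0.2,
) -> float:
    """Linear blend: average the two estimates per axis, then weight the axes."""
    comprehension = fmean([text_comp, visual_comp])
    geometry = fmean([heuristic_geom, rendered_geom])
    return w_comp * comprehension + w_geom * geometry + w_vq * visual_quality
```

For example, inputs of 1.0, 0.5, 0.8, 0.6 and 0.5 give 0.5·0.75 + 0.3·0.7 + 0.2·0.5 = 0.685.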

A --smoke or small --train run was generating questions for all 20 topics before
starting; freeze just train∪val so partial runs aren't paying for unused topics.
…lback)

The harder question prompt occasionally makes the model emit a valid JSON array
followed by trailing prose, which crashed freeze_questions with JSONDecodeError
('Extra data') at the start of a run. Parse the first array via raw_decode (ignore
trailing), retry once on a clean miss, and fall back to generic questions so no
topic is ever question-less. Regression tests added.
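
A sketch of the tolerant parse, retry, and fallback described above, using json.JSONDecoder.raw_decode from the standard library. `parse_first_array`, `questions_for`, the `ask` callable, and `GENERIC_QUESTIONS` are hypothetical names standing in for the real freeze_questions internals.

```python
import json
from typing import Callable, List, Optional

# Hypothetical fallback set; the real generic questions are not shown here.
GENERIC_QUESTIONS = [
    "What is the first step shown on the board?",
    "What is the final outcome shown on the board?",
]


def parse_first_array(text: str) -> Optional[list]:
    """Return the first JSON array found in text, ignoring anything after it."""
    decoder = json.JSONDecoder()
    start = text.find("[")
    while start != -1:
        try:
            # raw_decode stops at the end of the first value, so trailing
            # prose no longer triggers the 'Extra data' error.
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, list):
            return value
        start = text.find("[", start + 1)
    return None


def questions_for(topic: str, ask: Callable[[str], str], retries: int = 1) -> List[str]:
    """Ask the model, retry on a miss, then fall back to generic questions."""
    for _ in range(1 + retries):
        parsed = parse_first_array(ask(topic))
        if parsed:
            return [str(q) for q in parsed]
    return list(GENERIC_QUESTIONS)
```

Here `ask` stands in for whatever function calls the question-generator model; in this sketch an empty array is treated as a miss, which is an assumption.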

Add a one-line startup diagnostic (which module file is loaded, and whether the tolerant
parser is present). It caught a stale-bytecode issue where a run executed pre-fix
code despite the fixed source.

- Aggregate the three axes (comprehension/geometry/visual_quality) with an eps-floored
  WEIGHTED HARMONIC MEAN instead of a linear sum: a weak axis can't be bought back by a
  strong one (anti-compensation), while the eps floor keeps a single 0 from collapsing the
  score and re-saturating the metric. Within-axis (text+visual, heuristic+rendered) stays
  arithmetic mean (denoising two estimates of one thing).
- Adapter now returns per-objective scores; run uses frontier_type=hybrid so GEPA keeps the
  Pareto front over both val instances and objectives instead of only the collapsed scalar.
Verified on Vertex: base valset 0.415, GEPA iterates with the hybrid frontier, no errors. 37 tests.
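
To make the anti-compensation effect concrete, here is a sketch of an eps-floored weighted harmonic mean next to the linear sum. The eps value of 0.05 and the function names are assumptions; the commit does not state the floor actually used.

```python
from typing import Mapping


def weighted_harmonic_mean(
    scores: Mapping[str, float],
    weights: Mapping[str, float],
    eps: float = 0.05,  # assumed floor, not taken from the source
) -> float:
    """sum(w) / sum(w_i / max(x_i, eps)): low axes dominate, zeros are floored."""
    total_w = sum(weights[k] for k in scores)
    denom = sum(weights[k] / max(scores[k], eps) for k in scores)
    return total_w / denom


def linear_sum(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """The previous aggregation, kept for comparison (the 'linear' agg toggle)."""
    return sum(weights[k] * scores[k] for k in scores)


WEIGHTS = {"comprehension": 0.5, "geometry": 0.3, "visual_quality": 0.2}
LOPSIDED = {"comprehension": 1.0, "geometry": 1.0, "visual_quality": 0.2}
```

With these weights the lopsided board scores 0.84 under the linear sum but about 0.56 under the harmonic mean (1 / 1.8), so the weak visual_quality axis is not bought back by the two strong ones. If visual_quality were 0, the floor would give roughly 0.21 (1 / 4.8) rather than 0.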
…cross-eval

- corpus_gen: ~180 diverse use-cases across 20 domains (tolerant parse, dedupe, fallback)
- run.py: --topics / --val cap / reuse shared frozen_questions (comparable experiments)
- agg toggle (harmonic|linear) for ablation
- overnight.sh: robust orchestrator — shared frozen questions, 3 timeout-bounded + failure-isolated
  experiments (WHM+hybrid opus, linear+instance ablation, sonnet-gen ablation), then crosseval
- crosseval: re-score seed + each best on a held-out set under one canonical metric -> SUMMARY.md
…iment

Orphaned render services (from kill -9 skipping the graceful handler) hold the fixed
render/viewer ports and make the next experiment fail 'render service exited before ready'.
cleanup() now also kills stray headless_shell/chrome and runs once at startup.