feat(gepa): offline GEPA optimization for readable, comprehensible flowcharts by ivanmkc · Pull Request #189 · ivanmkc/termchart

ivanmkc · 2026-06-18T15:06:33Z

What

An offline GEPA prompt-optimization harness that evolves the brainstorm + generate prompts for termchart flowcharts. Lives entirely under scripts/experiments/gepa-flowchart/ — no shipped package code is modified.

Spec: docs/superpowers/specs/2026-06-18-gepa-flowchart-optimization-design.md
Plan: docs/superpowers/plans/2026-06-18-gepa-flowchart-optimization.md

Objective (what it optimizes for)

Each rollout runs topic → brainstorm → generate flow JSON → validate, scored by a three-part metric:

- Structural validity: hard gate (reuses the repo's validateContent); invalid → 0, short-circuits before any LLM call.
- Geometry/readability (secondary, weight 0.4): reuses the existing TS geometryReport (edges-over-nodes, overlaps, crossings, low-readability, density) via a tsx bridge.
- Comprehension / first-user-experience (PRIMARY, weight 0.6): a fresh reader-LLM (sees the board for the first time) answers auto-generated, run-frozen reader-questions; a judge scores each for board-supported correctness and flags questions the board can't answer as missing detail/context. That gap feedback is what drives GEPA's reflection toward richer boards.

total = 0.6·comprehension + 0.4·geometry, gated by validity (all weights env-configurable).
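A minimal sketch of the gated score described above. The function and parameter names are illustrative assumptions, not the harness's actual API; only the 0.6/0.4 weights and the validity gate come from this description.

```python
def total_score(valid: bool, comprehension: float, geometry: float,
                w_comp: float = 0.6, w_geom: float = 0.4) -> float:
    """Validity-gated weighted sum of the two soft axes (hypothetical helper)."""
    if not valid:
        # Hard gate: an invalid board scores 0 and nothing else is evaluated.
        return 0.0
    return w_comp * comprehension + w_geom * geometry


# A valid board with perfect comprehension but mediocre geometry.
example = total_score(True, comprehension=1.0, geometry=0.5)
```

With these defaults the example comes out at 0.6·1.0 + 0.4·0.5 = 0.8, while any invalid board scores 0 regardless of the other axes.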

Architecture

Python harness (gepa_flowchart/): config, llm (Anthropic SDK, Claude-direct), geometry_bridge (+ validate_flow.ts), render, dataset (12 topics + question freezing), seed_prompts, pipeline, metric, adapter (GEPAAdapter), report, run (CLI).
Standalone gepa 0.1.1; reflection via a Claude (prompt)->str callable.
Models: generation claude-opus-4-8 (env knob to claude-sonnet-4-6), reader claude-sonnet-4-6, judge + reflection claude-opus-4-8. No temperature/top_p/top_k.

Testing

Built TDD across 10 tasks (subagent-driven: fresh implementer + reviewer per task, plus a final whole-branch review). 24 unit tests pass, 1 live-smoke skipped (gated on ANTHROPIC_API_KEY). Two review-driven fixes landed (gallery-example path; judge-JSON degrades safely). Final review: no Critical/Important findings.

Note: this PR builds and unit-tests the harness; the live optimization loop has not been run here (no API key in the build env). Run it with python -m gepa_flowchart.run --smoke once ANTHROPIC_API_KEY is set. Independent of the viewer PRs (#185–#187).

Offline GEPA loop that evolves the brainstorm + generate prompts for termchart flowcharts, optimizing primarily for first-user comprehension (a fresh reader-LLM answers auto-generated reader-questions, judged for board-supported correctness) with geometry/readability as a secondary guardrail. Claude direct via the Anthropic SDK; standalone gepa; structured-content reader surface; smoke-first.

10 TDD tasks: scaffold+config, geometry bridge (tsx), Anthropic LLM wrappers, board serializer, dataset+question freeze, seed prompts+pipeline, three-part metric, GEPAAdapter, run+report, README. Verified gepa API (adapter protocol, EvaluationBatch, optimize signature, reflection callable) and Anthropic SDK patterns against source.

get_client() auto-selects AnthropicVertex when CLAUDE_CODE_USE_VERTEX or GEPA_USE_VERTEX is set (project/region from env, ADC); falls back to the direct API otherwise. Verified end-to-end: smoke run on Vertex (project adk-coding-agents, region global) completed, seed 0.83 -> best 0.85 on the ci-cd val task.

GEPA mutates the brainstorm/generate prompts and routinely inserts literal JSON braces (e.g. {"direction","nodes",...}). pipeline.py used str.format(), which treats those as fields and raises KeyError mid-run (crashed the full run at ~10/150 rollouts). Replace with a placeholder-only replace() that leaves other braces alone. Regression test added.
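The failure mode and the fix can be sketched as follows. The helper name fill_placeholders and the sample prompt are assumptions for illustration; the actual code in pipeline.py may differ.

```python
def fill_placeholders(template: str, **values: str) -> str:
    """Substitute only known {name} placeholders; leave all other braces alone."""
    out = template
    for name, value in values.items():
        out = out.replace("{" + name + "}", value)
    return out


# A mutated prompt that contains literal JSON braces next to a real placeholder.
prompt = 'Brainstorm about {topic}. Emit JSON shaped like {"direction","nodes"}.'

format_error = None
try:
    # str.format() treats the JSON braces as a replacement field and fails.
    prompt.format(topic="raft")
except (KeyError, ValueError, IndexError) as exc:
    format_error = type(exc).__name__

# The placeholder-only substitution keeps the JSON braces intact.
filled = fill_placeholders(prompt, topic="raft")
```

One caveat of this simple approach: because substitutions are applied in sequence, a value that itself contains a placeholder token such as {topic} could be substituted again by a later iteration.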

… harder topics)

Diagnosis: comprehension was pinned at ~1.0 (lenient judge), leaving GEPA no gradient.

Fixes:
- judge scores STRICTLY and board-grounded (0/0.5/1; credit only specifics shown on the board, never world knowledge)
- question generator demands specific, detail-probing questions a sketchy board fails
- n_questions 5 -> 7
- +8 harder multi-branch topics (saga, oauth-pkce, raft, k8s-sched, tcp, 3ds, blue-green, rate-limiter) -> 20 total

Result: comprehension now spreads 0.71-1.00 (was 0.88-1.00); totals 0.69-0.86.

Render each board in a real browser (persistent viewer + Chromium service) and add:
- rendered geometry (Playwright DOM): node-pair overlaps, off-canvas nodes, min on-screen font
- visual comprehension + visual quality: one Claude-vision call (Opus 4.8 via Vertex) reads the screenshot, answers the frozen questions from pixels, and rates legibility/crowding

Metric now blends text+visual comprehension and heuristic+rendered geometry plus visual quality:

total = w_comp*mean(text,visual) + w_geom*mean(heuristic,rendered) + w_vq*visual_quality (defaults 0.5/0.3/0.2).

Verified end-to-end on Vertex (render -> DOM metrics -> vision). 34 tests.
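A small sketch of the blended formula in this commit message, assuming every sub-score is already normalized to [0, 1]. The function name blended_total and its argument names are hypothetical; only the formula and the 0.5/0.3/0.2 defaults come from the text above.

```python
from statistics import mean


def blended_total(text_comp: float, visual_comp: float,
                  heuristic_geom: float, rendered_geom: float,
                  visual_quality: float,
                  w_comp: float = 0.5, w_geom: float = 0.3,
                  w_vq: float = 0.2) -> float:
    """Linear blend: average within each axis, then take a weighted sum."""
    comprehension = mean([text_comp, visual_comp])
    geometry = mean([heuristic_geom, rendered_geom])
    return w_comp * comprehension + w_geom * geometry + w_vq * visual_quality


# 0.5 * 0.75 + 0.3 * 0.7 + 0.2 * 0.5
score = blended_total(1.0, 0.5, 0.8, 0.6, 0.5)
```

For the sample inputs the three terms are 0.375, 0.21, and 0.1, so the total is 0.685.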

A --smoke or small --train run was generating questions for all 20 topics before starting; freeze just train∪val so partial runs aren't paying for unused topics.

…lback)

The harder question prompt occasionally makes the model emit a valid JSON array followed by trailing prose, which crashed freeze_questions with JSONDecodeError ('Extra data') at the start of a run. Parse the first array via raw_decode (ignore trailing), retry once on a clean miss, and fall back to generic questions so no topic is ever question-less. Regression tests added.
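The tolerant parse, single retry, and fallback can be sketched roughly as below. The names parse_first_array, questions_for, GENERIC_QUESTIONS, and the ask callable are assumptions for illustration, not the harness's real identifiers; json.JSONDecoder.raw_decode is the standard-library call the commit message refers to.

```python
import json
from typing import Callable, List, Optional

# Hypothetical generic fallback; the harness's real fallback questions are not shown here.
GENERIC_QUESTIONS = [
    "What is the first step shown on the board?",
    "What is the final outcome shown on the board?",
]


def parse_first_array(text: str) -> Optional[list]:
    """Return the first JSON array found in text, ignoring anything after it."""
    decoder = json.JSONDecoder()
    start = text.find("[")
    while start != -1:
        try:
            # raw_decode stops at the end of the first value, so trailing prose is fine.
            value, _end = decoder.raw_decode(text, start)
            return value
        except json.JSONDecodeError:
            start = text.find("[", start + 1)
    return None


def questions_for(topic: str, ask: Callable[[str], str], attempts: int = 2) -> List[str]:
    """Ask the model, retry once on a miss, then fall back to generic questions."""
    for _ in range(attempts):
        parsed = parse_first_array(ask(topic))
        if parsed and all(isinstance(q, str) for q in parsed):
            return parsed
    return list(GENERIC_QUESTIONS)


reply = '["Which node retries?", "What ends the flow?"]\nHope these help!'
questions = parse_first_array(reply)
```

Passing the model call in as a callable keeps the parsing logic testable without an API key.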

A one-line startup diagnostic (which module file is loaded + whether the tolerant parser is present) — caught a stale-bytecode issue where a run executed pre-fix code despite fixed source.

- Aggregate the three axes (comprehension/geometry/visual_quality) with an eps-floored WEIGHTED HARMONIC MEAN instead of a linear sum: a weak axis can't be bought back by a strong one (anti-compensation), while the eps floor keeps a single 0 from collapsing the score and re-saturating the metric. Within-axis (text+visual, heuristic+rendered) stays arithmetic mean (denoising two estimates of one thing).
- Adapter now returns per-objective scores; run uses frontier_type=hybrid so GEPA keeps the Pareto front over both val instances and objectives instead of only the collapsed scalar.

Verified on Vertex: base valset 0.415, GEPA iterates with the hybrid frontier, no errors. 37 tests.
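A sketch of an eps-floored weighted harmonic mean over the three axes, to show the anti-compensation effect. The function name, the eps value of 0.05, and the reuse of the 0.5/0.3/0.2 weights are assumptions; the commit message does not state the actual eps or weights used with this aggregator.

```python
from typing import Dict


def weighted_harmonic_mean(scores: Dict[str, float],
                           weights: Dict[str, float],
                           eps: float = 0.05) -> float:
    """Weighted harmonic mean with each score floored at eps (assumed value)."""
    total_weight = sum(weights[axis] for axis in scores)
    # Flooring at eps avoids division by zero and keeps one 0 from zeroing the total.
    denom = sum(weights[axis] / max(scores[axis], eps) for axis in scores)
    return total_weight / denom


weights = {"comprehension": 0.5, "geometry": 0.3, "visual_quality": 0.2}

balanced = weighted_harmonic_mean(
    {"comprehension": 0.8, "geometry": 0.8, "visual_quality": 0.8}, weights)
lopsided = weighted_harmonic_mean(
    {"comprehension": 1.0, "geometry": 0.2, "visual_quality": 1.0}, weights)
linear_lopsided = 0.5 * 1.0 + 0.3 * 0.2 + 0.2 * 1.0
```

The lopsided board scores 0.76 under the linear sum but roughly 0.45 under the harmonic mean, so a weak geometry axis drags the total down much harder than a linear blend would.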

…cross-eval

- corpus_gen: ~180 diverse use-cases across 20 domains (tolerant parse, dedupe, fallback)
- run.py: --topics / --val cap / reuse shared frozen_questions (comparable experiments)
- agg toggle (harmonic|linear) for ablation
- overnight.sh: robust orchestrator; shared frozen questions, 3 timeout-bounded + failure-isolated experiments (WHM+hybrid opus, linear+instance ablation, sonnet-gen ablation), then crosseval
- crosseval: re-score seed + each best on a held-out set under one canonical metric -> SUMMARY.md

…iment

Orphaned render services (from kill -9 skipping the graceful handler) hold the fixed render/viewer ports and make the next experiment fail with 'render service exited before ready'. cleanup() now also kills stray headless_shell/chrome and runs once at startup.

ivanmkc-google added 24 commits June 18, 2026 04:26

feat(gepa): scaffold gepa-flowchart project + config

e6d8c16

feat(gepa): geometry bridge reusing TS validateContent + geometryReport

abb5f96

feat(gepa): Anthropic SDK wrappers + reflection callable

8a508bb

feat(gepa): board-to-text serializer + JSON extractor

1b16860

feat(gepa): topic dataset + reader-question generation/freezing

30bb7e9

feat(gepa): seed brainstorm/generate prompts + pipeline

6233d5a

fix(gepa): load real gallery example in SKILL_CONTEXT (parents[4])

82d48ac

feat(gepa): three-part metric (validity gate, geometry, comprehension)

c69dc8b

fix(gepa): judge parse degrades safely on malformed JSON

34cf94c

feat(gepa): GEPAAdapter (evaluate + reflective dataset)

28fdcc0

feat(gepa): run entrypoint, report, and smoke test

cce5d9d

docs(gepa): README with setup, run, and cost notes

03fae2b

fix(gepa): freeze reader-questions only for used topics (not all 20)

2e411ea

chore(gepa): log dataset module + parse_str_array at startup

830e68d

ivanmkc wants to merge 24 commits into master from feat/gepa-flowchart-optimization

2 participants